facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

TorchAgent compatibility with HogwildWorld #1269

Closed bhancock8 closed 5 years ago

bhancock8 commented 5 years ago

I'm having trouble using a TorchAgent with HogwildWorld. It looks to me like the HogwildWorld is used when --numthreads > 1, but adding "--numthreads 2" to the example command on the main repo README:

python examples/train_model.py -t babi:task10k:1 -m seq2seq -mf /tmp/model_s2s -bs 32 -vtim 30 -vcut 0.95 --numthreads 2

yields the following error:

[creating task(s): babi:task10k:1]
[loading fbdialog data:/private/home/bhancock/ParlAI/data/bAbI/tasks_1-20_v1-2/en-valid-10k-nosf/qa1_train.txt]
[ thread 0 initialized ]
[ thread 1 initialized ]
[ training... ]
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingAllocator.cpp line=507 error=3 : initialization error
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingAllocator.cpp line=507 error=3 : initialization error
Process HogwildProcess-1:
Process HogwildProcess-2:
Traceback (most recent call last):
  File "/private/home/bhancock/.conda/envs/bjh_fair_env/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/private/home/bhancock/ParlAI/parlai/core/worlds.py", line 772, in run
    world.parley()
  File "/private/home/bhancock/ParlAI/parlai/core/worlds.py", line 651, in parley
    batch_act = self.batch_act(agent_idx, batch_observations[agent_idx])
  File "/private/home/bhancock/ParlAI/parlai/core/worlds.py", line 624, in batch_act
    batch_actions = a.batch_act(batch_observation)
  File "/private/home/bhancock/ParlAI/parlai/core/torch_agent.py", line 896, in batch_act
    batch = self.batchify(observations)
  File "/private/home/bhancock/ParlAI/parlai/agents/seq2seq/seq2seq.py", line 321, in batchify
    return super().batchify(*args, **kwargs)
  File "/private/home/bhancock/ParlAI/parlai/core/torch_agent.py", line 641, in batchify
    xs, x_lens = padded_tensor(_xs, self.NULL_IDX, self.use_cuda)
  File "/private/home/bhancock/ParlAI/parlai/core/utils.py", line 994, in padded_tensor
    output = output.cuda()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingAllocator.cpp:507
Traceback (most recent call last):
  File "/private/home/bhancock/.conda/envs/bjh_fair_env/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/private/home/bhancock/ParlAI/parlai/core/worlds.py", line 772, in run
    world.parley()
  File "/private/home/bhancock/ParlAI/parlai/core/worlds.py", line 651, in parley
    batch_act = self.batch_act(agent_idx, batch_observations[agent_idx])
  File "/private/home/bhancock/ParlAI/parlai/core/worlds.py", line 624, in batch_act
    batch_actions = a.batch_act(batch_observation)
  File "/private/home/bhancock/ParlAI/parlai/core/torch_agent.py", line 896, in batch_act
    batch = self.batchify(observations)
  File "/private/home/bhancock/ParlAI/parlai/agents/seq2seq/seq2seq.py", line 321, in batchify
    return super().batchify(*args, **kwargs)
  File "/private/home/bhancock/ParlAI/parlai/core/torch_agent.py", line 641, in batchify
    xs, x_lens = padded_tensor(_xs, self.NULL_IDX, self.use_cuda)
  File "/private/home/bhancock/ParlAI/parlai/core/utils.py", line 994, in padded_tensor
    output = output.cuda()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingAllocator.cpp:507
bhancock8 commented 5 years ago

Not a big deal. I'll figure this out internally.

stephenroller commented 5 years ago

For posterity, the solution is to add --no-cuda to the call. Hogwild requires CPU-only for the moment.
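For reference, that means the failing command from the original report works once the flag is appended (a sketch of the workaround described above; assumes ParlAI and the bAbI data are set up as in the README):

```shell
# Same training command as before, with CUDA disabled so the Hogwild
# worker processes stay on CPU and avoid the CUDA initialization error.
python examples/train_model.py -t babi:task10k:1 -m seq2seq -mf /tmp/model_s2s \
    -bs 32 -vtim 30 -vcut 0.95 --numthreads 2 --no-cuda
```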

alexholdenmiller commented 5 years ago

yeah, basically pytorch requires the multiprocessing start method to be changed, and you have to carefully control where this happens, for example by adding this to the front of examples/train_model.py:

if __name__ == '__main__':
    import torch
    from torch import multiprocessing
    # must run before any processes are created or CUDA is touched
    multiprocessing.set_start_method('spawn')

that's worked in the past, although now there's a bug, and I don't understand why this would get flagged: raise RuntimeError("Cannot pickle CUDA storage; try pickling a CUDA tensor instead")

However, the other start methods can be extremely slow, so I wouldn't necessarily want to do cuda multithreaded anyway.
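For readers unfamiliar with start methods, the fix above can be sketched with plain multiprocessing (a minimal standalone example, no torch required; get_context is the per-context equivalent of set_start_method). The point is that 'spawn' starts each child in a fresh interpreter instead of forking a copy of the parent, so children don't inherit state such as an already-initialized CUDA context:

```python
# Minimal sketch of forcing the 'spawn' start method, using only the
# stdlib. In the real fix this would go at the top of the training
# script, before any worker processes are created.
import multiprocessing as mp

def worker(q):
    # runs in a freshly spawned interpreter, not a forked copy of the parent
    q.put('hello from spawned process')

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # avoids inheriting parent state via fork
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```

The trade-off noted above is real: 'spawn' re-imports the main module in every child, which is much slower to start than 'fork'.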