YapengTian / TDAN-VSR-CVPR-2020

TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution, CVPR 2020
MIT License

Training error when batch size is 64 #17

Closed YoungJoongUNC closed 4 years ago

YoungJoongUNC commented 4 years ago

Hello.

When I run train.py with batch size 1, it works fine. But when I use a batch size of 64 (the default value), it throws the error below. May I ask how I could fix this?

THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=88 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 102, in <module>
    main()
  File "train.py", line 100, in main
    solver.train(train_dataset, val_dataset)
  File "/playpen/youngjoong/code/TDAN-VSR/solver.py", line 412, in train
    self._epoch_step(train_dataset, epoch)
  File "/playpen/youngjoong/code/TDAN-VSR/solver.py", line 228, in _epoch_step
    output_batch, lrs = self.model(input_batch)
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 69, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 80, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 38, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 31, in scatter
    return scatter_map(inputs)
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 74, in forward
    outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/cuda/comm.py", line 188, in scatter
    with torch.cuda.device(device), torch.cuda.stream(stream):
  File "/home/youngjoong/anaconda3/envs/deform/lib/python3.6/site-packages/torch/cuda/__init__.py", line 209, in __enter__
    torch._C._cuda_setDevice(self.idx)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:88
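For context on this traceback: "invalid device ordinal" is raised when the process tries to enter a CUDA device index that does not exist on the machine. DataParallel's scatter loops over its `device_ids` list, so if that list names more GPUs than are actually visible, the `torch.cuda.device(device)` context in `comm.scatter` fails exactly as shown above. A minimal pure-Python sketch of a guard one could apply to the id list before constructing DataParallel (`clamp_device_ids` is a hypothetical helper, not part of the TDAN code):

```python
def clamp_device_ids(requested_ids, num_visible_gpus):
    """Drop device ids that do not exist on this machine.

    "invalid device ordinal" occurs when torch.cuda.device(i) is entered
    with i >= the number of visible GPUs, which is what DataParallel's
    scatter does when device_ids is longer than the real GPU count.
    In real code, num_visible_gpus would come from torch.cuda.device_count().
    """
    valid = [i for i in requested_ids if i < num_visible_gpus]
    if not valid:
        raise RuntimeError("no visible GPU matches the requested device ids")
    return valid

# Example: a script configured for 8 GPUs running on a 2-GPU machine.
print(clamp_device_ids(list(range(8)), 2))  # [0, 1]
```

Equivalently, setting `CUDA_VISIBLE_DEVICES` to match the hardware and passing `device_ids=list(range(torch.cuda.device_count()))` avoids the mismatch.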
YapengTian commented 4 years ago

Is this a memory issue? A smaller batch size or more GPUs might help.

YoungJoongUNC commented 4 years ago

If I run nvidia-smi, there is plenty of free memory. If I use batch size 1 with 1 GPU, it works fine. But if I use batch size 2 with 2 GPUs, it fails with the error above. Have you also encountered this? Did you use 8 GPUs with a batch size of 64?

YoungJoongUNC commented 4 years ago

May I ask what GPU model you used, and how many GPUs and what batch size you used while training?

YapengTian commented 4 years ago

For the uploaded model, I trained it with 8 GPUs and a batch size of 64.
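That setup works out to 8 samples per GPU, since DataParallel splits the batch evenly across devices. On a machine with fewer GPUs, one way to keep the per-GPU load identical is to scale the total batch size to the available GPU count. A small sketch of that arithmetic (`scaled_batch_size` is illustrative; in practice the GPU count would come from `torch.cuda.device_count()`):

```python
def scaled_batch_size(per_gpu_batch, num_gpus):
    """Total batch size that keeps the per-GPU share constant.

    Original setup: 8 samples/GPU * 8 GPUs = batch size 64.
    """
    return per_gpu_batch * num_gpus

print(scaled_batch_size(8, 8))  # 64  (the author's configuration)
print(scaled_batch_size(8, 2))  # 16  (same per-GPU load on 2 GPUs)
```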