Training Error - Githubissues

fuqianya commented 2 years ago

Hi, authors

I followed the README to train GNR, but got the following error:

INFO:root:train data size: 7200
INFO:root:test data size: 10
INFO:root:render data size: 10
INFO:root:Using Network: gnr
INFO:root:use Data Parallel...
  0%|                                                                 | 0/1000 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "apps/run_genebody.py", line 333, in <module>
    train(opt)
  File "apps/run_genebody.py", line 176, in train
    loss_dict = net(data, train_shape=train_shape)
  File "/home/fuqian/Downloads/Software/anaconda3/envs/2022-GNR-Arxiv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fuqian/Downloads/Software/anaconda3/envs/2022-GNR-Arxiv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/fuqian/Downloads/Software/anaconda3/envs/2022-GNR-Arxiv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/fuqian/Downloads/Software/anaconda3/envs/2022-GNR-Arxiv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/fuqian/Downloads/Software/anaconda3/envs/2022-GNR-Arxiv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/fuqian/Downloads/Software/anaconda3/envs/2022-GNR-Arxiv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fuqian/Documents/Research/View-Synthesis/2022-GNR-Arxiv/gnr/lib/model/GNR.py", line 62, in forward
    error = self.nerf_renderer.render(**data)
  File "/home/fuqian/Documents/Research/View-Synthesis/2022-GNR-Arxiv/gnr/lib/model/NeRFRenderer.py", line 419, in render
    calibs[:self.num_views], smpl, mesh_param, scan, persps)
  File "/home/fuqian/Documents/Research/View-Synthesis/2022-GNR-Arxiv/gnr/lib/model/NeRFRenderer.py", line 302, in render_rays
    inside, smpl_vis, scan_vis = self.inside_pts_vh(pts, masks, smpl, calibs, persps)
  File "/home/fuqian/Documents/Research/View-Synthesis/2022-GNR-Arxiv/gnr/lib/model/NeRFRenderer.py", line 376, in inside_pts_vh
    inside = index(masks, xy, 'nearest')
  File "/home/fuqian/Documents/Research/View-Synthesis/2022-GNR-Arxiv/gnr/lib/geometry.py", line 77, in index
    samples = torch.nn.functional.grid_sample(feat, uv, mode=mode)
  File "/home/fuqian/Downloads/Software/anaconda3/envs/2022-GNR-Arxiv/lib/python3.6/site-packages/torch/nn/functional.py", line 2717, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum)
RuntimeError: grid_sampler(): expected grid and input to have same batch size, but got input with sizes [1, 1, 512, 512] and grid with sizes [2, 262144, 1, 2]
epoch 0/1000:   0%|                                                   | 0/7200 [00:06<?, ?it/s]

I only use one subject for training, so the train data size is 7200. But this error seems unrelated to my training data. Can you train GNR with the code in this repo using the following command?

python apps/run_genebody.py --config configs/train.txt --dataroot ${GENEBODY_ROOT}
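
For readers hitting the same message: for 4-D tensors, grid_sample expects the input as (N, C, H_in, W_in) and the grid as (N, H_out, W_out, 2), with the same N in both. The sketch below only restates that batch check on plain shape tuples, using the sizes from the traceback above. The helper name grid_sample_batch_mismatch is hypothetical and is not part of PyTorch or the GNR repo; it does not explain why the two batch sizes diverge inside GNR.

```python
def grid_sample_batch_mismatch(input_shape, grid_shape):
    """Return None if the batch sizes agree, else a short description.

    Hypothetical debugging helper (an assumption for illustration only).
    Covers the 4-D case: input (N, C, H_in, W_in), grid (N, H_out, W_out, 2).
    """
    if len(input_shape) != 4 or len(grid_shape) != 4:
        raise ValueError("this sketch only covers the 4-D case")
    if grid_shape[-1] != 2:
        raise ValueError("the last grid dimension must hold (x, y) pairs")
    if input_shape[0] == grid_shape[0]:
        return None
    return "input batch {} != grid batch {}".format(input_shape[0], grid_shape[0])


# Shapes copied from the RuntimeError in this issue.
print(grid_sample_batch_mismatch((1, 1, 512, 512), (2, 262144, 1, 2)))
# A matching pair, for comparison.
print(grid_sample_batch_mismatch((2, 1, 512, 512), (2, 262144, 1, 2)))
```

Printing masks.shape and xy.shape just before the index(...) call in inside_pts_vh would show the same two shapes on a real run.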

generalizable-neural-performer commented 2 years ago

How many GPUs do you have in your machine? If you have multiple devices, it is recommended to expose only 1 GPU to the script rather than let it use data parallel, e.g.:

CUDA_VISIBLE_DEVICES=0 python apps/run_genebody.py --config configs/train.txt --dataroot ${GENEBODY_ROOT}
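
The same restriction as the command above can also be sketched from inside Python, on the assumption that the variable is set before CUDA is initialized in the process (safest before torch is imported). The helper restrict_visible_gpus is a hypothetical name for illustration, not something the GNR repo provides.

```python
import os


def restrict_visible_gpus(device_ids, environ=None):
    """Set CUDA_VISIBLE_DEVICES to the given device indices.

    Hypothetical helper (an assumption, not part of GNR or PyTorch).
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    value = ",".join(str(int(i)) for i in device_ids)
    env["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only GPU 0, mirroring CUDA_VISIBLE_DEVICES=0 on the command line.
# This must run before any CUDA work starts in the process.
restrict_visible_gpus([0])
```

Setting the variable on the command line, as shown above, is the simpler option because it cannot run too late.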

Another option is to use distributed data parallel on a multi-GPU machine or across multiple machines. You can try scripts/train_ddp.sh on your machine.

fuqianya commented 2 years ago

Thanks for your quick reply! Using only 1 GPU fixes this bug. Thanks ~

generalizable-neural-performer / gnr

Training Error #8