cshizhe / onav_rim

MIT License

Does not work when using multiple GPUs #4

Open goodstudent9 opened 6 months ago

goodstudent9 commented 6 months ago

Hello, fabulous idea! This is also a great example of how to use Habitat for research. However, it seems that your code is designed to train the model on a single GPU. (The following is from your training script.)

#!/bin/bash
#SBATCH --job-name=onav_bc
#SBATCH -p gpu_p13               # Name of the partition 
#SBATCH -C v100-32g 
#SBATCH --qos=qos_gpu-t4
#SBATCH --nodes=1                # Number of nodes
#SBATCH --ntasks-per-node=1      # Number of GPUs on a node
#SBATCH --gres gpu:1             # The same
#SBATCH -c 10                    # Number of workers per GPU
#SBATCH --hint=nomultithread     # Logical cores
#SBATCH --time 50:00:00          # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm_logs/%j.out # Output file name
#SBATCH --error=slurm_logs/%j.out  # Error file name

When I train the model on 2 GPUs, I get the error below. The first dimension of batch['demonstration'] is always twice the batch size: when I set the batch size to 32, the first dimension of batch['demonstration'] is 64, while the other entries in the batch have 32. Maybe something is wrong in the dataloader? I will look into it. Maybe we can work on it together!

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/envs/onav1/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/onav1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/onav_rim/offline_bc/models/onav_imap_models.py", line 165, in forward
    inputs = self.encode_step_obs(batch, step_embeddings=stepid_embeds)
  File "/home/onav_rim/offline_bc/models/onav_base.py", line 520, in encode_step_obs
    return self.encode_step_obs_concat(batch, **kwargs)
  File "/home/onav_rim/offline_bc/models/onav_base.py", line 441, in encode_step_obs_concat
    prev_actions[:, 1:] = batch['demonstration'][:, :-1]
RuntimeError: The expanded size of the tensor (32) must match the existing size (16) at non-singleton dimension 0.  Target sizes: [32, 190].  Tensor sizes: [16, 190]
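
The traceback passes through parallel_apply.py, which is the torch.nn.DataParallel path, and DataParallel splits tensor inputs along dimension 0 so each replica only sees a chunk of the batch. To narrow down which entries end up with a different first dimension, a small check like the one below could be called on the batch just before the failing line. This is only a debugging sketch, not part of the repository: the helper names leading_dims and find_mismatched_keys are made up for illustration, and the key 'rgb_features' in the usage example is hypothetical. It relies only on each entry exposing .shape or len(), so it should work with tensors as well as plain lists.

```python
from collections import defaultdict


def leading_dims(batch):
    """Map each batch key to the size of its first dimension.

    Hypothetical helper: entries without a usable first dimension
    (scalars, 0-d tensors, None) are skipped.
    """
    dims = {}
    for key, value in batch.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            # Tensor-like or array-like object.
            if len(shape) > 0:
                dims[key] = int(shape[0])
            continue
        if hasattr(value, "__len__"):
            # Plain Python containers such as lists or tuples.
            dims[key] = len(value)
    return dims


def find_mismatched_keys(batch):
    """Group keys by leading dimension; return {} when they all agree."""
    groups = defaultdict(list)
    for key, dim in leading_dims(batch).items():
        groups[dim].append(key)
    return dict(groups) if len(groups) > 1 else {}


if __name__ == "__main__":
    # Toy batch built from lists; sizes mirror the error message above.
    toy_batch = {
        "demonstration": [[0] * 190 for _ in range(16)],
        "rgb_features": [[0.0] * 4 for _ in range(32)],  # hypothetical key
    }
    print(find_mismatched_keys(toy_batch))
```

If the report is empty inside forward but the assignment still fails, the mismatch probably comes from a tensor created inside the model (for example from a batch size that was not split across replicas) rather than from the dataloader.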