Hello,
Fabulous idea! This is also great work showing how to use habitat for research.
However, it seems your code is designed to train the model on a single GPU. (The following is from your training script:)
```bash
#!/bin/bash
#SBATCH --job-name=onav_bc
#SBATCH -p gpu_p13                 # Name of the partition
#SBATCH -C v100-32g
#SBATCH --qos=qos_gpu-t4
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks-per-node=1        # Number of tasks (one per GPU) on a node
#SBATCH --gres gpu:1               # Number of GPUs on a node
#SBATCH -c 10                      # Number of workers per GPU
#SBATCH --hint=nomultithread       # Logical cores
#SBATCH --time 50:00:00            # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm_logs/%j.out # Output file name
#SBATCH --error=slurm_logs/%j.out  # Error file name
```
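For reference, this is roughly how I request 2 GPUs under the same setup (a sketch, assuming the same partition/QoS; your cluster's Slurm configuration may differ):

```shell
#SBATCH --nodes=1           # Single node
#SBATCH --ntasks-per-node=2 # One task per GPU
#SBATCH --gres gpu:2        # Two GPUs on the node
#SBATCH -c 10               # Workers per GPU
```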
When I train the model on 2 GPUs, I get the error below. The tensor in batch['demonstration'] always has double the batch size: when I set the batch size to 32, the first dimension of batch['demonstration'] is 64, while the other entries in batch are 32. Maybe something is wrong in the dataloader?
I will look into it. Maybe we can work on it together!
```
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/envs/onav1/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/onav1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/onav_rim/offline_bc/models/onav_imap_models.py", line 165, in forward
    inputs = self.encode_step_obs(batch, step_embeddings=stepid_embeds)
  File "/home/onav_rim/offline_bc/models/onav_base.py", line 520, in encode_step_obs
    return self.encode_step_obs_concat(batch, **kwargs)
  File "/home/onav_rim/offline_bc/models/onav_base.py", line 441, in encode_step_obs_concat
    prev_actions[:, 1:] = batch['demonstration'][:, :-1]
RuntimeError: The expanded size of the tensor (32) must match the existing size (16) at non-singleton dimension 0. Target sizes: [32, 190]. Tensor sizes: [16, 190]
```
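One possible explanation (just a guess on my side): torch.nn.DataParallel scatters each tensor in the input dict along dimension 0 across replicas, so with 2 GPUs a batch of 32 becomes 16 per replica. If the size that prev_actions is built from comes from somewhere that is not scattered (or if one entry in the batch is twice the size to begin with), the first dimensions no longer match inside each replica. The helper below, scatter_like_dataparallel, is a simplified stand-in I wrote to illustrate the splitting behavior, not the actual torch implementation:

```python
import torch

def scatter_like_dataparallel(batch, num_replicas=2):
    """Mimic how nn.DataParallel scatters a batch dict: tensors are
    chunked along dim 0; non-tensor entries are copied to every replica."""
    replicas = [dict() for _ in range(num_replicas)]
    for key, value in batch.items():
        if torch.is_tensor(value):
            # Each replica gets a slice of the batch dimension.
            for replica, chunk in zip(replicas, value.chunk(num_replicas, dim=0)):
                replica[key] = chunk
        else:
            # Non-tensor values (ints, lists, ...) are NOT split.
            for replica in replicas:
                replica[key] = value
    return replicas

# A batch of 32 with a demonstration entry shaped like in the traceback.
batch = {
    'rgb': torch.zeros(32, 3, 224, 224),
    'demonstration': torch.zeros(32, 190, dtype=torch.long),
}
rep0, rep1 = scatter_like_dataparallel(batch)
print(rep0['rgb'].shape[0], rep0['demonstration'].shape[0])  # prints: 16 16
```

So inside replica 0, batch['demonstration'] has first dimension 16, which matches the "Tensor sizes: [16, 190]" in the error; the target size of 32 suggests prev_actions was allocated with the full (unscattered) batch size.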