geopavlakos / hamer

HaMeR: Reconstructing Hands in 3D with Transformers
https://geopavlakos.github.io/hamer/
MIT License

How to train HAMER with slurm? #31

Closed shenwenhao01 closed 5 months ago

shenwenhao01 commented 5 months ago

I am training HAMER on a cluster using the following command:

srun -p ${PARTITION} \
    -f pt \
    --workspace-id ${WORK_SPACE} \
    --resource ${RESOURCE} \
    --job-name ${JOB_NAME} \
    --container-image ${CONTAINER_IMAGE} \
    sudo -E -u shenwenhao bash -c "cd /mnt/AFS_shenwenhao/hamer && \
        /mnt/AFS_shenwenhao/.conda/envs/py10/bin/python train.py \
            exp_name=hamer_finetune_arctic \
            data=mix_all \
            experiment=hamer_vit_transformer \
            launcher=local"

However, I found that the model checkpoints are not saved and the TensorBoard logs are also empty.

I noticed that there are some config files related to training with slurm, but I ran into trouble when trying to train HAMER with it. Could you please add some documentation showing how to train HAMER with slurm?

shenwenhao01 commented 5 months ago

There are also some problems when I use DDP without slurm. I simply use the command:

python train.py exp_name=hamer data=mix_all experiment=hamer_vit_transformer launcher=local

to train with ddp, but it reports:

torch.multiprocessing.spawn.ProcessRaisedException: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with strategy=DDPStrategy(find_unused_parameters=True).

I changed the strategy in ddp.yaml from 'ddp' to 'ddp_find_unused_parameters_true', but I do not know whether this will affect the training.
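
For reference, the edit in ddp.yaml is essentially the single strategy line below; the surrounding file layout may differ from what I show here:

# ddp.yaml (trainer strategy config)
# before: strategy: ddp
# after: allow DDP to handle parameters that do not contribute to the loss
# returned by training_step (equivalent to find_unused_parameters=True)
strategy: ddp_find_unused_parameters_true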

geopavlakos commented 5 months ago

Regarding slurm, one way to launch training is with minor modifications to the command we provide, i.e.:

python train.py -m hydra/launcher=submitit_slurm exp_name=hamer data=mix_all experiment=hamer_vit_transformer trainer=gpu launcher=slurm
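
If you need to adapt this to your cluster, the submitit launcher settings can be overridden on the command line. A rough example is below; the partition name and resource numbers are placeholders, and the exact set of hydra.launcher keys depends on your installed hydra-submitit-launcher version:

python train.py -m hydra/launcher=submitit_slurm \
    hydra.launcher.partition=${PARTITION} \
    hydra.launcher.nodes=1 \
    hydra.launcher.gpus_per_node=8 \
    hydra.launcher.tasks_per_node=8 \
    hydra.launcher.cpus_per_task=8 \
    hydra.launcher.timeout_min=4320 \
    exp_name=hamer data=mix_all experiment=hamer_vit_transformer \
    trainer=gpu launcher=slurm

Note that the -m (multirun) flag is needed so that Hydra submits the job through submitit instead of running it in the current shell.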

Regarding the trainer strategy for multi-GPU training, this is also what we used, so it shouldn't affect training in a negative way.