ashkamath / mdetr


Issues with training on a single node #101


andics commented 11 months ago

Hi all,

Essentially, I spent the day trying to figure out why the code exits with the error message below when run with 8 GPUs on one node. This is the command I ran:

```
python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir
```

```
TERM_THREADLIMIT: job killed after reaching LSF thread limit. Exited with exit code 1.

Resource usage summary:

CPU time :                                   1294.38 sec.
Max Memory :                                 66465 MB
Average Memory :                             3062.34 MB
Total Requested Memory :                     256000.00 MB
Delta Memory :                               189535.00 MB
Max Swap :                                   -
Max Processes :                              35
Max Threads :                                2482
Run time :                                   482 sec.
Turnaround time :                            526 sec.

The output (if any) is above this job summary.
```
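
Note how the summary reports 2482 threads against only 35 processes. A back-of-the-envelope fan-out shows where they can come from; a minimal sketch, assuming a hypothetical per-process DataLoader worker count (`num_workers = 5` below is illustrative, not a value from MDETR or the log):

```python
# torch.distributed.launch with --nproc_per_node=8 starts 8 copies of
# main.py, and each copy builds its own DataLoader with its own pool of
# worker processes.
nproc_per_node = 8   # from the launch command above
num_workers = 5      # HYPOTHETICAL per-process DataLoader worker count

workers = nproc_per_node * num_workers
print(f"{workers} DataLoader workers on this node,")
print("before counting the OpenMP/MKL/NCCL threads each process spawns.")
```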

This is the exit output of the cluster I am running on. The thread limit was set very high, so the limit itself was not the issue. The problem seems to be in the DataLoader: each of the 8 launched processes spawns its own pool of loader workers, so the job creates far more threads than it actually needs. To fix this, add a `--num_workers 0` argument to the command (see the corrected command below). Hope that helps someone!
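
For reference, this is the same command with the fix applied; the only change is `--num_workers 0` appended at the end:

```
python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir --num_workers 0
```

And a minimal sketch of why the flag helps (illustrative PyTorch code, not MDETR's actual data-loading pipeline): with `num_workers=0`, the DataLoader loads batches in the training process itself, so no extra worker processes or their threads are spawned.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(128, 3))  # placeholder data, not GQA

# num_workers=0 -> batches are loaded in the main process; no worker
# processes (and none of their threads) are created, which keeps the
# job under LSF's thread limit.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

for batch in loader:
    pass  # a training step would go here
```

The trade-off is that data loading now runs serially inside each training process, which may slow epochs down; a small positive `--num_workers` might also stay under the limit, but 0 is the safe setting.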