Hi all,

essentially I spent the day today trying to figure out why the code exits with the error message below when run with 8 GPUs on one node. This is the command I ran:

`python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir`

And this is the exit output of the cluster (LSF) I am running this on:

`TERM_THREADLIMIT: job killed after reaching LSF thread limit.
Exited with exit code 1.

Resource usage summary:
CPU time : 1294.38 sec.
Max Memory : 66465 MB
Average Memory : 3062.34 MB
Total Requested Memory : 256000.00 MB
Delta Memory : 189535.00 MB
Max Swap : -
Max Processes : 35
Max Threads : 2482
Run time : 482 sec.
Turnaround time : 526 sec.

The output (if any) is above this job summary.`

The thread limit on the cluster was already set quite high, so the limit itself was not the issue. The issue seems to be with the DataLoader: the code spawns far more threads than necessary, and the job gets killed once it crosses the limit. To fix this, add a --num_workers 0 argument to the launch command above. Hope that helps someone!
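In case it helps to see why num_workers matters, here is a minimal sketch (not the actual code from this repository; the flag name and the tiny dataset are just placeholders) of how the worker count feeds into the DataLoader:

```python
# Minimal sketch, assuming a --num_workers flag like the one suggested above;
# the dataset here is a placeholder, not the GQA setup from the real command.
import argparse

import torch
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument("--num_workers", type=int, default=0)
args = parser.parse_args()

dataset = TensorDataset(torch.arange(1000, dtype=torch.float32))

# num_workers=0: batches are loaded in the main process, no extra processes.
# num_workers=N: each of the 8 processes started by torch.distributed.launch
# forks N worker processes, and each worker can spin up its own OpenMP/MKL
# thread pool, which is how the total thread count can blow past the LSF limit.
loader = DataLoader(dataset, batch_size=32, num_workers=args.num_workers)

for (batch,) in loader:
    pass  # training step would go here
```

The trade-off is that with --num_workers 0 data loading happens synchronously in the main process, so throughput may drop, but the thread count stays bounded.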