Users with a similar issue (training gets stuck, one GPU at 0%, the other GPUs at 100%):

- https://discuss.pytorch.org/t/nn-dataparallel-gets-stuck/125427/12
- https://discuss.pytorch.org/t/training-job-stalls-with-no-logs-gpu-usage-spike/89892/27
- https://github.com/pytorch/pytorch/issues/22259
I'm trying to train an instance segmentation model using Mask2Former with a Swin-B (IN21k) backbone.

## When it Hangs

(full logs)

No new `[03/31 00:28:13 d2.utils.events]: eta: ...` lines appear afterwards, and I waited about an hour to check. I've tried multiple times: one run hung after iteration 839, another hung at iteration 59, and in the logs above it hung after iteration 219.

## GPU Utilization
Before it hangs, all 8 GPUs are utilized between 25% and 90%, and there is a `python3` process associated with each GPU.

When it hangs, 7 GPUs sit at 100% utilization and one GPU sits at 0%, and a `python3` process is associated with only 7 of the 8 GPUs.
### GPU Utilization **Before** it Hangs

```
                       GPU Memory              GPU-Util
Tesla V100-SXM2...     13198MiB / 16384MiB     69%
Tesla V100-SXM2...     13226MiB / 16384MiB     74%
Tesla V100-SXM2...     13128MiB / 16384MiB     90%
Tesla V100-SXM2...     13134MiB / 16384MiB     58%
Tesla V100-SXM2...     13206MiB / 16384MiB     44%
Tesla V100-SXM2...     13222MiB / 16384MiB     27%
Tesla V100-SXM2...     13170MiB / 16384MiB     25%
Tesla V100-SXM2...     13142MiB / 16384MiB     52%
```

```
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18684      C   /usr/bin/python3                13195MiB |
|    1   N/A  N/A     18685      C   /usr/bin/python3                13223MiB |
|    2   N/A  N/A     18686      C   /usr/bin/python3                13127MiB |
|    3   N/A  N/A     18687      C   /usr/bin/python3                13131MiB |
|    4   N/A  N/A     18688      C   /usr/bin/python3                13203MiB |
|    5   N/A  N/A     18689      C   /usr/bin/python3                13219MiB |
|    6   N/A  N/A     18690      C   /usr/bin/python3                13167MiB |
|    7   N/A  N/A     18691      C   /usr/bin/python3                13139MiB |
+-----------------------------------------------------------------------------+
```

### GPU Utilization When it Hangs

```
                       GPU Memory              GPU-Util
Tesla V100-SXM2...     13198MiB / 16384MiB     100%
Tesla V100-SXM2...     13226MiB / 16384MiB     100%
Tesla V100-SXM2...     13130MiB / 16384MiB     100%
Tesla V100-SXM2...     13134MiB / 16384MiB     100%
Tesla V100-SXM2...     13206MiB / 16384MiB     100%
Tesla V100-SXM2...     13222MiB / 16384MiB     100%
Tesla V100-SXM2...       339MiB / 16384MiB       0%
Tesla V100-SXM2...     13142MiB / 16384MiB     100%
```

### Interesting to note: the GPU 6 process is no longer shown!

(`dmesg` doesn't show anything interesting about why this process ended, so I assume it wasn't OOM-killed.)

```
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18684      C   /usr/bin/python3                13195MiB |
|    1   N/A  N/A     18685      C   /usr/bin/python3                13223MiB |
|    2   N/A  N/A     18686      C   /usr/bin/python3                13127MiB |
|    3   N/A  N/A     18687      C   /usr/bin/python3                13131MiB |
|    4   N/A  N/A     18688      C   /usr/bin/python3                13203MiB |
|    5   N/A  N/A     18689      C   /usr/bin/python3                13219MiB |
|    7   N/A  N/A     18691      C   /usr/bin/python3                13139MiB |
+-----------------------------------------------------------------------------+
```
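Since the surviving workers appear to be busy-waiting (likely inside a collective), one debug step I'm considering is making every rank dump its Python stack on demand. A minimal sketch, not something I've wired in yet; it would go near the top of the training entry point:

```python
# Sketch: let each rank dump its Python stack traces on demand, so that when
# the job hangs I can see which call the surviving workers are blocked in.
import faulthandler
import signal

# After this, `kill -USR1 <worker pid>` prints that worker's stack traces
# (all threads) to stderr without killing the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

Alternatively, `py-spy dump --pid <pid>` on each surviving worker should give similar information without any code changes.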
## Environment

- `torch==1.10` (`cu113`)
- `detectron2` built for `torch` 1.10 / `cu113`
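If a fuller environment dump is useful, I can post the output of detectron2's bundled environment collector, which I believe is equivalent to running `python -m detectron2.utils.collect_env`:

```python
# Sketch: print the environment detectron2 itself reports
# (torch/CUDA/GPU/driver versions) using its bundled collector.
from detectron2.utils.collect_env import collect_env_info

print(collect_env_info())
```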
## Config

My config is based on the provided COCO `maskformer2_swin_base_IN21k_384_bs16_50ep.yaml` with the following changes (see the sketch after this list):

- `SOLVER.STEPS` removed from the config
- `MODEL.SEM_SEG_HEAD.IGNORE_VALUE` changed to 323
- `MODEL.SEM_SEG_HEAD.NUM_CLASSES` changed to 323
- `INPUT.DATASET_MAPPER_NAME` changed to `"mask_former_instance"`
- `TEST.EVAL_PERIOD` changed to 1000
- `INPUT.CROP` was added
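For clarity, the overrides above amount to something like the following. This is only a sketch using the standard detectron2 config API plus Mask2Former's `add_maskformer2_config` helper (import names assumed from the Mask2Former repo), not the exact script I run:

```python
# Sketch of the config changes listed above, assuming the standard detectron2
# cfg API and Mask2Former's add_maskformer2_config / add_deeplab_config setup.
from detectron2.config import get_cfg
from detectron2.projects.deeplab import add_deeplab_config
from mask2former import add_maskformer2_config

cfg = get_cfg()
add_deeplab_config(cfg)
add_maskformer2_config(cfg)
# Base config: the provided COCO Swin-B (IN21k) config
# (adjust the path to wherever it lives in your checkout).
cfg.merge_from_file("maskformer2_swin_base_IN21k_384_bs16_50ep.yaml")

# SOLVER.STEPS was removed from the YAML itself (no override shown here).
cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE = 323
cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES = 323
cfg.INPUT.DATASET_MAPPER_NAME = "mask_former_instance"
cfg.TEST.EVAL_PERIOD = 1000
# INPUT.CROP settings were also added (values omitted here).
```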
## Run Command

## Dataset
https://www.myfoodrepo.org/
## Things I've Tried
None of the following changes have fixed the problem:

- `DATALOADER.NUM_WORKERS` from 4 to 0
- `SOLVER.BASE_LR` from 0.0001 to 0.00001
- `SOLVER.AMP.ENABLED` from `True` to `False`
## What's the problem? Any idea how to fix?

1. Any idea what's going wrong here?
2. It appears that detectron2 already defaults to `logging.DEBUG`. Is there any way I can force `detectron2`/`Mask2Former` to produce more logging detail?
3. Is it getting stuck within a training iteration? `SOLVER.MAX_ITER` and `SOLVER.STEPS` are relatively high numbers, although I don't understand why that would result in the termination of one of the GPU processes (I don't know the PyTorch internals well).
4. What further debug steps would you recommend? (One thing I'm considering is sketched below.)
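Regarding extra logging, one thing I could try before the next run is turning on distributed/NCCL debug output. A sketch; these are generic PyTorch/NCCL environment variables rather than detectron2 options, and they must be set before torch initializes NCCL:

```python
# Sketch: extra distributed-layer logging, set before launching training.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL logs init + collective activity
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"     # restrict NCCL logs to init/collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra c10d checks/logging (recent PyTorch)
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # abort on collective timeout instead of hanging forever
```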