lapp0 commented 2 years ago

When it hangs

I'm trying to train an instance segmenting model using Mask2Former with a Swin-B (IN21k) backbone.

/home/ubuntu/.local/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[03/31 00:26:38 d2.utils.events]:  eta: 4 days, 11:18:51  iter: 19  total_loss: 99.75  loss_ce: 2.976  loss_mask: 3.272  loss_dice: 3.015  loss_ce_0: 11.88  loss_mask_0: 2.624  loss_dice_0: 2.985  loss_ce_1: 3.312  loss_mask_1: 2.749  loss_dice_1: 2.871  loss_ce_2: 2.883  loss_mask_2: 2.787  loss_dice_2: 2.95  loss_ce_3: 2.882  loss_mask_3: 3.024  loss_dice_3: 3.011  loss_ce_4: 2.891  loss_mask_4: 3.13  loss_dice_4: 2.937  loss_ce_5: 2.951  loss_mask_5: 3.171  loss_dice_5: 2.953  loss_ce_6: 2.968  loss_mask_6: 3.257  loss_dice_6: 3.014  loss_ce_7: 3.011  loss_mask_7: 3.291  loss_dice_7: 2.986  loss_ce_8: 2.944  loss_mask_8: 3.25  loss_dice_8: 3.013  time: 0.9266  data_time: 0.6759  lr: 1e-05  max_mem: 10432M
[03/31 00:26:49 d2.utils.events]:  eta: 2 days, 19:51:57  iter: 39  total_loss: 88.92  loss_ce: 2.551  loss_mask: 3.236  loss_dice: 2.81  loss_ce_0: 11.84  loss_mask_0: 2.241  loss_dice_0: 2.595  loss_ce_1: 2.68  loss_mask_1: 2.258  loss_dice_1: 2.642  loss_ce_2: 2.583  loss_mask_2: 2.25  loss_dice_2: 2.736  loss_ce_3: 2.534  loss_mask_3: 2.37  loss_dice_3: 2.764  loss_ce_4: 2.488  loss_mask_4: 2.436  loss_dice_4: 2.773  loss_ce_5: 2.495  loss_mask_5: 2.499  loss_dice_5: 2.742  loss_ce_6: 2.508  loss_mask_6: 2.822  loss_dice_6: 2.729  loss_ce_7: 2.511  loss_mask_7: 3.066  loss_dice_7: 2.752  loss_ce_8: 2.516  loss_mask_8: 3.234  loss_dice_8: 2.776  time: 0.7301  data_time: 0.0112  lr: 1e-05  max_mem: 10432M

<redundant logs removed for conciseness>

[03/31 00:27:56 d2.utils.events]:  eta: 1 day, 14:43:59  iter: 179  total_loss: 69.42  loss_ce: 2.186  loss_mask: 2.038  loss_dice: 1.908  loss_ce_0: 11.49  loss_mask_0: 1.89  loss_dice_0: 1.81  loss_ce_1: 2.277  loss_mask_1: 1.935  loss_dice_1: 1.681  loss_ce_2: 2.278  loss_mask_2: 1.917  loss_dice_2: 1.691  loss_ce_3: 2.303  loss_mask_3: 1.986  loss_dice_3: 1.682  loss_ce_4: 2.336  loss_mask_4: 2.026  loss_dice_4: 1.829  loss_ce_5: 2.376  loss_mask_5: 1.96  loss_dice_5: 1.71  loss_ce_6: 2.263  loss_mask_6: 2.116  loss_dice_6: 1.769  loss_ce_7: 2.205  loss_mask_7: 2.084  loss_dice_7: 1.849  loss_ce_8: 2.197  loss_mask_8: 2.04  loss_dice_8: 1.757  time: 0.5221  data_time: 0.0120  lr: 1e-05  max_mem: 10781M
[03/31 00:28:05 d2.utils.events]:  eta: 1 day, 14:49:30  iter: 199  total_loss: 71.78  loss_ce: 2.394  loss_mask: 1.906  loss_dice: 1.883  loss_ce_0: 11.46  loss_mask_0: 1.814  loss_dice_0: 1.933  loss_ce_1: 2.344  loss_mask_1: 1.8  loss_dice_1: 1.826  loss_ce_2: 2.473  loss_mask_2: 1.766  loss_dice_2: 1.754  loss_ce_3: 2.501  loss_mask_3: 1.891  loss_dice_3: 1.862  loss_ce_4: 2.41  loss_mask_4: 1.886  loss_dice_4: 1.865  loss_ce_5: 2.489  loss_mask_5: 1.886  loss_dice_5: 1.803  loss_ce_6: 2.421  loss_mask_6: 1.902  loss_dice_6: 1.853  loss_ce_7: 2.423  loss_mask_7: 2.005  loss_dice_7: 1.856  loss_ce_8: 2.459  loss_mask_8: 1.94  loss_dice_8: 1.851  time: 0.5158  data_time: 0.0113  lr: 1e-05  max_mem: 10781M
[03/31 00:28:13 d2.utils.events]:  eta: 1 day, 14:44:40  iter: 219  total_loss: 72.38  loss_ce: 2.343  loss_mask: 2.006  loss_dice: 1.857  loss_ce_0: 11.39  loss_mask_0: 1.82  loss_dice_0: 1.875  loss_ce_1: 2.297  loss_mask_1: 1.814  loss_dice_1: 1.785  loss_ce_2: 2.293  loss_mask_2: 1.943  loss_dice_2: 1.769  loss_ce_3: 2.355  loss_mask_3: 1.968  loss_dice_3: 1.79  loss_ce_4: 2.489  loss_mask_4: 1.866  loss_dice_4: 1.726  loss_ce_5: 2.504  loss_mask_5: 1.905  loss_dice_5: 1.809  loss_ce_6: 2.339  loss_mask_6: 1.907  loss_dice_6: 1.812  loss_ce_7: 2.366  loss_mask_7: 1.935  loss_dice_7: 1.928  loss_ce_8: 2.388  loss_mask_8: 2.018  loss_dice_8: 1.941  time: 0.5058  data_time: 0.0126  lr: 1e-05  max_mem: 10781M
/home/ubuntu/.local/lib/python3.8/site-packages/detectron2/structures/image_list.py:88: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  max_size = (max_size + (stride - 1)) // stride * stride
/home/ubuntu/Mask2Former/mask2former/modeling/transformer_decoder/position_encoding.py:41: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)

(full logs)

There are no new [03/31 00:28:13 d2.utils.events]: eta: ... logs that come after, and I waited about an hour to see. I've tried multiple times.

One time it hanged after iteration 839, another time it hanged on iteration 59. In the above logs it hanged after iter 219.

GPU Utilization

Before it hangs, the 8 GPUs are utilized between 25% and 90% and there is a python3 process associated with each GPU.

When it hangs, 7 GPUs are utilized 100% and one GPU is utilized 0%. There is a python3 process associated with only 7/8 of the GPUs.

**Click to expand GPU info**

### GPU Utilization **Before** it Hangs ``` GPU Memory GPU-Util Tesla V100-SXM2... 13198MiB / 16384MiB | 69% Tesla V100-SXM2... 13226MiB / 16384MiB | 74% Tesla V100-SXM2... 13128MiB / 16384MiB | 90% Tesla V100-SXM2... 13134MiB / 16384MiB | 58% Tesla V100-SXM2... 13206MiB / 16384MiB | 44% Tesla V100-SXM2... 13222MiB / 16384MiB | 27% Tesla V100-SXM2... 13170MiB / 16384MiB | 25% Tesla V100-SXM2... 13142MiB / 16384MiB | 52% ``` ``` +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 18684 C /usr/bin/python3 13195MiB | | 1 N/A N/A 18685 C /usr/bin/python3 13223MiB | | 2 N/A N/A 18686 C /usr/bin/python3 13127MiB | | 3 N/A N/A 18687 C /usr/bin/python3 13131MiB | | 4 N/A N/A 18688 C /usr/bin/python3 13203MiB | | 5 N/A N/A 18689 C /usr/bin/python3 13219MiB | | 6 N/A N/A 18690 C /usr/bin/python3 13167MiB | | 7 N/A N/A 18691 C /usr/bin/python3 13139MiB | +-----------------------------------------------------------------------------+ ``` ### GPU Utilization When it Hangs ``` GPU Memory GPU-Util Tesla V100-SXM2... 13198MiB / 16384MiB | 100% Tesla V100-SXM2... 13226MiB / 16384MiB | 100% Tesla V100-SXM2... 13130MiB / 16384MiB | 100% Tesla V100-SXM2... 13134MiB / 16384MiB | 100% Tesla V100-SXM2... 13206MiB / 16384MiB | 100% Tesla V100-SXM2... 13222MiB / 16384MiB | 100% Tesla V100-SXM2... 339MiB / 16384MiB | 0% Tesla V100-SXM2... 13142MiB / 16384MiB | 100% ``` ### INTERESTING TO NOTE: GPU 6 process is no longer is shown! (`dmesg` doesn't show anything interesting regarding why this process ended, so no OOM I assume) ``` +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 18684 C /usr/bin/python3 13195MiB | | 1 N/A N/A 18685 C /usr/bin/python3 13223MiB | | 2 N/A N/A 18686 C /usr/bin/python3 13127MiB | | 3 N/A N/A 18687 C /usr/bin/python3 13131MiB | | 4 N/A N/A 18688 C /usr/bin/python3 13203MiB | | 5 N/A N/A 18689 C /usr/bin/python3 13219MiB | | 7 N/A N/A 18691 C /usr/bin/python3 13139MiB | +-----------------------------------------------------------------------------+ ```

Environment

Based on "Deep Learning AMI GPU CUDA 11.3.1 (Ubuntu 20.04) 20220303"
torch==1.10 cu113
detectron2 for torch1.10 cu113

Config

My config is based on the provided COCO maskformer2_swin_base_IN21k_384_bs16_50ep.yaml with the following changes:

SOLVER.STEPS removed from config
MODEL.SEM_SEG_HEAD.IGNORE_VALUE changed to 323
MODEL.SEM_SEG_HEAD.NUM_CLASSES changed to 323
INPUT.DATASET_MAPPER_NAME changed to "mask_former_instance"
TEST.EVAL_PERIOD changed to 1000

INPUT.CROP was added:

CROP:
ENABLED: True
TYPE: "absolute"
SIZE: (640, 640)
SINGLE_CATEGORY_MAX_AREA: 1.0

Run Command

DETECTRON2_DATASETS=/home/ubuntu/detectron_datasets python3 train_net.py --num-gpus 8 \
  --config-file configs/food/instance-segmentation/swin/maskformer2_swin_base_IN21k_384_bs16_160k.yaml

Dataset

https://www.myfoodrepo.org/

Things I've Tried

None of the following changes have fixed the error:

Change DATALOADER.NUM_WORKERS from 4 to 0
Change SOLVER.BASE_LR from 0.0001 to 0.00001
Change SOLVER.AMP.ENABLED from True to False
Use 8 A100 GPUs
Use 8 V100 GPUs

What's the problem? Any idea how to fix?

Any idea what's going wrong here?

It appears that detectron2 already defaults to logging.DEBUG. Is there any way I can force detectron2 / Mask2Former to produce more logging details?

Is it getting stuck in a training iteration? SOLVER.MAX_ITER and SOLVER.STEPS are relatively high numbers. Although, I don't understand why this would result in termination of one of the GPU processes. However I don't understand pytorch internals well.

What further debug steps would you recommend?

lapp0 commented 2 years ago

Users with similar issue (getting stuck, one GPU at 0%, other GPUs at 100%)

https://discuss.pytorch.org/t/nn-dataparallel-gets-stuck/125427/12