Open yxchng opened 5 years ago
How did you launch your 2 GPU job? This behavior is not expected.
Also, I just noticed that you have two different GPUs. What might be happening is that the faster GPU is waiting for the slower GPU to finish its iteration.
It seems that the 2080 Ti does not have peer-to-peer (P2P) enabled, which can make multi-GPU training much slower because memory transfers between GPUs have to pass through the CPU:
https://www.pugetsystems.com/labs/hpc/P2P-peer-to-peer-on-NVIDIA-RTX-2080Ti-vs-GTX-1080Ti-GPUs-1331/
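If you want to verify this from PyTorch directly, here is a minimal sketch (it only assumes a CUDA-enabled PyTorch install; the printout format is arbitrary):

import torch

# Report whether direct GPU-to-GPU (P2P) access is possible for each pair of devices.
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src == dst:
            continue
        ok = torch.cuda.can_device_access_peer(src, dst)
        print("GPU %d (%s) -> GPU %d: P2P %s"
              % (src, torch.cuda.get_device_name(src), dst,
                 "available" if ok else "NOT available"))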
I reinstalled the NVIDIA driver and installed the latest pytorch-nightly, and the problem disappeared.
@fmassa My previous assessment of the problem was wrong. The actual problem is that the program often turns into a zombie process when I ctrl-c to kill it, meaning it is no longer running but is still hogging memory and still appears in top and nvidia-smi. The 100% utilization displayed in nvidia-smi is misleading because the program has already stopped. I always have to kill each started process manually by its PID using the kill command. Sometimes even killing doesn't work; in those cases, I can only reboot my computer.
I run my GPU job using the command
NGPU=2
python -m torch.distributed.launch --nproc_per_node=$NGPU tools/train_net.py --config-file configs/<...>
One of the config files I used is as follows:
MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/FAIR/20171220/X-101-32x8d"
  BACKBONE:
    CONV_BODY: "R-101-FPN"
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
    STRIDE_IN_1X1: False
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
DATASETS:
  TRAIN: ("crowdhuman_train",)
  TEST: ("crowdhuman_val",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.02
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  IMS_PER_BATCH: 2
TEST:
  IMS_PER_BATCH: 2
INPUT:
  MIN_SIZE_TRAIN: (800,)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
OUTPUT_DIR: "results/exp2"
I have tried testing with other configs as well, and the problem remains.
I am quite sure there is a bug in the code because this has happened on 2 different computers (I tried running it on AWS using 2x P100s as well).
PyTorch version: 1.1.0a0+be364ac
Is debug build: No
CUDA used to build PyTorch: 10.1.105
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
Nvidia driver version: 410.104
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0
Versions of relevant libraries:
[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.16.2
[pip] torch==1.1.0a0+be364ac
[pip] torchtext==0.4.0
[pip] torchvision==0.2.1
[conda] blas 1.0 mkl anaconda
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] mkl_fft 1.0.10 py36ha843d7b_0 anaconda
[conda] mkl_random 1.0.2 py36hd81dba3_0 anaconda
[conda] torch 1.1.0a0+be364ac pypi_0 pypi
[conda] torchtext 0.4.0 pypi_0 pypi
[conda] torchvision 0.2.1 pypi_0 pypi
Pillow (5.3.0.post0)
I thought I had solved it, but apparently not.
This is a problem with the cleanup in the PyTorch distributed launch utility: when one of the processes dies, the others might not be killed.
cc @pietern to see if he has ideas on how to avoid this situation.
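To illustrate the idea (this is not what torch.distributed.launch actually does, just a sketch; worker_cmd and the worker count of 2 are placeholders), a launcher can start each worker in its own process group and tear everything down on Ctrl-C or when any worker exits:

import os
import signal
import subprocess
import sys

worker_cmd = [sys.executable, "tools/train_net.py"]  # placeholder command

# Start each worker in its own process group so its dataloader children
# can be killed together with it.
procs = [subprocess.Popen(worker_cmd, preexec_fn=os.setsid) for _ in range(2)]

def kill_all(*_):
    for p in procs:
        if p.poll() is None:
            os.killpg(os.getpgid(p.pid), signal.SIGTERM)
    sys.exit(1)

signal.signal(signal.SIGINT, kill_all)   # propagate Ctrl-C to every worker
signal.signal(signal.SIGTERM, kill_all)

for p in procs:
    if p.wait() != 0:  # one worker died -> take the rest down with it
        kill_all()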
If you use ctrl-c to stop the program, be careful to kill every process. In your case (2 GPUs), there are around 2 + 8 (data loading) processes. I usually run ps aux | grep python to find and kill everything related to the training program.
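If you want to script that cleanup, a rough Python equivalent (the pattern "train_net.py" is only an example; match whatever appears in your launch command):

import os
import signal
import subprocess

# Find every process whose command line mentions the training entry point
# and send it SIGKILL. Requires the standard pgrep utility.
result = subprocess.run(["pgrep", "-f", "train_net.py"],
                        stdout=subprocess.PIPE, universal_newlines=True)
for pid in result.stdout.split():
    try:
        os.kill(int(pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # already gone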
@chengyangfu My expectation is that the ctrl-c signal should propagate to every process and I shouldn't have to manually kill all of them. A good implementation of multiprocessing code should not have such a problem, so I would consider this a bug. I have not had time to read through the code yet, but is this library just using the tools provided by PyTorch, such that the problem lies in PyTorch? It is still strange, though, because I have been using PyTorch's DataParallel all the time in my other multi-GPU training code and I have never encountered such a problem.
I was browsing through the issues, and it seems that https://github.com/facebookresearch/maskrcnn-benchmark/issues/58 is related to the problem discussed here. The root cause is probably the same: the coordination and communication among the many launched processes are problematic.
I have the same problem.
Same here
I met a similar problem. I trained the model with 4 GPUs. After training for thousands of mini-batches, one process died (I cannot tell when or how it died); the utilization of the other three GPUs stayed at 100%, but training had stopped.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26 Driver Version: 387.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 20% 27C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 20% 32C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 20% 28C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 20% 30C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 24% 58C P2 77W / 250W | 3764MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:86:00.0 Off | N/A |
| 20% 54C P2 78W / 250W | 4110MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 20% 29C P8 15W / 250W | 41MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:8A:00.0 Off | N/A |
| 24% 58C P2 74W / 250W | 3906MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 4 65245 C ...ongsq/environments/anaconda3/bin/python 3731MiB |
| 5 65246 C ...ongsq/environments/anaconda3/bin/python 4077MiB |
| 7 65248 C ...ongsq/environments/anaconda3/bin/python 3873MiB |
+-----------------------------------------------------------------------------+
As shown above, the process whose PID should be 65247 has been killed for some reason. How should I fix this problem? I cannot reinstall the NVIDIA driver because I do not have root access.
@Marcovaldong This is not related to the zombie process problem tracked in this issue.
What you're seeing is that a single process crashing causes the remaining processes to launch NCCL kernels that will never complete. This is a known problem with NCCL and has been addressed in the most recent minor release (2.4). There is work in progress to add the error detection to the NCCL bindings in PyTorch in pytorch/pytorch#22907. Once that is done and merged, the remaining processes will raise an error once one of its peers is no longer reachable or has crashed.
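In the meantime, on builds that already support it, you can at least make the hang bounded by enabling blocking waits and a timeout when the process group is initialized. A minimal sketch (assumes the script is launched via torch.distributed.launch so the rendezvous environment variables are set; whether NCCL_BLOCKING_WAIT is honored depends on your PyTorch/NCCL version, so treat this as an assumption rather than the fix tracked in that PR):

import datetime
import os

import torch.distributed as dist

# Assumption: the installed PyTorch honors NCCL_BLOCKING_WAIT; with it set,
# collectives raise after the timeout instead of spinning forever when a peer dies.
os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),
)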
@pietern Thanks for your reply. I have fixed my problem: there was a dirty sample in my 700k-image training dataset, and I have tracked it down.
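For anyone else hunting for a bad sample, one simple first pass is to verify every image up front; a rough sketch (the directory path and extension are placeholders):

from pathlib import Path

from PIL import Image

# Flag files that PIL cannot decode; verify() is a cheap integrity check
# that does not load the full pixel data.
for path in Path("datasets/crowdhuman/train").rglob("*.jpg"):  # placeholder path
    try:
        with Image.open(path) as img:
            img.verify()
    except Exception as exc:
        print("corrupt image:", path, exc)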
I'm still having this issue in 2022. It occurs when my training process goes awry and a tensor of NaN values is fed to torch.nn.functional.binary_cross_entropy. I then have to close the terminal window and cannot kill the resulting zombie process. The only solution seems to be to restart the server. It may be a coincidence, but this behaviour is new since I upgraded the NVIDIA driver a few days ago.
p.s. training with two different GPUs using nn.DataParallel.
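As a stopgap, a small guard in front of the loss can at least fail loudly before the GPU process wedges; a sketch (the helper name safe_bce is made up):

import torch
import torch.nn.functional as F

def safe_bce(probs, targets):
    # Refuse to feed NaN/Inf into binary_cross_entropy; raising here is easier
    # to recover from than a stuck GPU process.
    if not torch.isfinite(probs).all():
        raise FloatingPointError("non-finite values reached the BCE input")
    return F.binary_cross_entropy(probs, targets)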
Has anyone found a solution yet? None of the solutions above work for me.
CUDA version: 11.7
PyTorch: 1.11.0
Python 3.7.13
Ubuntu 18.04.6 LTS
NVIDIA-SMI 515.43.04
Driver Version: 515.43.04
Same here
🐛 Bug
0% utilization on the second GPU in 2-GPU training
Is the second GPU only used to store tensors? Is multi-GPU training in this codebase specially implemented, such that it is different from multi-GPU training in PyTorch?
To Reproduce
Run training code with 2 GPUs
Expected behavior
Comparable utilization on both GPUs?
Environment
UPDATE: Note that this is actually an incorrect description of the problem, but it is kept here to preserve the flow of the thread. The correct description of the problem is in the post below.