Open yxchng opened 5 years ago
How did you launch your 2 GPU job? This behavior is not expected.
Also, I just noticed that you have two different GPUs. What might be happening is that the faster GPU is waiting for the slower GPU to finish its iteration.
It seems that the 2080 Ti does not have peer-to-peer (P2P) enabled, which can make multi-GPU training much slower because memory transfers between GPUs have to pass through the CPU:
https://www.pugetsystems.com/labs/hpc/P2P-peer-to-peer-on-NVIDIA-RTX-2080Ti-vs-GTX-1080Ti-GPUs-1331/
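If you want to verify this from PyTorch directly, here is a minimal sketch (it only assumes a CUDA-enabled PyTorch install; the printout format is arbitrary):

import torch

# Report whether direct GPU-to-GPU (P2P) access is possible for each pair of devices.
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src == dst:
            continue
        ok = torch.cuda.can_device_access_peer(src, dst)
        print("GPU %d (%s) -> GPU %d: P2P %s"
              % (src, torch.cuda.get_device_name(src), dst,
                 "available" if ok else "NOT available"))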
I reinstalled the NVIDIA driver and installed the latest pytorch-nightly, and the problem disappeared.
@fmassa My previous assessment of the problem was wrong. The actual problem is that the program often turns into a zombie process when I ctrl-c to kill it, meaning it is no longer running but is still hogging memory and still appears in top and nvidia-smi. The 100% utilization displayed in nvidia-smi is misleading because the program has already stopped. I always have to kill each started process manually by its PID using the kill command. Sometimes even killing doesn't work; in those cases, I can only reboot my computer.
I run my GPU job using the command
NGPU=2
python -m torch.distributed.launch --nproc_per_node=$NGPU tools/train_net.py --config-file configs/<...>
One of the config files I used is as follows:
MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/FAIR/20171220/X-101-32x8d"
  BACKBONE:
    CONV_BODY: "R-101-FPN"
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
    STRIDE_IN_1X1: False
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
DATASETS:
  TRAIN: ("crowdhuman_train",)
  TEST: ("crowdhuman_val",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.02
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  IMS_PER_BATCH: 2
TEST:
  IMS_PER_BATCH: 2
INPUT:
  MIN_SIZE_TRAIN: (800,)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
OUTPUT_DIR: "results/exp2"
I have tried testing with other configs as well, and the problem remains.
I am quite sure there is a bug in the code because this has happened on 2 different computers (I tried running it on AWS using 2x P100s as well).
PyTorch version: 1.1.0a0+be364ac
Is debug build: No
CUDA used to build PyTorch: 10.1.105
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
Nvidia driver version: 410.104
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0
Versions of relevant libraries:
[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.16.2
[pip] torch==1.1.0a0+be364ac
[pip] torchtext==0.4.0
[pip] torchvision==0.2.1
[conda] blas 1.0 mkl anaconda
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] mkl_fft 1.0.10 py36ha843d7b_0 anaconda
[conda] mkl_random 1.0.2 py36hd81dba3_0 anaconda
[conda] torch 1.1.0a0+be364ac pypi_0 pypi
[conda] torchtext 0.4.0 pypi_0 pypi
[conda] torchvision 0.2.1 pypi_0 pypi
Pillow (5.3.0.post0)
I thought I had solved it, but apparently not.
This is a problem with the cleanup in the PyTorch distributed launch utility: when one of the processes dies, the others might not be killed.
cc @pietern to see if he has ideas on how to avoid this situation.
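To illustrate the idea (this is not what torch.distributed.launch actually does, just a sketch; worker_cmd and the worker count of 2 are placeholders), a launcher can start each worker in its own process group and tear everything down on Ctrl-C or when any worker exits:

import os
import signal
import subprocess
import sys

worker_cmd = [sys.executable, "tools/train_net.py"]  # placeholder command

# Start each worker in its own process group so its dataloader children
# can be killed together with it.
procs = [subprocess.Popen(worker_cmd, preexec_fn=os.setsid) for _ in range(2)]

def kill_all(*_):
    for p in procs:
        if p.poll() is None:
            os.killpg(os.getpgid(p.pid), signal.SIGTERM)
    sys.exit(1)

signal.signal(signal.SIGINT, kill_all)   # propagate Ctrl-C to every worker
signal.signal(signal.SIGTERM, kill_all)

for p in procs:
    if p.wait() != 0:  # one worker died -> take the rest down with it
        kill_all()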
If you use ctrl-c to stop the program, be careful to kill every process. In your case (2 GPUs), there are around 2 + 8 (data loading) processes. I usually run ps aux | grep python to find and kill everything related to the training program.
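If you want to script that cleanup, a rough Python equivalent (the pattern "train_net.py" is only an example; match whatever appears in your launch command):

import os
import signal
import subprocess

# Find every process whose command line mentions the training entry point
# and send it SIGKILL. Requires the standard pgrep utility.
result = subprocess.run(["pgrep", "-f", "train_net.py"],
                        stdout=subprocess.PIPE, universal_newlines=True)
for pid in result.stdout.split():
    try:
        os.kill(int(pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # already gone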
@chengyangfu My expectation is that the ctrl-c signal should propagate to every process and I shouldn't have to manually kill all of them. A good implementation of multiprocessing code should not have such a problem, so I would consider this a bug. I have not had time to read through the code yet, but is this library just using the tools provided by PyTorch, such that the problem lies in PyTorch? It is still strange, though, because I have been using PyTorch's DataParallel all the time in my other multi-GPU training code and I have never encountered such a problem.
I was browsing through the issues, and it seems that https://github.com/facebookresearch/maskrcnn-benchmark/issues/58 is related to the problem discussed here. The root cause is probably the same: the coordination and communication among the many launched processes are problematic.
I have the same problem.
Same here
I met a similar problem. I trained the model with 4 GPUs. After training for thousands of mini-batches, one process died (I cannot tell when or how it died); the utilization of the other three GPUs stayed at 100%, but training had stopped.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26 Driver Version: 387.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 20% 27C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 20% 32C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 20% 28C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 20% 30C P8 16W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 24% 58C P2 77W / 250W | 3764MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:86:00.0 Off | N/A |
| 20% 54C P2 78W / 250W | 4110MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 20% 29C P8 15W / 250W | 41MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:8A:00.0 Off | N/A |
| 24% 58C P2 74W / 250W | 3906MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 4 65245 C ...ongsq/environments/anaconda3/bin/python 3731MiB |
| 5 65246 C ...ongsq/environments/anaconda3/bin/python 4077MiB |
| 7 65248 C ...ongsq/environments/anaconda3/bin/python 3873MiB |
+-----------------------------------------------------------------------------+
As shown above, the process whose PID should be 65247 has been killed for some reason. How should I fix this problem? I cannot reinstall the NVIDIA driver because I do not have root access.
@Marcovaldong This is not related to the zombie process problem tracked in this issue.
What you're seeing is that a single process crashing causes the remaining processes to launch NCCL kernels that will never complete. This is a known problem with NCCL and has been addressed in the most recent minor release (2.4). There is work in progress to add the error detection to the NCCL bindings in PyTorch in pytorch/pytorch#22907. Once that is done and merged, the remaining processes will raise an error once one of its peers is no longer reachable or has crashed.
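In the meantime, on builds that already support it, you can at least make the hang bounded by enabling blocking waits and a timeout when the process group is initialized. A minimal sketch (assumes the script is launched via torch.distributed.launch so the rendezvous environment variables are set; whether NCCL_BLOCKING_WAIT is honored depends on your PyTorch/NCCL version, so treat this as an assumption rather than the fix tracked in that PR):

import datetime
import os

import torch.distributed as dist

# Assumption: the installed PyTorch honors NCCL_BLOCKING_WAIT; with it set,
# collectives raise after the timeout instead of spinning forever when a peer dies.
os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),
)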
@pietern Thanks for your reply. I have fixed my problem: there was a dirty sample in my 700k-image training dataset, and I have tracked it down.
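For anyone else hunting for a bad sample, one simple first pass is to verify every image up front; a rough sketch (the directory path and extension are placeholders):

from pathlib import Path

from PIL import Image

# Flag files that PIL cannot decode; verify() is a cheap integrity check
# that does not load the full pixel data.
for path in Path("datasets/crowdhuman/train").rglob("*.jpg"):  # placeholder path
    try:
        with Image.open(path) as img:
            img.verify()
    except Exception as exc:
        print("corrupt image:", path, exc)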
I'm still having this issue in 2022. It occurs when my training process goes awry and a tensor of NaN values is fed to torch.nn.functional.binary_cross_entropy. I then have to close the terminal window and cannot kill the resulting zombie process. The only solution seems to be to restart the server. It may be a coincidence, but this behaviour is new since I upgraded the NVIDIA driver a few days ago.
p.s. training with two different GPUs using nn.DataParallel.
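As a stopgap, a small guard in front of the loss can at least fail loudly before the GPU process wedges; a sketch (the helper name safe_bce is made up):

import torch
import torch.nn.functional as F

def safe_bce(probs, targets):
    # Refuse to feed NaN/Inf into binary_cross_entropy; raising here is easier
    # to recover from than a stuck GPU process.
    if not torch.isfinite(probs).all():
        raise FloatingPointError("non-finite values reached the BCE input")
    return F.binary_cross_entropy(probs, targets)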
Has anyone found a solution yet? None of the solutions above work for me.
CUDA version: 11.7
PyTorch: 1.11.0
Python 3.7.13
Ubuntu 18.04.6 LTS
NVIDIA-SMI 515.43.04
Driver Version: 515.43.04
Same here
🐛 Bug
0% utilization on the second GPU in 2-GPU training
Is the second GPU only used to store tensors? Is multi-GPU training in this codebase specially implemented, such that it is different from multi-GPU training in PyTorch?
To Reproduce
Run training code with 2 GPUs
Expected behavior
Comparable utilization on both GPUs?
Environment
UPDATE: Note that this is actually an incorrect description of the problem, but it is kept here to preserve the flow of the thread. The correct description of the problem is in the post below.