I found a work-around by specifying only 4 visible GPUs on machine 1:
# On machine 1
CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/train_net.py --num-gpus 4 --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml --machine-rank 1 --num-machines 2 --dist-url tcp://10.0.0.135:12345
After this, the two nodes connect and training starts; the 4 processes on machine 1 now take only 4 GPUs (previously the same 4 processes took 8 GPUs):
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23173      C   ...n/miniconda3/envs/detectron2/bin/python  7249MiB |
|    1     23174      C   ...n/miniconda3/envs/detectron2/bin/python  8111MiB |
|    2     23175      C   ...n/miniconda3/envs/detectron2/bin/python  8369MiB |
|    3     23176      C   ...n/miniconda3/envs/detectron2/bin/python 10717MiB |
+-----------------------------------------------------------------------------+
However, I don't think this is the intended usage. Why weren't the GPUs allocated correctly without the workaround?
This appears to be a PyTorch issue: https://github.com/pytorch/pytorch/issues/52471
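For context, one common cause of this symptom in PyTorch distributed is that a worker process touches the default CUDA device (via an allocation or a collective) before pinning itself to its own GPU, so NCCL lazily creates a second context on GPU 0. A minimal sketch of the usual mitigation, assuming a per-process launcher that sets the env:// rendezvous variables (illustrative, not detectron2's actual code):

import torch
import torch.distributed as dist

def worker(local_rank: int):
    # Pin this process to its own GPU *before* any CUDA call or collective;
    # otherwise NCCL may lazily create a context on the default device (GPU 0),
    # making each process show up on two GPUs in nvidia-smi.
    torch.cuda.set_device(local_rank)
    # Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device="cuda")  # "cuda" now resolves to this worker's GPU
    dist.all_reduce(x)                # the collective stays on the pinned device
    dist.destroy_process_group()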
Instructions To Reproduce the Issue:
No changes; simply install detectron2 and execute example scripts.
a. Install detectron2 on both machines by cloning the repo and running
pip install -e detectron2
b. Link the dataset to the right place.
c. Run the launch commands.

First, installation & dataset setup:
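Presumably something like the following, reconstructed from steps a–b (the dataset path is illustrative; detectron2 looks for COCO under datasets/coco by default):

git clone https://github.com/facebookresearch/detectron2.git
pip install -e detectron2
ln -s /path/to/coco detectron2/datasets/coco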
Then, run the following command on machine 0:
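Presumably the machine-0 counterpart of the workaround command above, i.e. the same launch with --machine-rank 0 and all GPUs visible:

# On machine 0
python tools/train_net.py --num-gpus 4 --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml --machine-rank 0 --num-machines 2 --dist-url tcp://10.0.0.135:12345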
And run the following command on machine 1:
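Presumably the workaround command above without the CUDA_VISIBLE_DEVICES restriction:

# On machine 1
python tools/train_net.py --num-gpus 4 --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml --machine-rank 1 --num-machines 2 --dist-url tcp://10.0.0.135:12345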
Terminal output of machine (node) 0, which hangs after serializing the dataset:
Terminal output of machine (node) 1:
GPU usage of machine (node) 0 from nvidia-smi (seems correct):

GPU usage of machine (node) 1 from nvidia-smi (why does each process take two GPUs?):

Expected behavior:

Each process should take exactly one GPU with --num-gpus 4 in both commands; currently the run takes 4 GPUs on machine 0 but 8 on machine 1.

Environment:
Paste the output of the following command:
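Presumably the standard environment collector from the issue template:

python -m detectron2.utils.collect_env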
machine 0:
machine 1:
Am I missing anything? Thanks!