Closed: wenting-zhao closed this issue 3 years ago
Hi, can you verify that nn.DataParallel and cuda() are actually taking effect, i.e., that lines 106-115 in main.py are executing? https://github.com/QData/LaMP/blob/master/main.py#L106-L115
Yes! I can verify those lines are executing: it prints "Using 5 GPUs!"
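For context, I assume the check on those lines follows the standard PyTorch multi-GPU pattern, roughly like this (a sketch with a placeholder model, not the repo's exact code):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the LaMP model built earlier in main.py.
model = nn.Linear(512, 512)

# Standard PyTorch multi-GPU setup: wrap the model in DataParallel when more
# than one device is visible, then move it onto the GPU.
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()
```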
I also ran nvidia-smi, and here is the result:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   28C    P0    42W / 250W |   1777MiB / 16160MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    48W / 250W |   1471MiB / 16160MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   28C    P0    41W / 250W |   1471MiB / 16160MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   29C    P0    38W / 250W |   1471MiB / 16160MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   29C    P0    37W / 250W |   1399MiB / 16160MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6804      C   python                                      1765MiB |
|    1      6804      C   python                                      1459MiB |
|    2      6804      C   python                                      1459MiB |
|    3      6804      C   python                                      1459MiB |
|    4      6804      C   python                                      1387MiB |
+-----------------------------------------------------------------------------+
Hi there,
When I train with 5 GPUs, i.e.,
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python main.py -dataset nuswide_vector -batch_size 32 -d_model 512 -d_inner_hid 512 -n_layers_enc 2 -n_layers_dec 2 -n_head 4 -epoch 50 -dropout 0.2 -dec_dropout 0.2 -lr 0.0002 -encoder 'mlp' -decoder 'graph' -label_mask 'prior'
I got "(Training) elapse: 11.401 min". However, when I train with 1 GPU, i.e.,
CUDA_VISIBLE_DEVICES=0 python main.py -dataset nuswide_vector -batch_size 32 -d_model 512 -d_inner_hid 512 -n_layers_enc 2 -n_layers_dec 2 -n_head 4 -epoch 50 -dropout 0.2 -dec_dropout 0.2 -lr 0.0002 -encoder 'mlp' -decoder 'graph' -label_mask 'prior'
I got "(Training) elapse: 1.766 min".
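For reference, here is a minimal standalone sketch of the kind of comparison I'm timing (a toy two-layer MLP with made-up sizes, not the actual LaMP model), in case it helps reproduce the gap:

```python
import time
import torch
import torch.nn as nn

def time_steps(model, label, steps=200, batch_size=32, d_model=512):
    # Time forward + backward over a fixed number of steps on random inputs.
    x = torch.randn(batch_size, d_model).cuda()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        model.zero_grad()
        model(x).sum().backward()
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - start:.2f}s for {steps} steps")

# Toy two-layer MLP standing in for the encoder/decoder (made-up sizes).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()

time_steps(model, "1 GPU")

# With a batch of 32 split across 5 GPUs, DataParallel gives each card only
# ~6-7 samples per step, and it replicates the model and gathers outputs on
# every step, so the per-step overhead can dominate the actual compute.
if torch.cuda.device_count() > 1:
    time_steps(nn.DataParallel(model), "multi-GPU (DataParallel)")
```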
I was wondering what might be happening there. Have you run into something similar before? Thanks in advance!