Closed lorafei closed 2 years ago
Something is wrong with your environment
Today I rebuilt the environment and ran:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 parlai tm -t msc \
--model-file /home/sysadmin/fei/ParlAI/log/msc/MemoryLongRagAgent \
--model projects.msc.agents.memory_agent:MemoryLongRagAgent \
--generation-model bart --init-opt arch/bart_large \
--knowledge-access-method memory_only --batchsize 16 -lr 1e-05 --num_epochs 1 \
--save-after-valid True --validation-every-n-epochs 0.1 --validation-max-exs 20000 \
--fp16 true --fp16_impl mem_efficient --truncate 128 --label_truncate 128 \
--log_every_n_steps 1 --model-parallel true
Now I am using --model-parallel on eight A100 GPUs, and training is extremely slow. Logs:
12:57:09 | training...
/home/sysadmin/fei/ParlAI/parlai/core/torch_generator_agent.py:1749: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
hyp_ids = best_idxs // voc_size
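(Side note on the warning above: it only matters for negative operands, since Python/PyTorch `//` floors while the deprecated behavior truncates toward zero. Beam-search indices like `best_idxs` are non-negative, so the two agree there. A small plain-Python illustration of the distinction `torch.div(..., rounding_mode=...)` makes explicit:)

```python
import math

a, b = -7, 2

floor_div = a // b             # `//` always floors: -7 // 2 == -4
trunc_div = math.trunc(a / b)  # truncation rounds toward zero: -3

print(floor_div, trunc_div)

# For non-negative values (like beam-search indices) both conventions agree:
assert 13 // 5 == math.trunc(13 / 5) == 2
```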
12:58:45 | time:96s total_exs:16 total_steps:1 epochs:0.00 time_left:1414736s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 290.7 1 1838 19.3 .6667 186.7 .1680 16 16384 inf .07243 17.07 4.042 1e-05 283 2.971 0 0 57.5 .3212 0 1 2121 22.27 .0105
msc:Session1Self 56 0 0 3 13 4.165 0 0 64.39 .3590 0
msc_dialogue_2 223.3 1 95.33 9 18.22 3.844 0 0 46.73 .3171 0
msc_dialogue_3 592.8 1 464.8 4 20 4.117 0 0 61.38 .2875 0
13:00:24 | time:195s total_exs:32 total_steps:2 epochs:0.00 time_left:1441241s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 328.7 1 1921 19.43 .6667 215.5 .1618 16 8192 inf .1102 29.91 4.371 1e-05 500 5.057 0 0 82.24 .3138 0 2 2421 24.48 .01012
msc:Session1Self 83.67 0 0 3 15 4.139 0 0 62.75 .4444 0
msc_dialogue_2 262.4 1 134.4 9 31.22 4.219 0 0 67.94 .2384 0
msc_dialogue_3 640 1 512 4 43.5 4.754 0 0 116 .2586 0
13:02:08 | time:298s total_exs:48 total_steps:3 epochs:0.00 time_left:1472768s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 389.9 1 2021 19.53 .6667 265.5 .1546 16 8192 81.18 .1278 30.91 4.358 1e-05 563 5.442 0 0 83.61 .2602 0 3 2584 24.98 .009668
msc:Session1Self 117 0 0 3 15.67 4.709 0 0 110.9 .2128 0
msc_dialogue_2 319.6 1 191.6 9 41.56 4.548 0 0 94.44 .2861 0
msc_dialogue_3 733 1 605 4 35.5 3.817 0 0 45.45 .2817 0
13:03:55 | time:406s total_exs:64 total_steps:4 epochs:0.00 time_left:1503023s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 448.2 1 2048 19.07 1 320.2 .1490 16 8192 48.76 .1062 27.57 4.35 1e-05 479 4.461 0 0 83.14 .2755 0 4 2527 23.53 .009316
msc:Session1Self 149.7 1 21.67 3 13 4.514 0 0 91.26 .2051 0
msc_dialogue_2 385.3 1 257.3 9 32.22 3.806 0 0 44.99 .3414 0
msc_dialogue_3 809.8 1 681.8 4 37.5 4.729 0 0 113.2 .2800 0
13:05:48 | time:519s total_exs:80 total_steps:5 epochs:0.00 time_left:1537141s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 509.5 1 2048 18.16 1 381.5 .1419 16 8192 16.14 .1020 28.24 3.769 1e-05 491 4.355 0 0 43.72 .3179 0 5 2539 22.52 .008871
msc:Session1Self 178 1 50 3 15.67 3.952 0 0 52.03 .2979 0
msc_dialogue_2 442.1 1 314.1 9 33.56 3.661 0 0 38.9 .3179 0
msc_dialogue_3 908.5 1 780.5 4 35.5 3.695 0 0 40.23 .3380 0
13:07:33 | time:624s total_exs:96 total_steps:6 epochs:0.00 time_left:1539989s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 552.5 1 2048 19.56 1 424.5 .1528 16 8192 13.04 .1062 28.03 3.988 1e-05 443 4.231 0 0 68.81 .2940 0 6 2491 23.79 .009553
msc:Session1Self 208.3 1 80.33 3 11.67 4.905 0 0 134.9 .2571 0
msc_dialogue_2 457.9 1 329.9 9 23.67 3.839 0 0 46.5 .2864 0
msc_dialogue_3 991.2 1 863.2 4 48.75 3.218 0 0 24.98 .3385 0
Nvidia-smi
Every 2.0s: nvidia-smi Wed Mar 9 13:08:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:0F:00.0 Off | 0 |
| N/A 27C P0 66W / 400W | 22788MiB / 40536MiB | 27% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:14:00.0 Off | 0 |
| N/A 28C P0 62W / 400W | 7260MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:4A:00.0 Off | 0 |
| N/A 27C P0 64W / 400W | 6320MiB / 40536MiB | 3% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000000:50:00.0 Off | 0 |
| N/A 31C P0 73W / 400W | 6320MiB / 40536MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:93:00.0 Off | 0 |
| N/A 31C P0 70W / 400W | 6320MiB / 40536MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... Off | 00000000:99:00.0 Off | 0 |
| N/A 27C P0 71W / 400W | 6754MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... Off | 00000000:CB:00.0 Off | 0 |
| N/A 28C P0 61W / 400W | 6754MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... Off | 00000000:D0:00.0 Off | 0 |
| N/A 27C P0 58W / 400W | 6978MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 46848 C ...a3/envs/parlai/bin/python 22785MiB |
| 1 N/A N/A 46848 C ...a3/envs/parlai/bin/python 7257MiB |
| 2 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6317MiB |
| 3 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6317MiB |
| 4 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6317MiB |
| 5 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6751MiB |
| 6 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6751MiB |
| 7 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6975MiB |
+-----------------------------------------------------------------------------+
Bug description
When I run the msc project with BlenderBot and swap the generator model to BART, training with --model-parallel is extremely slow. On a single V100 GPU, training takes about 4 days; with --model-parallel across eight V100s, it takes about 30 days. Since the whole model fits on one GPU (about 16000MiB of memory), there is no need to slice it into pieces, so I want to switch from model parallelism to multiprocessing (data parallelism). But when I simply change train_model with --model-parallel to multiprocessing_train, I get an NCCL version error.
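For clarity, this is roughly the invocation I tried: the same command as above, but launched via multiprocessing_train with --model-parallel dropped (I have not verified that this agent supports the plain data-parallel path):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 parlai multiprocessing_train -t msc \
  --model-file /home/sysadmin/fei/ParlAI/log/msc/MemoryLongRagAgent \
  --model projects.msc.agents.memory_agent:MemoryLongRagAgent \
  --generation-model bart --init-opt arch/bart_large \
  --knowledge-access-method memory_only --batchsize 16 -lr 1e-05 --num_epochs 1 \
  --save-after-valid True --validation-every-n-epochs 0.1 --validation-max-exs 20000 \
  --fp16 true --fp16_impl mem_efficient --truncate 128 --label_truncate 128 \
  --log_every_n_steps 1
```

This is where the NCCL version error appears.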
Additional context
I am hoping someone can tell me how to use multiprocessing_train for BlenderBot 2.0. Many thanks!