Closed lorafei closed 2 years ago
Something is wrong with your environment
Today I rebuilt the environment and ran:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 parlai tm -t msc \
--model-file /home/sysadmin/fei/ParlAI/log/msc/MemoryLongRagAgent \
--model projects.msc.agents.memory_agent:MemoryLongRagAgent \
--generation-model bart --init-opt arch/bart_large \
--knowledge-access-method memory_only --batchsize 16 -lr 1e-05 --num_epochs 1 \
--save-after-valid True --validation-every-n-epochs 0.1 --validation-max-exs 20000 \
--fp16 true --fp16_impl mem_efficient --truncate 128 --label_truncate 128 \
--log_every_n_steps 1 --model-parallel true
Now I am using --model-parallel on eight A100 GPUs, and training is extremely slow. Logs:
12:57:09 | training...
/home/sysadmin/fei/ParlAI/parlai/core/torch_generator_agent.py:1749: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
hyp_ids = best_idxs // voc_size
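(Side note on the warning above: it only matters for negative operands, since Python/PyTorch `//` floors while the deprecated behavior truncates toward zero. Beam-search indices like `best_idxs` are non-negative, so the two agree there. A small plain-Python illustration of the distinction `torch.div(..., rounding_mode=...)` makes explicit:)

```python
import math

a, b = -7, 2

floor_div = a // b             # `//` always floors: -7 // 2 == -4
trunc_div = math.trunc(a / b)  # truncation rounds toward zero: -3

print(floor_div, trunc_div)

# For non-negative values (like beam-search indices) both conventions agree:
assert 13 // 5 == math.trunc(13 / 5) == 2
```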
12:58:45 | time:96s total_exs:16 total_steps:1 epochs:0.00 time_left:1414736s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 290.7 1 1838 19.3 .6667 186.7 .1680 16 16384 inf .07243 17.07 4.042 1e-05 283 2.971 0 0 57.5 .3212 0 1 2121 22.27 .0105
msc:Session1Self 56 0 0 3 13 4.165 0 0 64.39 .3590 0
msc_dialogue_2 223.3 1 95.33 9 18.22 3.844 0 0 46.73 .3171 0
msc_dialogue_3 592.8 1 464.8 4 20 4.117 0 0 61.38 .2875 0
13:00:24 | time:195s total_exs:32 total_steps:2 epochs:0.00 time_left:1441241s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 328.7 1 1921 19.43 .6667 215.5 .1618 16 8192 inf .1102 29.91 4.371 1e-05 500 5.057 0 0 82.24 .3138 0 2 2421 24.48 .01012
msc:Session1Self 83.67 0 0 3 15 4.139 0 0 62.75 .4444 0
msc_dialogue_2 262.4 1 134.4 9 31.22 4.219 0 0 67.94 .2384 0
msc_dialogue_3 640 1 512 4 43.5 4.754 0 0 116 .2586 0
13:02:08 | time:298s total_exs:48 total_steps:3 epochs:0.00 time_left:1472768s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 389.9 1 2021 19.53 .6667 265.5 .1546 16 8192 81.18 .1278 30.91 4.358 1e-05 563 5.442 0 0 83.61 .2602 0 3 2584 24.98 .009668
msc:Session1Self 117 0 0 3 15.67 4.709 0 0 110.9 .2128 0
msc_dialogue_2 319.6 1 191.6 9 41.56 4.548 0 0 94.44 .2861 0
msc_dialogue_3 733 1 605 4 35.5 3.817 0 0 45.45 .2817 0
13:03:55 | time:406s total_exs:64 total_steps:4 epochs:0.00 time_left:1503023s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 448.2 1 2048 19.07 1 320.2 .1490 16 8192 48.76 .1062 27.57 4.35 1e-05 479 4.461 0 0 83.14 .2755 0 4 2527 23.53 .009316
msc:Session1Self 149.7 1 21.67 3 13 4.514 0 0 91.26 .2051 0
msc_dialogue_2 385.3 1 257.3 9 32.22 3.806 0 0 44.99 .3414 0
msc_dialogue_3 809.8 1 681.8 4 37.5 4.729 0 0 113.2 .2800 0
13:05:48 | time:519s total_exs:80 total_steps:5 epochs:0.00 time_left:1537141s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 509.5 1 2048 18.16 1 381.5 .1419 16 8192 16.14 .1020 28.24 3.769 1e-05 491 4.355 0 0 43.72 .3179 0 5 2539 22.52 .008871
msc:Session1Self 178 1 50 3 15.67 3.952 0 0 52.03 .2979 0
msc_dialogue_2 442.1 1 314.1 9 33.56 3.661 0 0 38.9 .3179 0
msc_dialogue_3 908.5 1 780.5 4 35.5 3.695 0 0 40.23 .3380 0
13:07:33 | time:624s total_exs:96 total_steps:6 epochs:0.00 time_left:1539989s
clen clip ctpb ctps ctrunc ctrunclen exps exs fp16_loss_scalar gnorm gpu_mem llen loss lr ltpb ltps ltrunc ltrunclen ppl token_acc token_em total_train_updates tpb tps ups
all 552.5 1 2048 19.56 1 424.5 .1528 16 8192 13.04 .1062 28.03 3.988 1e-05 443 4.231 0 0 68.81 .2940 0 6 2491 23.79 .009553
msc:Session1Self 208.3 1 80.33 3 11.67 4.905 0 0 134.9 .2571 0
msc_dialogue_2 457.9 1 329.9 9 23.67 3.839 0 0 46.5 .2864 0
msc_dialogue_3 991.2 1 863.2 4 48.75 3.218 0 0 24.98 .3385 0
Nvidia-smi
Every 2.0s: nvidia-smi Wed Mar 9 13:08:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:0F:00.0 Off | 0 |
| N/A 27C P0 66W / 400W | 22788MiB / 40536MiB | 27% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:14:00.0 Off | 0 |
| N/A 28C P0 62W / 400W | 7260MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:4A:00.0 Off | 0 |
| N/A 27C P0 64W / 400W | 6320MiB / 40536MiB | 3% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000000:50:00.0 Off | 0 |
| N/A 31C P0 73W / 400W | 6320MiB / 40536MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:93:00.0 Off | 0 |
| N/A 31C P0 70W / 400W | 6320MiB / 40536MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... Off | 00000000:99:00.0 Off | 0 |
| N/A 27C P0 71W / 400W | 6754MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... Off | 00000000:CB:00.0 Off | 0 |
| N/A 28C P0 61W / 400W | 6754MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... Off | 00000000:D0:00.0 Off | 0 |
| N/A 27C P0 58W / 400W | 6978MiB / 40536MiB | 6% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 46848 C ...a3/envs/parlai/bin/python 22785MiB |
| 1 N/A N/A 46848 C ...a3/envs/parlai/bin/python 7257MiB |
| 2 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6317MiB |
| 3 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6317MiB |
| 4 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6317MiB |
| 5 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6751MiB |
| 6 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6751MiB |
| 7 N/A N/A 46848 C ...a3/envs/parlai/bin/python 6975MiB |
+-----------------------------------------------------------------------------+
Bug description
When I run the msc project with BlenderBot and swap the generator model to BART, training with --model-parallel is extremely slow. On a single V100 GPU, training takes about 4 days; with --model-parallel across eight V100s, it takes about 30 days. Since the whole model fits on one GPU (about 16000MiB of memory), there is no need to slice it into pieces, so I want to switch from model parallelism to multiprocessing (data parallelism). But when I simply change train_model with --model-parallel to multiprocessing_train, I get an NCCL version error.
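For clarity, this is roughly the invocation I tried: the same command as above, but launched via multiprocessing_train with --model-parallel dropped (I have not verified that this agent supports the plain data-parallel path):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 parlai multiprocessing_train -t msc \
  --model-file /home/sysadmin/fei/ParlAI/log/msc/MemoryLongRagAgent \
  --model projects.msc.agents.memory_agent:MemoryLongRagAgent \
  --generation-model bart --init-opt arch/bart_large \
  --knowledge-access-method memory_only --batchsize 16 -lr 1e-05 --num_epochs 1 \
  --save-after-valid True --validation-every-n-epochs 0.1 --validation-max-exs 20000 \
  --fp16 true --fp16_impl mem_efficient --truncate 128 --label_truncate 128 \
  --log_every_n_steps 1
```

This is where the NCCL version error appears.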
Additional context
I am hoping someone can tell me how to use multiprocessing_train for BlenderBot 2.0. Many thanks!