facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

3B Blender on K80 gives RuntimeError: CUDA error: device-side assert triggered #3375

Closed MikeyBeez closed 3 years ago

MikeyBeez commented 3 years ago

Cheers all. It's always something with me. ;)

Bug description

python parlai/scripts/safe_interactive.py -t blended_skill_talk -mf zoo:blender/blender_3B/model

I get the prompt and enter Hello

RuntimeError: CUDA error: device-side assert triggered

Reproduction steps

Run the command above and enter any message (e.g. "Hello") at the prompt.

Expected behavior

The model should respond to the message instead of crashing with a CUDA error.

Logs

[ loading personas.. ]

[NOTE: In the BST paper both partners have a persona. You can choose to ignore yours, the model never sees it. In the Blender paper, this was not used for humans. You can also turn personas off with --include-personas False]

[context]: your persona: i now live in new mexico.
your persona: i grew up in nevada.
Enter Your Message: Hello
Traceback (most recent call last):
  File "parlai/scripts/safe_interactive.py", line 87, in <module>
    SafeInteractive.main()
  File "/home/bard/ParlAI/parlai/core/script.py", line 111, in main
    return cls._run_args(None)
  File "/home/bard/ParlAI/parlai/core/script.py", line 84, in _run_args
    return cls._run_from_parser_and_opt(opt, parser)
  File "/home/bard/ParlAI/parlai/core/script.py", line 90, in _run_from_parser_and_opt
    return script.run()
  File "parlai/scripts/safe_interactive.py", line 82, in run
    return safe_interactive(self.opt)
  File "parlai/scripts/safe_interactive.py", line 62, in safe_interactive
    world.parley()
  File "/home/bard/ParlAI/parlai/tasks/interactive/worlds.py", line 78, in parley
    acts[1] = agents[1].act()
  File "/home/bard/ParlAI/parlai/core/torch_agent.py", line 1946, in act
    response = self.batch_act([self.observation])[0]
  File "/home/bard/ParlAI/parlai/core/torch_agent.py", line 2007, in batch_act
    output = self.eval_step(batch)
  File "/home/bard/ParlAI/parlai/core/torch_generator_agent.py", line 891, in eval_step
    beam_preds_scores, beams = self._generate(batch, self.beam_size, maxlen)
  File "/home/bard/ParlAI/parlai/core/torch_generator_agent.py", line 1135, in _generate
    score, incr_state = model.decoder(decoder_input, encoder_states, incr_state)
  File "/home/bard/miniconda3/envs/parlai/lib/python3.7/site-packages/torch-1.7.1-py3.7-linux-x86_64.egg/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 888, in forward
    tensor = self.forward_embedding(input, positions)
  File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 810, in forward_embedding
    if positions.max().item() > self.n_positions:
RuntimeError: CUDA error: device-side assert triggered

Additional context

https://i.ytimg.com/vi/CEVaHj73s5g/maxresdefault.jpg

stephenroller commented 3 years ago

Can you try again with CUDA_LAUNCH_BLOCKING=1? Python doesn't give accurate stack traces for CUDA errors except when using this slower debug mode.
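
For example, a minimal sketch assuming a bash shell and the same command as in the report:

export CUDA_LAUNCH_BLOCKING=1   # run kernels synchronously so the Python stack trace points at the failing op
python parlai/scripts/safe_interactive.py -t blended_skill_talk -mf zoo:blender/blender_3B/model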

Typically this error occurs when the model is asked to read/write a sentence that's too long, but we have protections for that. Seeing the true stack trace may help identify.

MikeyBeez commented 3 years ago

I'm guessing that's an environment variable:

export CUDA_LAUNCH_BLOCKING=1 ❯ python parlai/scripts/safe_interactive.py -t blended_skill_talk -mf zoo:blender/blender_3B/model -bs=1 12:21:30 | Overriding opt["task"] to blended_skill_talk (previously: internal:blended_skill_talk,wizard_of_wikipedia,convai2:normalized,empathetic_dialogues) 12:21:30 | Overriding opt["model_file"] to /home/bard/ParlAI/data/models/blender/blender_3B/model (previously: /checkpoint/edinan/20200331/finetune_bst_gen_baseline_convai2_normal/de6/model) 12:21:30 | Loading model with --beam-block-full-context false 12:21:30 | Using CUDA 12:21:30 | loading dictionary from /home/bard/ParlAI/data/models/blender/blender_3B/model.dict 12:21:30 | num words = 8008 12:21:30 | TransformerGenerator: full interactive mode on. 12:21:59 | Total parameters: 2,696,268,800 (2,695,613,440 trainable) 12:21:59 | Loading existing model params from /home/bard/ParlAI/data/models/blender/blender_3B/model 12:22:03 | Opt: 12:22:03 | activation: gelu 12:22:03 | adafactor_eps: '[1e-30, 0.001]' 12:22:03 | adam_eps: 1e-08 12:22:03 | add_p1_after_newln: False 12:22:03 | aggregate_micro: False 12:22:03 | allow_missing_init_opts: False 12:22:03 | attention_dropout: 0.0 12:22:03 | batchsize: 128 12:22:03 | beam_block_full_context: False 12:22:03 | beam_block_list_filename: None 12:22:03 | beam_block_ngram: 3 12:22:03 | beam_context_block_ngram: 3 12:22:03 | beam_delay: 30 12:22:03 | beam_length_penalty: 0.65 12:22:03 | beam_min_length: 20 12:22:03 | beam_size: 10 12:22:03 | betas: '[0.9, 0.999]' 12:22:03 | bpe_add_prefix_space: True 12:22:03 | bpe_debug: False 12:22:03 | bpe_dropout: None 12:22:03 | bpe_merge: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict-merges.txt 12:22:03 | bpe_vocab: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict-vocab.json 12:22:03 | compute_tokenized_bleu: False 12:22:03 | datapath: /home/bard/ParlAI/data 12:22:03 | datatype: train 12:22:03 | delimiter: ' ' 12:22:03 | dict_class: parlai.core.dict:DictionaryAgent 12:22:03 | dict_endtoken: end 12:22:03 | dict_file: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict 12:22:03 | dict_include_test: False 12:22:03 | dict_include_valid: False 12:22:03 | dict_initpath: None 12:22:03 | dict_language: english 12:22:03 | dict_loaded: True 12:22:03 | dict_lower: False 12:22:03 | dict_max_ngram_size: -1 12:22:03 | dict_maxexs: -1 12:22:03 | dict_maxtokens: -1 12:22:03 | dict_minfreq: 0 12:22:03 | dict_nulltoken: null 12:22:03 | dict_starttoken: start 12:22:03 | dict_textfields: text,labels 12:22:03 | dict_tokenizer: bytelevelbpe 12:22:03 | dict_unktoken: unk 12:22:03 | display_add_fields: 12:22:03 | display_examples: False 12:22:03 | display_partner_persona: True 12:22:03 | display_prettify: False 12:22:03 | download_path: None 12:22:03 | dropout: 0.1 12:22:03 | dynamic_batching: None 12:22:03 | embedding_projection: random 12:22:03 | embedding_size: 2560 12:22:03 | embedding_type: random 12:22:03 | embeddings_scale: True 12:22:03 | eval_batchsize: None 12:22:03 | evaltask: None 12:22:03 | ffn_size: 10240 12:22:03 | force_fp16_tokens: True 12:22:03 | fp16: True 12:22:03 | fp16_impl: mem_efficient 12:22:03 | gpu: -1 12:22:03 | gradient_clip: 0.1 12:22:03 | hide_labels: False 12:22:03 | history_add_global_end_token: end 12:22:03 | history_reversed: False 12:22:03 | history_size: -1 12:22:03 | image_cropsize: 224 12:22:03 | image_mode: raw 12:22:03 | image_size: 256 12:22:03 | include_checked_sentence: True 12:22:03 | include_initial_utterances: False 12:22:03 | include_knowledge: True 12:22:03 | 
include_knowledge_separator: False 12:22:03 | include_personas: True 12:22:03 | inference: beam 12:22:03 | init_model: /checkpoint/parlai/zoo/meena/20200319_meenav0data_tall_2.7B_adamoptimizer/20200319_13.3ppl_200kupdates/model 12:22:03 | init_opt: None 12:22:03 | interactive_mode: True 12:22:03 | interactive_task: True 12:22:03 | invsqrt_lr_decay_gamma: -1 12:22:03 | label_truncate: 128 12:22:03 | label_type: response 12:22:03 | learn_positional_embeddings: False 12:22:03 | learningrate: 7e-06 12:22:03 | local_human_candidates_file: None 12:22:03 | log_every_n_secs: 10.0 12:22:03 | loglevel: info 12:22:03 | lr_scheduler: reduceonplateau 12:22:03 | lr_scheduler_decay: 0.5 12:22:03 | lr_scheduler_patience: 3 12:22:03 | max_lr_steps: -1 12:22:03 | max_train_time: 27647.999999999996 12:22:03 | metrics: default 12:22:03 | model: transformer/generator 12:22:03 | model_file: /home/bard/ParlAI/data/models/blender/blender_3B/model 12:22:03 | model_parallel: True 12:22:03 | momentum: 0 12:22:03 | multitask_weights: '[1.0, 3.0, 3.0, 3.0]' 12:22:03 | n_decoder_layers: 24 12:22:03 | n_encoder_layers: 2 12:22:03 | n_heads: 32 12:22:03 | n_layers: 2 12:22:03 | n_positions: 128 12:22:03 | n_segments: 0 12:22:03 | nesterov: True 12:22:03 | no_cuda: False 12:22:03 | num_epochs: -1 12:22:03 | num_topics: 5 12:22:03 | numthreads: 1 12:22:03 | nus: [0.7] 12:22:03 | optimizer: mem_eff_adam 12:22:03 | output_scaling: 1.0 12:22:03 | override: "{'task': 'blended_skill_talk', 'model_file': '/home/bard/ParlAI/data/models/blender/blender_3B/model'}" 12:22:03 | parlai_home: /checkpoint/edinan/20200331/finetune_bst_gen_baseline_convai2_normal/ParlAI 12:22:03 | person_tokens: False 12:22:03 | rank_candidates: False 12:22:03 | relu_dropout: 0.0 12:22:03 | remove_political_convos: False 12:22:03 | safe_personas_only: True 12:22:03 | safety: all 12:22:03 | save_after_valid: True 12:22:03 | save_every_n_secs: -1 12:22:03 | share_word_embeddings: True 12:22:03 | short_final_eval: False 12:22:03 | show_advanced_args: False 12:22:03 | single_turn: False 12:22:03 | skip_generation: False 12:22:03 | special_tok_lst: None 12:22:03 | split_lines: False 12:22:03 | starttime: Mar31_06-04 12:22:03 | task: blended_skill_talk 12:22:03 | temperature: 1.0 12:22:03 | tensorboard_log: False 12:22:03 | text_truncate: 128 12:22:03 | topk: 10 12:22:03 | topp: 0.9 12:22:03 | train_experiencer_only: False 12:22:03 | truncate: 128 12:22:03 | update_freq: 2 12:22:03 | use_reply: label 12:22:03 | validation_cutoff: 1.0 12:22:03 | validation_every_n_epochs: 0.25 12:22:03 | validation_every_n_secs: -1 12:22:03 | validation_max_exs: -1 12:22:03 | validation_metric: ppl 12:22:03 | validation_metric_mode: min 12:22:03 | validation_patience: 10 12:22:03 | validation_share_agent: False 12:22:03 | variant: prelayernorm 12:22:03 | verbose: False 12:22:03 | warmup_rate: 0.0001 12:22:03 | warmup_updates: 100 12:22:03 | weight_decay: None 12:22:03 | Current ParlAI commit: 5104b2b954808ba4d0b92271dea0e771ace2924f Enter [DONE] if you want to end the episode, [EXIT] to quit. 
12:22:03 | Overriding opt["model"] to transformer/classifier (previously: transformer_classifier) 12:22:03 | Overriding opt["model_file"] to /home/bard/ParlAI/data/models/dialogue_safety/single_turn/model (previously: /checkpoint/edinan/20190828/safety_reddit/contiguous-dropout=0_multitask-weights=0.5,0.1,0.1,0.4,0.2_lr=5e-05_lr-scheduler-patience=3_lr-scheduler-decay=0.9_warmupupdates=1000/model) 12:22:03 | Overriding opt["print_scores"] to True (previously: False) 12:22:03 | Overriding opt["data_parallel"] to False (previously: True) 12:22:03 | Using CUDA 12:22:03 | loading dictionary from /home/bard/ParlAI/data/models/dialogue_safety/single_turn/model.dict 12:22:03 | num words = 54944 12:22:05 | Loading existing model parameters from /home/bard/ParlAI/data/models/dialogue_safety/single_turn/model 12:22:06 | Total parameters: 128,042,498 (128,042,498 trainable) 12:22:06 | creating task(s): blended_skill_talk [ loading personas.. ]

[NOTE: In the BST paper both partners have a persona. You can choose to ignore yours, the model never sees it. In the Blender paper, this was not used for humans. You can also turn personas off with --include-personas False]

[context]: your persona: i am a registered nurse. your persona: my favorite movie is pretty woman. Enter Your Message: Hello /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed. 
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [35,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [105,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [106,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed. 
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [109,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [110,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [111,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [112,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [113,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [114,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [115,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [116,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [117,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [118,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [119,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [120,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [121,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [122,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [123,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [124,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [20,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed. 
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [105,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [106,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [109,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [110,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [111,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [112,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [113,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [114,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [115,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [116,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [117,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [118,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [119,0,0] Assertion srcIndex < srcSelectDimSize failed. 
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [120,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [121,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [122,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [123,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [124,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed. /pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: indexSelectSmallIndex: block: [23,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed. Traceback (most recent call last): File "parlai/scripts/safe_interactive.py", line 87, in SafeInteractive.main() File "/home/bard/ParlAI/parlai/core/script.py", line 111, in main return cls._run_args(None) File "/home/bard/ParlAI/parlai/core/script.py", line 84, in _run_args return cls._run_from_parser_and_opt(opt, parser) File "/home/bard/ParlAI/parlai/core/script.py", line 90, in _run_from_parser_and_opt return script.run() File "parlai/scripts/safe_interactive.py", line 82, in run return safe_interactive(self.opt) File "parlai/scripts/safe_interactive.py", line 62, in safe_interactive world.parley() File "/home/bard/ParlAI/parlai/tasks/interactive/worlds.py", line 78, in parley acts[1] = agents[1].act() File "/home/bard/ParlAI/parlai/core/torch_agent.py", line 1946, in act response = self.batch_act([self.observation])[0] File "/home/bard/ParlAI/parlai/core/torch_agent.py", line 2007, in batch_act output = self.eval_step(batch) File "/home/bard/ParlAI/parlai/core/torch_generator_agent.py", line 891, in eval_step beam_preds_scores, beams = self._generate(batch, self.beam_size, maxlen) File "/home/bard/ParlAI/parlai/core/torch_generator_agent.py", line 1167, in _generate incr_state, incr_state_inds File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 1212, in reorder_decoder_incremental_state for idx, layer in enumerate(self.decoder.layers) File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 1212, in for idx, layer in enumerate(self.decoder.layers) File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 1066, in reorder_incremental_state for attn_type, attn in attn_types.items() File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 1066, in for attn_type, attn in attn_types.items() File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 1457, in reorder_incremental_state for key, val in incremental_state.items() File "/home/bard/ParlAI/parlai/agents/transformer/modules.py", line 1457, in for key, val in incremental_state.items() RuntimeError: CUDA error: device-side assert triggered ╭─░▒▓  ──────────────────────────────────────────────────────── 1 ✘  took 54s    ▓▒░─╮ ├─░▒▓ on   master !1 ?6   ▼  parlai   at 12:22:23 PM  ▓▒░─┤ ├─░▒▓  ~/ParlAI  ─┤ ╰─❯

stephenroller commented 3 years ago

Thanks, that's a very different part of the code, so I'm glad we have the true stack trace.

MikeyBeez commented 3 years ago

It's my pleasure. Please let me know if I start becoming a pain. I don't have certain filters or restraints.

stephenroller commented 3 years ago

Can you try just a run of display_model with the same arguments instead of safe_interactive?

Do you have a second GPU by chance? If so, can you try adding --model-parallel true?
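
Concretely, something like the following, a sketch that reuses the task and model-file flags from your original command; the exact flags may need adjusting for your setup:

# sanity-check the model outside interactive mode
python parlai/scripts/display_model.py -t blended_skill_talk -mf zoo:blender/blender_3B/model
# then, with a second GPU visible, split the model across devices
python parlai/scripts/safe_interactive.py -t blended_skill_talk -mf zoo:blender/blender_3B/model --model-parallel true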

MikeyBeez commented 3 years ago

I have a strange setup with three GPUs. I have a 1050 Ti with 4 GB of memory, and I have a Tesla K80 with 24 GB. The K80 has two GPU chips. I set export CUDA_VISIBLE_DEVICES=2,1, so I don't use the 1050. display_model.py works. I have to be careful with some parallel settings because the 1050 gets grabbed and I get an out-of-memory error, but --model-parallel true works fine here. Here's the output:
❯ python parlai/scripts/display_model.py -t blended_skill_talk -mf zoo:blender/blender_3B/model -bs=1 11:49:40 | Overriding opt["task"] to blended_skill_talk (previously: internal:blended_skill_talk,wizard_of_wikipedia,convai2:normalized,empathetic_dialogues) 11:49:40 | Overriding opt["model_file"] to /home/bard/ParlAI/data/models/blender/blender_3B/model (previously: /checkpoint/edinan/20200331/finetune_bst_gen_baseline_convai2_normal/de6/model) 11:49:40 | Loading model with --beam-block-full-context false 11:49:40 | Using CUDA 11:49:40 | loading dictionary from /home/bard/ParlAI/data/models/blender/blender_3B/model.dict 11:49:40 | num words = 8008 11:50:10 | Total parameters: 2,696,268,800 (2,695,613,440 trainable) 11:50:11 | Loading existing model params from /home/bard/ParlAI/data/models/blender/blender_3B/model 11:50:14 | creating task(s): blended_skill_talk 11:50:14 | Loading ParlAI text data: /home/bard/ParlAI/data/blended_skill_talk/valid.txt 11:50:14 | Opt: 11:50:14 | activation: gelu 11:50:14 | adafactor_eps: '[1e-30, 0.001]' 11:50:14 | adam_eps: 1e-08 11:50:14 | add_p1_after_newln: False 11:50:14 | aggregate_micro: False 11:50:14 | allow_missing_init_opts: False 11:50:14 | attention_dropout: 0.0 11:50:14 | batchsize: 128 11:50:14 | beam_block_full_context: False 11:50:14 | beam_block_list_filename: None 11:50:14 | beam_block_ngram: 3 11:50:14 | beam_context_block_ngram: 3 11:50:14 | beam_delay: 30 11:50:14 | beam_length_penalty: 0.65 11:50:14 | beam_min_length: 20 11:50:14 | beam_size: 10 11:50:14 | betas: '[0.9, 0.999]' 11:50:14 | bpe_add_prefix_space: True 11:50:14 | bpe_debug: False 11:50:14 | bpe_dropout: None 11:50:14 | bpe_merge: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict-merges.txt 11:50:14 | bpe_vocab: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict-vocab.json 11:50:14 | compute_tokenized_bleu: False 11:50:14 | datapath: /home/bard/ParlAI/data 11:50:14 | datatype: train 11:50:14 | delimiter: ' ' 11:50:14 | dict_class: parlai.core.dict:DictionaryAgent 11:50:14 | dict_endtoken: end 11:50:14 | dict_file: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict 11:50:14 | dict_include_test: False 11:50:14 | dict_include_valid: False 11:50:14 | dict_initpath: None 11:50:14 | dict_language: english 11:50:14 | dict_loaded: True 11:50:14 | dict_lower: False 11:50:14 | dict_max_ngram_size: -1 11:50:14 | dict_maxexs: -1 11:50:14 | dict_maxtokens: -1 11:50:14 | dict_minfreq: 0 11:50:14 | dict_nulltoken: null 11:50:14 | dict_starttoken: start 11:50:14 | dict_textfields: text,labels 11:50:14 | dict_tokenizer: bytelevelbpe 11:50:14 | dict_unktoken: unk 11:50:14 | display_add_fields: 11:50:14 | display_examples: False 11:50:14 | download_path: None 11:50:14 | dropout: 0.1 11:50:14 | dynamic_batching: None 11:50:14 | embedding_projection: random 11:50:14 | embedding_size: 2560 11:50:14 | embedding_type: random 11:50:14 | embeddings_scale: True 11:50:14 | eval_batchsize: None 11:50:14 | evaltask: None 11:50:14 | ffn_size: 10240 11:50:14 | force_fp16_tokens: True 11:50:14 | fp16: True 11:50:14 | fp16_impl: mem_efficient 11:50:14 | gpu: -1 11:50:14 | gradient_clip: 0.1 11:50:14 | hide_labels: False 11:50:14 | history_add_global_end_token: end 11:50:14 | history_reversed: False 11:50:14 | history_size: -1 11:50:14 | image_cropsize: 224 11:50:14 | image_mode: raw 11:50:14 | image_size: 256 11:50:14 | include_checked_sentence: True 11:50:14 | include_knowledge: True 11:50:14 | include_knowledge_separator: False 11:50:14 | inference: beam 11:50:14 | init_model: 
/checkpoint/parlai/zoo/meena/20200319_meenav0data_tall_2.7B_adamoptimizer/20200319_13.3ppl_200kupdates/model 11:50:14 | init_opt: None 11:50:14 | interactive_mode: False 11:50:14 | invsqrt_lr_decay_gamma: -1 11:50:14 | label_truncate: 128 11:50:14 | label_type: response 11:50:14 | learn_positional_embeddings: False 11:50:14 | learningrate: 7e-06 11:50:14 | log_every_n_secs: 10.0 11:50:14 | loglevel: info 11:50:14 | lr_scheduler: reduceonplateau 11:50:14 | lr_scheduler_decay: 0.5 11:50:14 | lr_scheduler_patience: 3 11:50:14 | max_lr_steps: -1 11:50:14 | max_train_time: 27647.999999999996 11:50:14 | metrics: default 11:50:14 | model: transformer/generator 11:50:14 | model_file: /home/bard/ParlAI/data/models/blender/blender_3B/model 11:50:14 | model_parallel: True 11:50:14 | momentum: 0 11:50:14 | multitask_weights: '[1.0, 3.0, 3.0, 3.0]' 11:50:14 | n_decoder_layers: 24 11:50:14 | n_encoder_layers: 2 11:50:14 | n_heads: 32 11:50:14 | n_layers: 2 11:50:14 | n_positions: 128 11:50:14 | n_segments: 0 11:50:14 | nesterov: True 11:50:14 | no_cuda: False 11:50:14 | num_epochs: -1 11:50:14 | num_examples: 10 11:50:14 | num_topics: 5 11:50:14 | numthreads: 1 11:50:14 | nus: [0.7] 11:50:14 | optimizer: mem_eff_adam 11:50:14 | output_scaling: 1.0 11:50:14 | override: "{'task': 'blended_skill_talk', 'model_file': '/home/bard/ParlAI/data/models/blender/blender_3B/model'}" 11:50:14 | parlai_home: /checkpoint/edinan/20200331/finetune_bst_gen_baseline_convai2_normal/ParlAI 11:50:14 | person_tokens: False 11:50:14 | rank_candidates: False 11:50:14 | relu_dropout: 0.0 11:50:14 | remove_political_convos: False 11:50:14 | save_after_valid: True 11:50:14 | save_every_n_secs: -1 11:50:14 | share_word_embeddings: True 11:50:14 | short_final_eval: False 11:50:14 | show_advanced_args: False 11:50:14 | skip_generation: False 11:50:14 | special_tok_lst: None 11:50:14 | split_lines: False 11:50:14 | starttime: Mar31_06-04 11:50:14 | task: blended_skill_talk 11:50:14 | temperature: 1.0 11:50:14 | tensorboard_log: False 11:50:14 | text_truncate: 128 11:50:14 | topk: 10 11:50:14 | topp: 0.9 11:50:14 | train_experiencer_only: False 11:50:14 | truncate: 128 11:50:14 | update_freq: 2 11:50:14 | use_reply: label 11:50:14 | validation_cutoff: 1.0 11:50:14 | validation_every_n_epochs: 0.25 11:50:14 | validation_every_n_secs: -1 11:50:14 | validation_max_exs: -1 11:50:14 | validation_metric: ppl 11:50:14 | validation_metric_mode: min 11:50:14 | validation_patience: 10 11:50:14 | validation_share_agent: False 11:50:14 | variant: prelayernorm 11:50:14 | verbose: False 11:50:14 | warmup_rate: 0.0001 11:50:14 | warmup_updates: 100 11:50:14 | weight_decay: None 11:50:14 | Current ParlAI commit: 4fd58a3ed7ea9dac692abf6a9981219c8ef5b7bd

MikeyBeez commented 3 years ago

BTW, I interrupted that run. Then I re-ran it with --model-parallel true, and it ran fine to the end.

MikeyBeez commented 3 years ago

Here's the same job with export CUDA_LAUNCH_BLOCKING=1

❯ python parlai/scripts/display_model.py -t blended_skill_talk -mf zoo:blender/blender_3B/model -bs=1 --model-parallel true 12:05:55 | Overriding opt["task"] to blended_skill_talk (previously: internal:blended_skill_talk,wizard_of_wikipedia,convai2:normalized,empathetic_dialogues) 12:05:55 | Overriding opt["model_file"] to /home/bard/ParlAI/data/models/blender/blender_3B/model (previously: /checkpoint/edinan/20200331/finetune_bst_gen_baseline_convai2_normal/de6/model) 12:05:55 | Loading model with --beam-block-full-context false 12:05:55 | Using CUDA 12:05:55 | loading dictionary from /home/bard/ParlAI/data/models/blender/blender_3B/model.dict 12:05:55 | num words = 8008 12:06:24 | Total parameters: 2,696,268,800 (2,695,613,440 trainable) 12:06:25 | Loading existing model params from /home/bard/ParlAI/data/models/blender/blender_3B/model 12:06:27 | creating task(s): blended_skill_talk 12:06:27 | Loading ParlAI text data: /home/bard/ParlAI/data/blended_skill_talk/valid.txt 12:06:27 | Opt: 12:06:27 | activation: gelu 12:06:27 | adafactor_eps: '[1e-30, 0.001]' 12:06:27 | adam_eps: 1e-08 12:06:27 | add_p1_after_newln: False 12:06:27 | aggregate_micro: False 12:06:27 | allow_missing_init_opts: False 12:06:27 | attention_dropout: 0.0 12:06:27 | batchsize: 128 12:06:27 | beam_block_full_context: False 12:06:27 | beam_block_list_filename: None 12:06:27 | beam_block_ngram: 3 12:06:27 | beam_context_block_ngram: 3 12:06:27 | beam_delay: 30 12:06:27 | beam_length_penalty: 0.65 12:06:27 | beam_min_length: 20 12:06:27 | beam_size: 10 12:06:27 | betas: '[0.9, 0.999]' 12:06:27 | bpe_add_prefix_space: True 12:06:27 | bpe_debug: False 12:06:27 | bpe_dropout: None 12:06:27 | bpe_merge: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict-merges.txt 12:06:27 | bpe_vocab: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict-vocab.json 12:06:27 | compute_tokenized_bleu: False 12:06:27 | datapath: /home/bard/ParlAI/data 12:06:27 | datatype: train 12:06:27 | delimiter: ' ' 12:06:27 | dict_class: parlai.core.dict:DictionaryAgent 12:06:27 | dict_endtoken: end 12:06:27 | dict_file: /home/bard/ParlAI/data/models/blender/blender_3B/model.dict 12:06:27 | dict_include_test: False 12:06:27 | dict_include_valid: False 12:06:27 | dict_initpath: None 12:06:27 | dict_language: english 12:06:27 | dict_loaded: True 12:06:27 | dict_lower: False 12:06:27 | dict_max_ngram_size: -1 12:06:27 | dict_maxexs: -1 12:06:27 | dict_maxtokens: -1 12:06:27 | dict_minfreq: 0 12:06:27 | dict_nulltoken: null 12:06:27 | dict_starttoken: start 12:06:27 | dict_textfields: text,labels 12:06:27 | dict_tokenizer: bytelevelbpe 12:06:27 | dict_unktoken: unk 12:06:27 | display_add_fields: 12:06:27 | display_examples: False 12:06:27 | download_path: None 12:06:27 | dropout: 0.1 12:06:27 | dynamic_batching: None 12:06:27 | embedding_projection: random 12:06:27 | embedding_size: 2560 12:06:27 | embedding_type: random 12:06:27 | embeddings_scale: True 12:06:27 | eval_batchsize: None 12:06:27 | evaltask: None 12:06:27 | ffn_size: 10240 12:06:27 | force_fp16_tokens: True 12:06:27 | fp16: True 12:06:27 | fp16_impl: mem_efficient 12:06:27 | gpu: -1 12:06:27 | gradient_clip: 0.1 12:06:27 | hide_labels: False 12:06:27 | history_add_global_end_token: end 12:06:27 | history_reversed: False 12:06:27 | history_size: -1 12:06:27 | image_cropsize: 224 12:06:27 | image_mode: raw 12:06:27 | image_size: 256 12:06:27 | include_checked_sentence: True 12:06:27 | include_knowledge: True 12:06:27 | include_knowledge_separator: False 12:06:27 | inference: beam 
12:06:27 | init_model: /checkpoint/parlai/zoo/meena/20200319_meenav0data_tall_2.7B_adamoptimizer/20200319_13.3ppl_200kupdates/model 12:06:27 | init_opt: None 12:06:27 | interactive_mode: False 12:06:27 | invsqrt_lr_decay_gamma: -1 12:06:27 | label_truncate: 128 12:06:27 | label_type: response 12:06:27 | learn_positional_embeddings: False 12:06:27 | learningrate: 7e-06 12:06:27 | log_every_n_secs: 10.0 12:06:27 | loglevel: info 12:06:27 | lr_scheduler: reduceonplateau 12:06:27 | lr_scheduler_decay: 0.5 12:06:27 | lr_scheduler_patience: 3 12:06:27 | max_lr_steps: -1 12:06:27 | max_train_time: 27647.999999999996 12:06:27 | metrics: default 12:06:27 | model: transformer/generator 12:06:27 | model_file: /home/bard/ParlAI/data/models/blender/blender_3B/model 12:06:27 | model_parallel: True 12:06:27 | momentum: 0 12:06:27 | multitask_weights: '[1.0, 3.0, 3.0, 3.0]' 12:06:27 | n_decoder_layers: 24 12:06:27 | n_encoder_layers: 2 12:06:27 | n_heads: 32 12:06:27 | n_layers: 2 12:06:27 | n_positions: 128 12:06:27 | n_segments: 0 12:06:27 | nesterov: True 12:06:27 | no_cuda: False 12:06:27 | num_epochs: -1 12:06:27 | num_examples: 10 12:06:27 | num_topics: 5 12:06:27 | numthreads: 1 12:06:27 | nus: [0.7] 12:06:27 | optimizer: mem_eff_adam 12:06:27 | output_scaling: 1.0 12:06:27 | override: "{'task': 'blended_skill_talk', 'model_file': '/home/bard/ParlAI/data/models/blender/blender_3B/model', 'model_parallel': True}" 12:06:27 | parlai_home: /checkpoint/edinan/20200331/finetune_bst_gen_baseline_convai2_normal/ParlAI 12:06:27 | person_tokens: False 12:06:27 | rank_candidates: False 12:06:27 | relu_dropout: 0.0 12:06:27 | remove_political_convos: False 12:06:27 | save_after_valid: True 12:06:27 | save_every_n_secs: -1 12:06:27 | share_word_embeddings: True 12:06:27 | short_final_eval: False 12:06:27 | show_advanced_args: False 12:06:27 | skip_generation: False 12:06:27 | special_tok_lst: None 12:06:27 | split_lines: False 12:06:27 | starttime: Mar31_06-04 12:06:27 | task: blended_skill_talk 12:06:27 | temperature: 1.0 12:06:27 | tensorboard_log: False 12:06:27 | text_truncate: 128 12:06:27 | topk: 10 12:06:27 | topp: 0.9 12:06:27 | train_experiencer_only: False 12:06:27 | truncate: 128 12:06:27 | update_freq: 2 12:06:27 | use_reply: label 12:06:27 | validation_cutoff: 1.0 12:06:27 | validation_every_n_epochs: 0.25 12:06:27 | validation_every_n_secs: -1 12:06:27 | validation_max_exs: -1 12:06:27 | validation_metric: ppl 12:06:27 | validation_metric_mode: min 12:06:27 | validation_patience: 10 12:06:27 | validation_share_agent: False 12:06:27 | variant: prelayernorm 12:06:27 | verbose: False 12:06:27 | warmup_rate: 0.0001 12:06:27 | warmup_updates: 100 12:06:27 | weight_decay: None 12:06:28 | Current ParlAI commit: 4fd58a3ed7ea9dac692abf6a9981219c8ef5b7bd

stephenroller commented 3 years ago

I'm going to close this issue since it seems like --model-parallel true with careful options works well. Reopen if you have further questions.

While it's fine to have a heterogeneous setup, our implementation assumes a homogeneous one, and may therefore distribute weights suboptimally across the devices.
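
For anyone landing here with a similar multi-GPU box, the configuration that worked in this thread was roughly the following sketch; the device indices are specific to the reporter's machine:

export CUDA_VISIBLE_DEVICES=2,1   # hide the small 4 GB card and expose only the two K80 chips (indices are machine-specific)
python parlai/scripts/display_model.py -t blended_skill_talk -mf zoo:blender/blender_3B/model --model-parallel true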

MikeyBeez commented 3 years ago

Thank you, Stephen. I just returned today from a business trip to Arkansas. I think it's fine to close this.