facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

ParlAI display_model not working properly #4985

Closed TheMrguiller closed 1 year ago

TheMrguiller commented 1 year ago

Bug description After training my model, initialized from zoo:tutorial_transformer_generator/model, I wanted to check how it was performing. To do so, I ran the following command:

parlai display_model -t babi:task10k:1 --model-file C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo --skip-generation false

Reproduction steps

Every time I run this command, on any task, I get the same error. At first I thought the problem was my task, but it seems it is not. I trained the model with the following command:

parlai train_model  -t fromfile:parlaiformat --fromfile_datapath "C:\Users\superserver\Desktop\guillermo\Parlai\toxic_fixed.txt" -m transformer/generator --init-model zoo:tutorial_transformer_generator/model --dict-file zoo:tutorial_transformer_generator/model.dict --embedding-size 512 --n-layers 8 --ffn-size 2048 --dropout 0.1 --n-heads 16 --learn-positional-embeddings True --n-positions 512 --variant xlm --activation gelu --skip-generation True --fp16 True --text-truncate 512 --label-truncate 128 --dict-tokenizer bpe --dict-lower True -lr 1e-06 --optimizer adamax --lr-scheduler reduceonplateau --gradient-clip 0.1 -veps 0.25 --betas 0.9,0.999 --update-freq 1 --attention-dropout 0.0 --relu-dropout 0.0 --skip-generation True -vp 15 -stim 60 -vme 20000 -bs 32 -vmt ppl -vmm min --save-after-valid True --model-file C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo --gpu 0 --tensorboard_log True --tensorboard_logdir C:/Users/superserver/Desktop/guillermo/Parlai/tensorboard --seed 42 --num-workers 8

Expected behavior I expected to get some kind of output. If I run the command without --skip-generation false, it works, but I don't get a model response.

Logs Please paste the command line output:

C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\torch\utils\cpp_extension.py:359: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
INFO: Could not find files for the given pattern(s).
12:17:32 | Unable to load ngram blocking on GPU: Command '['where', 'cl']' returned non-zero exit status 1.
12:17:32 | Overriding opt["task"] to babi:task10k:1 (previously: fromfile:parlaiformat)
12:17:32 | Overriding opt["skip_generation"] to False (previously: True)
12:17:32 | Using CUDA
12:17:32 | loading dictionary from C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo.dict
12:17:32 | num words = 54944
12:17:33 | DEPRECATED: XLM should only be used for backwards compatibility, as it involves a less-stable layernorm operation.
12:17:34 | Total parameters: 87,508,992 (87,508,992 trainable)
12:17:34 | Loading existing model params from C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo
12:17:36 | creating task(s): babi:task10k:1
[building data: C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Lib\site-packages\data\bAbI]
12:17:36 | Downloading http://parl.ai/downloads/babi/babi.tar.gz to C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Lib\site-packages\data\bAbI\babi.tar.gz
Downloading babi.tar.gz: 100%|████████████████████████████████████████████████████| 19.2M/19.2M [00:03<00:00, 5.54MB/s]
12:17:42 | Tried to delete C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Lib\site-packages\data\bAbI\babi.tar.gz but got a permission error. This is known to happen in Windows and is probably not fatal.
12:17:42 | loading fbdialog data: C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Lib\site-packages\data\bAbI\tasks_1-20_v1-2\en-valid-10k-nosf\qa1_valid.txt
12:17:42 | Opt:
12:17:42 |     activation: gelu
12:17:42 |     adafactor_eps: '[1e-30, 0.001]'
12:17:42 |     adam_eps: 1e-08
12:17:42 |     add_p1_after_newln: False
12:17:42 |     aggregate_micro: False
12:17:42 |     allow_missing_init_opts: False
12:17:42 |     attention_dropout: 0.0
12:17:42 |     batchsize: 32
12:17:42 |     beam_block_full_context: True
12:17:42 |     beam_block_list_filename: None
12:17:42 |     beam_block_ngram: -1
12:17:42 |     beam_context_block_ngram: -1
12:17:42 |     beam_delay: 30
12:17:42 |     beam_length_penalty: 0.65
12:17:42 |     beam_min_length: 1
12:17:42 |     beam_size: 1
12:17:42 |     betas: '[0.9, 0.999]'
12:17:42 |     bpe_add_prefix_space: None
12:17:42 |     bpe_debug: False
12:17:42 |     bpe_dropout: None
12:17:42 |     bpe_merge: None
12:17:42 |     bpe_vocab: None
12:17:42 |     checkpoint_activations: False
12:17:42 |     compute_tokenized_bleu: False
12:17:42 |     datapath: C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Lib\site-packages\data
12:17:42 |     datatype: train
12:17:42 |     delimiter: '\n'
12:17:42 |     dict_class: parlai.core.dict:DictionaryAgent
12:17:42 |     dict_endtoken: __end__
12:17:42 |     dict_file: C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo.dict
12:17:42 |     dict_include_test: False
12:17:42 |     dict_include_valid: False
12:17:42 |     dict_initpath: None
12:17:42 |     dict_language: english
12:17:42 |     dict_loaded: True
12:17:42 |     dict_lower: True
12:17:42 |     dict_max_ngram_size: -1
12:17:42 |     dict_maxexs: -1
12:17:42 |     dict_maxtokens: -1
12:17:42 |     dict_minfreq: 0
12:17:42 |     dict_nulltoken: __null__
12:17:42 |     dict_starttoken: __start__
12:17:42 |     dict_textfields: text,labels
12:17:42 |     dict_tokenizer: bpe
12:17:42 |     dict_unktoken: __unk__
12:17:42 |     display_add_fields:
12:17:42 |     display_examples: False
12:17:42 |     download_path: None
12:17:42 |     dropout: 0.1
12:17:42 |     dynamic_batching: None
12:17:42 |     embedding_projection: random
12:17:42 |     embedding_size: 512
12:17:42 |     embedding_type: random
12:17:42 |     embeddings_scale: True
12:17:42 |     eval_batchsize: None
12:17:42 |     eval_dynamic_batching: None
12:17:42 |     evaltask: None
12:17:42 |     ffn_size: 2048
12:17:42 |     final_extra_opt:
12:17:42 |     force_fp16_tokens: True
12:17:42 |     fp16: True
12:17:42 |     fp16_impl: safe
12:17:42 |     fromfile_datapath: C:\Users\superserver\Desktop\guillermo\Parlai\toxic_fixed.txt
12:17:42 |     fromfile_datatype_extension: False
12:17:42 |     gpu: 0
12:17:42 |     gpu_beam_blocking: False
12:17:42 |     gradient_clip: 0.1
12:17:42 |     hide_labels: False
12:17:42 |     history_add_global_end_token: None
12:17:42 |     history_reversed: False
12:17:42 |     history_size: -1
12:17:42 |     image_cropsize: 224
12:17:42 |     image_mode: raw
12:17:42 |     image_size: 256
12:17:42 |     inference: greedy
12:17:42 |     init_model: C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Lib\site-packages\data\models\tutorial_transformer_generator/model
12:17:42 |     init_opt: None
12:17:42 |     interactive_mode: False
12:17:42 |     invsqrt_lr_decay_gamma: -1
12:17:42 |     is_debug: False
12:17:42 |     label_truncate: 128
12:17:42 |     learn_positional_embeddings: True
12:17:42 |     learningrate: 1e-06
12:17:42 |     log_every_n_secs: -1
12:17:42 |     log_every_n_steps: 50
12:17:42 |     log_keep_fields: all
12:17:42 |     loglevel: info
12:17:42 |     lr_scheduler: reduceonplateau
12:17:42 |     lr_scheduler_decay: 0.5
12:17:42 |     lr_scheduler_patience: 3
12:17:42 |     max_train_steps: -1
12:17:42 |     max_train_time: -1
12:17:42 |     metrics: default
12:17:42 |     model: transformer/generator
12:17:42 |     model_file: C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo
12:17:42 |     model_parallel: False
12:17:42 |     momentum: 0
12:17:42 |     multitask_weights: [1]
12:17:42 |     mutators: None
12:17:42 |     n_decoder_layers: -1
12:17:42 |     n_encoder_layers: -1
12:17:42 |     n_heads: 16
12:17:42 |     n_layers: 8
12:17:42 |     n_positions: 512
12:17:42 |     n_segments: 0
12:17:42 |     nesterov: True
12:17:42 |     no_cuda: False
12:17:42 |     num_epochs: -1
12:17:42 |     num_examples: 10
12:17:42 |     num_workers: 0
12:17:42 |     nus: [0.7]
12:17:42 |     optimizer: adamax
12:17:42 |     output_scaling: 1.0
12:17:42 |     override: "{'task': 'babi:task10k:1', 'model_file': 'C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo', 'skip_generation': False}"
12:17:42 |     parlai_home: C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Lib\site-packages
12:17:42 |     person_tokens: False
12:17:42 |     rank_candidates: False
12:17:42 |     relu_dropout: 0.0
12:17:42 |     save_after_valid: True
12:17:42 |     save_every_n_secs: 60.0
12:17:42 |     save_format: conversations
12:17:42 |     seed: 42
12:17:42 |     share_word_embeddings: True
12:17:42 |     short_final_eval: False
12:17:42 |     skip_generation: False
12:17:42 |     special_tok_lst: None
12:17:42 |     split_lines: False
12:17:42 |     starttime: Mar16_17-40
12:17:42 |     task: babi:task10k:1
12:17:42 |     teacher_seed: None
12:17:42 |     temperature: 1.0
12:17:42 |     tensorboard_log: True
12:17:42 |     tensorboard_logdir: C:/Users/superserver/Desktop/guillermo/Parlai/tensorboard
12:17:42 |     text_truncate: 512
12:17:42 |     topk: 10
12:17:42 |     topp: 0.9
12:17:42 |     truncate: -1
12:17:42 |     update_freq: 1
12:17:42 |     use_reply: label
12:17:42 |     validation_cutoff: 1.0
12:17:42 |     validation_every_n_epochs: 0.25
12:17:42 |     validation_every_n_secs: -1
12:17:42 |     validation_every_n_steps: -1
12:17:42 |     validation_max_exs: 20000
12:17:42 |     validation_metric: ppl
12:17:42 |     validation_metric_mode: min
12:17:42 |     validation_patience: 15
12:17:42 |     validation_share_agent: False
12:17:42 |     variant: xlm
12:17:42 |     verbose: False
12:17:42 |     wandb_entity: None
12:17:42 |     wandb_log: False
12:17:42 |     wandb_log_model: False
12:17:42 |     wandb_name: None
12:17:42 |     wandb_project: None
12:17:42 |     warmup_rate: 0.0001
12:17:42 |     warmup_updates: -1
12:17:42 |     weight_decay: None
12:17:42 |     world_logs:
Traceback (most recent call last):
  File "c:\users\superserver\appdata\local\programs\python\python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\superserver\appdata\local\programs\python\python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Scripts\parlai.exe\__main__.py", line 7, in <module>
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\__main__.py", line 14, in main
    superscript_main()
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\script.py", line 325, in superscript_main
    return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\script.py", line 108, in _run_from_parser_and_opt
    return script.run()
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\scripts\display_model.py", line 91, in run
    display_model(self.opt)
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\scripts\display_model.py", line 70, in display_model
    world.parley()
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\worlds.py", line 370, in parley
    acts[1] = agents[1].act()
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\torch_agent.py", line 2148, in act
    response = self.batch_act([self.observation])[0]
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\torch_agent.py", line 2244, in batch_act
    output = self.eval_step(batch)
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\torch_generator_agent.py", line 901, in eval_step
    beam_preds_scores, beams = self._generate(
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\torch_generator_agent.py", line 1223, in _generate
    b.advance(score[i], _ts)
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\torch_generator_agent.py", line 1599, in advance
    self.partial_hyps[path_selection.hypothesis_ids.long()],
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Additional context This is my first training run, so I may have done something wrong.

mojtaba-komeili commented 1 year ago

This looks like an error related to your GPU setup (possibly the CUDA install). Since you are running on MS Windows, my first guess would be an OS-related issue. Could you try running with the --no-cuda option to check that the command itself works? It will be very slow because everything runs on the CPU, but at least we can confirm that everything else is correct. If it does work, try your experiment on a Linux machine, or use our Docker image.
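As a minimal sketch of the CPU-only check suggested above, the snippet below assembles the original display_model command with --no-cuda appended. The helper name `build_display_model_cmd` is hypothetical and only for illustration; the flags and example values are the ones that already appear in this thread.

```python
import shlex


def build_display_model_cmd(task, model_file, no_cuda=True):
    """Assemble the `parlai display_model` argument list used in this thread.

    `build_display_model_cmd` is a hypothetical helper, not part of ParlAI.
    """
    cmd = [
        "parlai", "display_model",
        "-t", task,
        "--model-file", model_file,
        "--skip-generation", "false",
    ]
    if no_cuda:
        # Force CPU execution to rule out GPU/CUDA problems.
        cmd.append("--no-cuda")
    return cmd


if __name__ == "__main__":
    cmd = build_display_model_cmd(
        "babi:task10k:1",
        "C:/Users/superserver/Desktop/guillermo/Parlai/model/modelo",
    )
    # Print the command line instead of executing it.
    print(shlex.join(cmd))
```

To actually run the check, the list could be passed to `subprocess.run(cmd, check=True)` from the environment where ParlAI is installed.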

TheMrguiller commented 1 year ago

The command doesn't work; it gives an error related to NfGramRepeatBlock:

(parlai) PS C:\Users\superserver\Desktop\guillermo\Parlai> parlai safe_interactive -t blended_skill_talk -mf zoo:blender/blender_90M/model --no-cuda
15:51:55 | Unable to load ngram blocking on GPU: Error building extension 'ngram_repeat_block_cuda': [1/1] "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64/link.exe" ngram_repeat_block_cuda.o ngram_repeat_block_cuda_kernel.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib /LIBPATH:C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\torch\lib torch_python.lib /LIBPATH:C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Scripts\libs "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\lib\x64" cudart.lib /out:ngram_repeat_block_cuda.pyd
FAILED: ngram_repeat_block_cuda.pyd
"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64/link.exe" ngram_repeat_block_cuda.o ngram_repeat_block_cuda_kernel.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib /LIBPATH:C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\torch\lib torch_python.lib /LIBPATH:C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Scripts\libs "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\lib\x64" cudart.lib /out:ngram_repeat_block_cuda.pyd
LINK : fatal error LNK1104: cannot open file 'python38.lib'
ninja: build stopped: subcommand failed.

Traceback (most recent call last):
  File "c:\users\superserver\appdata\local\programs\python\python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\superserver\appdata\local\programs\python\python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\Scripts\parlai.exe\__main__.py", line 7, in <module>
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\__main__.py", line 14, in main
    superscript_main()
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\script.py", line 247, in superscript_main
    setup_script_registry()
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\script.py", line 37, in setup_script_registry
    importlib.import_module(module.name)
  File "c:\users\superserver\appdata\local\programs\python\python38\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\scripts\detect_offensive_language.py", line 19, in <module>
    from parlai.utils.safety import OffensiveStringMatcher, OffensiveLanguageClassifier
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\utils\safety.py", line 10, in <module>
    from parlai.agents.transformer.transformer import TransformerClassifierAgent
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\agents\transformer\transformer.py", line 15, in <module>
    from parlai.core.torch_generator_agent import TorchGeneratorAgent
  File "C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\core\torch_generator_agent.py", line 48, in <module>
    from parlai.ops.ngram_repeat_block import NfGramRepeatBlock
ImportError: cannot import name 'NfGramRepeatBlock' from 'parlai.ops.ngram_repeat_block' (C:\Users\superserver\Desktop\guillermo\Parlai\parlai\lib\site-packages\parlai\ops\ngram_repeat_block.py)

klshuster commented 1 year ago

Which PyTorch version are you using? This is a known issue.

TheMrguiller commented 1 year ago

I am currently using the latest version of PyTorch.

mojtaba-komeili commented 1 year ago

But ParlAI has certain requirements for the PyTorch version: see this.

TheMrguiller commented 1 year ago

I will check it, but I did install using the requirements.txt. If the problem continues, do you recommend using Docker? I also have another question about the model once it is trained: can it be saved in another format, for example to upload it to Hugging Face or to use it with TensorFlow?

klshuster commented 1 year ago

We do not offer alternative model saving formats at the moment.

Are you on the latest version of ParlAI?

klshuster commented 1 year ago

This is the proper fix, though it may not be in the latest ParlAI release yet: https://github.com/facebookresearch/ParlAI/pull/4887

I think downgrading PyTorch to < 1.13 would also solve the issue.
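To see whether the downgrade workaround applies to a given environment, one could compare the installed PyTorch version against the 1.13 bound mentioned above. The following is a rough sketch using only the standard library; the function names are hypothetical, and the version parsing assumes the usual `major.minor.patch[+local]` form such as `1.13.1+cu117`.

```python
import re
from importlib import metadata


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_below_1_13(version):
    """Return True when the version is older than 1.13."""
    return parse_major_minor(version) < (1, 13)


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("torch is not installed in this environment")
    elif is_below_1_13(version):
        print(f"torch {version} is below 1.13")
    else:
        print(f"torch {version} is 1.13 or newer")
```

Running this inside the same virtual environment as ParlAI reports which side of the bound the installation is on; it does not check whether the linked pull request is included in the installed ParlAI.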

github-actions[bot] commented 1 year ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.