NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

omegaconf.errors.ConfigKeyError: Key 'use_tarred_dataset' not in 'MTEncDecConfig' #5579

Closed · felicitywang1 closed this 1 year ago

felicitywang1 commented 1 year ago

Hi, I'm running NMT training following https://github.com/NVIDIA/NeMo/pull/1738 with this script:

# in /group-volume/Bixby-Compression/sharing/felicity.wang/NeMo/examples/nlp/machine_translation
python3 enc_dec_nmt.py \
      --config-path=conf \
      --config-name=aayn_base \
      trainer.devices=[0] \
      ~trainer.max_epochs \
      +trainer.max_steps=100 \
      model.beam_size=4 \
      model.max_generation_delta=5 \
      model.label_smoothing=0.1 \
      model.encoder_tokenizer.vocab_size=32000 \
      model.decoder_tokenizer.vocab_size=32000 \
      +model.encoder_tokenizer.bpe_dropout=0.1 \
      +model.decoder_tokenizer.bpe_dropout=0.1 \
      model.preproc_out_dir=tmp_data/en_es_preproc \
      model.train_ds.src_file_name=tmp_data/WikiMatrix.en-fr.langidfilter.lengthratio.bicleaner.60.dedup.moses.norm.tok.shuf.en \
      model.train_ds.tgt_file_name=tmp_data/WikiMatrix.en-fr.langidfilter.lengthratio.bicleaner.60.dedup.moses.norm.tok.shuf.fr \
      model.validation_ds.src_file_name=data/wmt14.en-fr.en \
      model.validation_ds.tgt_file_name=data/wmt14.en-fr.fr \
      model.test_ds.src_file_name=data/wmt13.en-fr.en \
      model.test_ds.tgt_file_name=data/wmt13.en-fr.fr \
      +use_tarred_dataset=True \
      model.encoder.num_layers=6 \
      model.encoder.hidden_size=256 \
      model.encoder.inner_size=1024 \
      model.encoder.num_attention_heads=8 \
      model.encoder.ffn_dropout=0.1 \
      model.decoder.num_layers=6 \
      model.decoder.hidden_size=256 \
      model.decoder.inner_size=1024 \
      model.decoder.num_attention_heads=8 \
      model.decoder.ffn_dropout=0.1 \
      model.train_ds.tokens_in_batch=12500 \
      model.validation_ds.tokens_in_batch=8192 \
      model.optim.lr=0.001  \
      model.optim.sched.warmup_ratio=0.05 \
      +exp_manager.create_wandb_logger=True \
      +exp_manager.wandb_logger_kwargs.name=TEST-nmt-base \
      +exp_manager.wandb_logger_kwargs.project=nmt-de-en \
      +exp_manager.create_checkpoint_callback=True \
      +exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU \
      +exp_manager.exp_dir=nmt_base \
      +exp_manager.checkpoint_callback_params.mode=max

I'm trying to do BPE training and model training together within this script, and got:

...

Error executing job with overrides: ['trainer.devices=[0]', '~trainer.max_epochs', '+trainer.max_steps=100', 'model.beam_size=4', 'model.max_generation_delta=5', 'model.label_smoothing=0.1', 'model.encoder_tokenizer.vocab_size=32000', 'model.decoder_tokenizer.vocab_size=32000', '+model.encoder_tokenizer.bpe_dropout=0.1', '+model.decoder_tokenizer.bpe_dropout=0.1', 'model.preproc_out_dir=tmp_data/en_es_preproc', 'model.train_ds.src_file_name=tmp_data/WikiMatrix.en-fr.langidfilter.lengthratio.bicleaner.60.dedup.moses.norm.tok.shuf.en', 'model.train_ds.tgt_file_name=tmp_data/WikiMatrix.en-fr.langidfilter.lengthratio.bicleaner.60.dedup.moses.norm.tok.shuf.fr', 'model.validation_ds.src_file_name=data/wmt14.en-fr.en', 'model.validation_ds.tgt_file_name=data/wmt14.en-fr.fr', 'model.test_ds.src_file_name=data/wmt13.en-fr.en', 'model.test_ds.tgt_file_name=data/wmt13.en-fr.fr', '+use_tarred_dataset=True', 'model.encoder.num_layers=6', 'model.encoder.hidden_size=256', 'model.encoder.inner_size=1024', 'model.encoder.num_attention_heads=8', 'model.encoder.ffn_dropout=0.1', 'model.decoder.num_layers=6', 'model.decoder.hidden_size=256', 'model.decoder.inner_size=1024', 'model.decoder.num_attention_heads=8', 'model.decoder.ffn_dropout=0.1', 'model.train_ds.tokens_in_batch=12500', 'model.validation_ds.tokens_in_batch=8192', 'model.optim.lr=0.001', 'model.optim.sched.warmup_ratio=0.05', '+exp_manager.create_wandb_logger=True', '+exp_manager.wandb_logger_kwargs.name=TEST-nmt-base', '+exp_manager.wandb_logger_kwargs.project=nmt-de-en', '+exp_manager.create_checkpoint_callback=True', '+exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU', '+exp_manager.exp_dir=nmt_base', '+exp_manager.checkpoint_callback_params.mode=max']
Traceback (most recent call last):
  File "enc_dec_nmt.py", line 107, in main
    cfg = update_model_config(default_cfg, cfg)
  File "/home/user/.local/lib/python3.8/site-packages/nemo/utils/config_utils.py", line 105, in update_model_config
    model_cfg = OmegaConf.merge(model_cls, update_cfg)
omegaconf.errors.ConfigKeyError: Key 'use_tarred_dataset' not in 'MTEncDecConfig'
    full_key: use_tarred_dataset
    object_type=MTEncDecConfig
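
For context, Hydra's '+' prefix adds a brand-new key to the composed config, so '+use_tarred_dataset=True' lands at the top level, where the structured MTEncDecConfig schema has no such field, and the OmegaConf.merge call in update_model_config rejects it. A minimal sketch of the distinction, assuming this NeMo version defines the tarred-dataset flag under model.train_ds (an assumption worth checking against conf/aayn_base.yaml):

# Hydra override semantics:
#   key=value   -> override a key that already exists in the composed config
#   +key=value  -> append a new key (it must still be a field of the structured
#                  schema that NeMo merges the config into)
# Assumption: use_tarred_dataset is a field of the train_ds config in this
# NeMo version, so it is set there rather than at the top level. The remaining
# overrides from the command above are unchanged and omitted here.
python3 enc_dec_nmt.py \
      --config-path=conf \
      --config-name=aayn_base \
      model.train_ds.use_tarred_dataset=true \
      model.preproc_out_dir=tmp_data/en_es_preproc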

My environment is:

nemo-toolkit                  1.13.0
hydra-core                    1.1.2
omegaconf                     2.1.2
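
(For reference, a quick way to confirm these versions in the active environment is a pip listing; the grep pattern below is just illustrative.)

# Show the installed versions of the packages involved.
pip3 list | grep -Ei 'nemo-toolkit|hydra-core|omegaconf'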

Any guidance on how I can solve this? Or could you provide the exact environment versions for NMT training?

Thank you.

ericharper commented 1 year ago

Could you try running with our latest container? https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags
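
As a rough sketch of what that looks like (the tag is a placeholder; pick the latest one from the catalog page):

# Pull a NeMo container from NGC and start an interactive shell with GPU access.
# <tag> is a placeholder for the latest tag listed on the NGC catalog page.
docker pull nvcr.io/nvidia/nemo:<tag>
docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/nemo:<tag> bash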

Also, our latest machine translation models are based on Megatron T5. You could try this script as well: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/machine_translation/megatron_nmt_training.py