NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Can't start to finetune and can't pull nvcr.io/nvidia/nemo:24.01 or nvcr.io/ea-bignlp/ga-participants/nemofw-training:24.01 #8550

Closed: cyc00518 closed this issue 6 months ago

cyc00518 commented 7 months ago

Describe the bug

I am following the NeMo Framework PEFT with Mistral-7B tutorial.

First error: I can't pull the image even though I logged in to nvcr.io successfully:

12345i@ai_pc:~/NEMO_TRAIN$ docker login nvcr.io
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /ldap/home/12345/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
12345@ai_pc:~/NEMO_TRAIN$docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-training:24.01
Error response from daemon: pull access denied for nvcr.io/ea-bignlp/ga-participants/nemofw-training, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

So I changed the Docker image to nvcr.io/nvidia/nemo:23.10, which is mentioned in the README.
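For reference, pulling and starting that container looks roughly like this (the volume mount and working directory are my assumptions, based on the /usr/src/app paths in the logs below; adjust to your own layout):

# Pull the publicly available NeMo container from the README
docker pull nvcr.io/nvidia/nemo:23.10

# Start it with GPU access and the project directory mounted
docker run --gpus all -it --rm \
    -v "$PWD":/usr/src/app -w /usr/src/app \
    nvcr.io/nvidia/nemo:23.10 bash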

When I start Step 2 (Run PEFT training), here is the error:

Traceback (most recent call last):
  File "/usr/src/app/./NeMo-main/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py", line 59, in <module>
    @deprecated(
TypeError: deprecated() got an unexpected keyword argument 'wait_seconds'

So I commented out the following decorator in megatron_gpt_peft_tuning.py:

# @deprecated(
#    wait_seconds=20,
#    explanation=f"\n{banner}\nmegatron_gpt_peft_tuning.py is renamed to megatron_gpt_finetuning.py with the "
#    f"same functionality. \nPlease switch to the new name.\n{banner}\n",
# )
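For context, the script comes from NeMo-main, while the nemo package baked into the 23.10 container appears to predate the wait_seconds argument, so the decorator call fails at import time. A minimal, self-contained illustration of that kind of mismatch (not NeMo's actual code):

# Illustration only: an older decorator factory rejects a keyword
# argument that a newer call site passes.
def deprecated(explanation=""):            # older signature: no wait_seconds
    def wrapper(func):
        def inner(*args, **kwargs):
            print(f"DEPRECATED: {explanation}")
            return func(*args, **kwargs)
        return inner
    return wrapper

try:
    @deprecated(wait_seconds=20, explanation="renamed")    # newer call site
    def main():
        pass
except TypeError as e:
    print(e)    # deprecated() got an unexpected keyword argument 'wait_seconds'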

Then it starts to work. Here is the process:

appuser@ai_pc:/usr/src/app$ python3 ./NeMo-main/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.val_check_interval=20 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=False \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.micro_batch_size=1 \
    model.global_batch_size=8 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.num_workers=0 \
    model.data.validation_ds.num_workers=1 \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME} \
    exp_manager.checkpoint_callback_params.mode=min
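For completeness, the shell variables in the command expand to the following in my run (values taken from the Hydra overrides echoed in the error output further down):

MODEL=./model/mistral.nemo
TRAIN_DS="[./pubmedqa/pubmedqa_train.jsonl]"
VALID_DS="[./pubmedqa/pubmedqa_val.jsonl]"
SCHEME=lora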
[NeMo W 2024-02-29 13:43:09 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo I 2024-02-29 13:43:09 megatron_gpt_peft_tuning:66] 

    ************** Experiment configuration ***********
[NeMo I 2024-02-29 13:43:09 megatron_gpt_peft_tuning:67] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 20
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
        save_top_k: 1
        mode: min
        save_nemo_on_train_end: true
        filename: ${name}--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{consumed_samples}
        model_parallel_size: ${model.tensor_model_parallel_size}
        always_save_nemo: false
        save_best_model: true
      create_early_stopping_callback: true
      early_stopping_callback_params:
        monitor: val_loss
        mode: min
        min_delta: 0.001
        patience: 10
        verbose: true
        strict: false
    model:
      seed: 1234
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      global_batch_size: 8
      micro_batch_size: 1
      restore_from_path: ./model/mistral.nemo
      resume_from_checkpoint: null
      save_nemo_on_validation_end: false
      sync_batch_comm: false
      megatron_amp_O2: false
      sequence_parallel: false
      activations_checkpoint_granularity: null
      activations_checkpoint_method: null
      activations_checkpoint_num_layers: null
      activations_checkpoint_layers_per_pipeline: null
      answer_only_loss: true
      gradient_as_bucket_view: false
      hidden_dropout: 0.0
      attention_dropout: 0.0
      ffn_dropout: 0.0
      peft:
        peft_scheme: lora
        restore_from_path: null
        adapter_tuning:
          type: parallel_adapter
          adapter_dim: 32
          adapter_dropout: 0.0
          norm_position: pre
          column_init_method: xavier
          row_init_method: zero
          norm_type: mixedfusedlayernorm
          layer_selection: null
          weight_tying: false
          position_embedding_strategy: null
        lora_tuning:
          target_modules:
          - attention_qkv
          adapter_dim: 32
          alpha: ${model.peft.lora_tuning.adapter_dim}
          adapter_dropout: 0.0
          column_init_method: xavier
          row_init_method: zero
          layer_selection: null
          weight_tying: false
          position_embedding_strategy: null
        p_tuning:
          virtual_tokens: 10
          bottleneck_dim: 1024
          embedding_dim: 1024
          init_std: 0.023
        ia3_tuning:
          layer_selection: null
        selective_tuning:
          tunable_base_param_names:
          - self_attention
          - word_embeddings
      data:
        train_ds:
          file_names:
          - ./pubmedqa/pubmedqa_train.jsonl
          global_batch_size: ${model.global_batch_size}
          micro_batch_size: ${model.micro_batch_size}
          shuffle: true
          num_workers: 0
          memmap_workers: 2
          pin_memory: true
          max_seq_length: 2048
          min_seq_length: 1
          drop_last: true
          concat_sampling_probabilities:
          - 1.0
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: false
          truncation_field: input
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          truncation_method: right
        validation_ds:
          file_names:
          - ./pubmedqa/pubmedqa_val.jsonl
          names: null
          global_batch_size: ${model.global_batch_size}
          micro_batch_size: ${model.micro_batch_size}
          shuffle: false
          num_workers: 1
          memmap_workers: ${model.data.train_ds.memmap_workers}
          pin_memory: true
          max_seq_length: 2048
          min_seq_length: 1
          drop_last: false
          label_key: ${model.data.train_ds.label_key}
          add_eos: ${model.data.train_ds.add_eos}
          add_sep: ${model.data.train_ds.add_sep}
          add_bos: ${model.data.train_ds.add_bos}
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: ${model.data.train_ds.truncation_field}
          index_mapping_dir: null
          prompt_template: ${model.data.train_ds.prompt_template}
          tokens_to_generate: 32
          truncation_method: right
          metric:
            name: loss
            average: null
            num_classes: null
        test_ds:
          file_names:
          - ./pubmedqa/pubmedqa_test.jsonl
          names:
          - pubmedqa
          global_batch_size: ${model.global_batch_size}
          micro_batch_size: ${model.micro_batch_size}
          shuffle: false
          num_workers: 0
          memmap_workers: ${model.data.train_ds.memmap_workers}
          pin_memory: true
          max_seq_length: 2048
          min_seq_length: 1
          drop_last: false
          label_key: ${model.data.train_ds.label_key}
          add_eos: ${model.data.train_ds.add_eos}
          add_sep: ${model.data.train_ds.add_sep}
          add_bos: ${model.data.train_ds.add_bos}
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: ${model.data.train_ds.truncation_field}
          index_mapping_dir: null
          prompt_template: ${model.data.train_ds.prompt_template}
          tokens_to_generate: 32
          truncation_method: right
          metric:
            name: loss
            average: null
            num_classes: null
      optim:
        name: fused_adam
        lr: 0.0001
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          warmup_steps: 50
          min_lr: 0.0
          constant_steps: 0
          monitor: val_loss
          reduce_on_plateau: false
      mcore_gpt: true

[NeMo W 2024-02-29 13:43:09 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:554: UserWarning: bf16 is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
      rank_zero_warn(

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2024-02-29 13:43:09 exp_manager:754] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2024-02-29 13:43:09 exp_manager:611] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/usr/src/app/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints. Training from scratch.
[NeMo I 2024-02-29 13:43:09 exp_manager:394] Experiments will be logged at /usr/src/app/nemo_experiments/megatron_gpt_peft_lora_tuning
[NeMo I 2024-02-29 13:43:09 exp_manager:835] TensorboardLogger has been set up
[NeMo W 2024-02-29 13:43:09 exp_manager:931] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-02-29 13:43:32 megatron_init:234] Rank 0 has data parallel group: [0]
[NeMo I 2024-02-29 13:43:32 megatron_init:237] All data parallel group ranks: [[0]]
[NeMo I 2024-02-29 13:43:32 megatron_init:238] Ranks 0 has data parallel rank: 0
[NeMo I 2024-02-29 13:43:32 megatron_init:246] Rank 0 has model parallel group: [0]
[NeMo I 2024-02-29 13:43:32 megatron_init:247] All model parallel group ranks: [[0]]
[NeMo I 2024-02-29 13:43:32 megatron_init:257] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-02-29 13:43:32 megatron_init:261] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-02-29 13:43:32 megatron_init:262] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-02-29 13:43:32 megatron_init:276] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-02-29 13:43:32 megatron_init:288] Rank 0 has embedding group: [0]
[NeMo I 2024-02-29 13:43:32 megatron_init:294] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-02-29 13:43:32 megatron_init:295] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-02-29 13:43:32 megatron_init:296] All embedding group ranks: [[0]]
[NeMo I 2024-02-29 13:43:32 megatron_init:297] Rank 0 has embedding rank: 0
24-02-29 13:43:32 - PID:2242 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 8
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-02-29 13:43:32 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmpdiuso3r4/49ec005ed7e34fdab15e18659ba3b65a_tokenizer.model
[NeMo I 2024-02-29 13:43:32 megatron_base_model:315] Padded vocab_size: 32000, original vocab_size: 32000, dummy tokens: 0.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: pipeline_model_parallel_split_rank in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_gpt_model:1554] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-02-29 13:43:32 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-02-29 13:43:32 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Loading distributed checkpoint with TensorStoreLoadShardedStrategy
[NeMo I 2024-02-29 13:44:52 nlp_overrides:752] Model MegatronGPTSFTModel was successfully restored from /usr/src/app/model/mistral.nemo.
[NeMo I 2024-02-29 13:44:52 megatron_gpt_peft_tuning:82] Adding adapter weights to the model for PEFT
[NeMo I 2024-02-29 13:44:52 nlp_adapter_mixins:182] Before adding PEFT params:
      | Name  | Type     | Params
    -----------------------------------
    0 | model | GPTModel | 7.2 B 
    -----------------------------------
    0         Trainable params
    7.2 B     Non-trainable params
    7.2 B     Total params
    28,966.928Total estimated model params size (MB)
[NeMo I 2024-02-29 13:44:55 nlp_adapter_mixins:195] After adding PEFT params:
      | Name  | Type     | Params
    -----------------------------------
    0 | model | GPTModel | 7.3 B 
    -----------------------------------
    10.5 M    Trainable params
    7.2 B     Non-trainable params
    7.3 B     Total params
    29,008.871Total estimated model params size (MB)
[NeMo W 2024-02-29 13:44:55 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:153: UserWarning: The `batch_idx` argument in `MegatronGPTSFTModel.on_train_batch_start` hook may not match with the actual batch index when using a `dataloader_iter` argument in your `training_step`.
      rank_zero_warn(

[NeMo W 2024-02-29 13:44:55 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:153: UserWarning: The `batch_idx` argument in `MegatronGPTSFTModel.on_train_batch_end` hook may not match with the actual batch index when using a `dataloader_iter` argument in your `training_step`.
      rank_zero_warn(

[NeMo I 2024-02-29 13:44:55 megatron_gpt_sft_model:752] Building GPT SFT validation datasets.
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:116] Building data files
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.098850
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.053183
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:158] Loading data files
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:249] Loading ./pubmedqa/pubmedqa_val.jsonl
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000691
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-02-29 13:44:55 megatron_gpt_sft_model:755] Length of val dataset: 50
[NeMo I 2024-02-29 13:44:55 megatron_gpt_sft_model:759] Building GPT SFT test datasets.
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:116] Building data files
[NeMo I 2024-02-29 13:44:55 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.047882
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.047631
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:158] Loading data files
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:249] Loading ./pubmedqa/pubmedqa_test.jsonl
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000603
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-02-29 13:44:56 megatron_gpt_sft_model:762] Length of test dataset: 500
[NeMo I 2024-02-29 13:44:56 megatron_gpt_sft_model:766] Building GPT SFT traing datasets.
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:116] Building data files
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.050147
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.049478
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:158] Loading data files
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:249] Loading ./pubmedqa/pubmedqa_train.jsonl
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000572
[NeMo I 2024-02-29 13:44:56 text_memmap_dataset:165] Computing global indices
[NeMo W 2024-02-29 13:44:56 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/data/language_modeling/megatron/dataset_utils.py:1332: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
      counts = torch.cuda.LongTensor([1])

[NeMo I 2024-02-29 13:44:57 dataset_utils:1341]  > loading indexed mapping from ./pubmedqa/pubmedqa_train.jsonl_pubmedqa_train.jsonl_indexmap_402mns_2046msl_0.00ssp_1234s.npy
[NeMo I 2024-02-29 13:44:57 dataset_utils:1344]     loaded indexed file in 0.000 seconds
[NeMo I 2024-02-29 13:44:57 dataset_utils:1345]     total number of samples: 450
make: Entering directory '/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-02-29 13:44:57 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.03 (sec)
[NeMo I 2024-02-29 13:44:57 megatron_gpt_sft_model:768] Length of train dataset: 402
[NeMo I 2024-02-29 13:44:57 megatron_gpt_sft_model:773] Building dataloader with consumed samples: 0
[NeMo I 2024-02-29 13:44:57 megatron_gpt_sft_model:773] Building dataloader with consumed samples: 0
[NeMo I 2024-02-29 13:44:57 megatron_gpt_sft_model:773] Building dataloader with consumed samples: 0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo I 2024-02-29 13:44:57 nlp_overrides:150] Configuring DDP for model parallelism.
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-02-29 13:44:57 nlp_adapter_mixins:245] Optimizer groups set:
      | Name  | Type     | Params
    -----------------------------------
    0 | model | GPTModel | 7.3 B 
    -----------------------------------
    10.5 M    Trainable params
    7.2 B     Non-trainable params
    7.3 B     Total params
    29,008.871Total estimated model params size (MB)
[NeMo I 2024-02-29 13:44:57 modelPT:728] Optimizer config = FusedAdam (
    Parameter Group 0
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 0.0001
        weight_decay: 0.01
    )
[NeMo I 2024-02-29 13:44:57 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7ff9feffc070>" 
    will be used during training (effective maximum steps = 50) - 
    Parameters : 
    (warmup_steps: 50
    min_lr: 0.0
    constant_steps: 0
    max_steps: 50
    )

  | Name  | Type     | Params
-----------------------------------
0 | model | GPTModel | 7.3 B 
-----------------------------------
10.5 M    Trainable params
7.2 B     Non-trainable params
7.3 B     Total params
29,008.871Total estimated model params size (MB)

But there is another error:

Sanity Checking: 0it [00:00, ?it/s][NeMo W 2024-02-29 13:45:11 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:438: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 224 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
      rank_zero_warn(

[NeMo W 2024-02-29 13:45:11 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:148: UserWarning: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
      rank_zero_warn(

[NeMo W 2024-02-29 13:45:25 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/apex/transformer/pipeline_parallel/utils.py:81: UserWarning: This function is only for unittest
      warnings.warn("This function is only for unittest")

Sanity Checking DataLoader 0:   0%|                                                                                                                                                                       | 0/2 [00:00<?, ?it/s]Error executing job with overrides: ['trainer.devices=1', 'trainer.num_nodes=1', 'trainer.precision=bf16', 'trainer.val_check_interval=20', 'trainer.max_steps=50', 'model.megatron_amp_O2=False', '++model.mcore_gpt=True', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.micro_batch_size=1', 'model.global_batch_size=8', 'model.restore_from_path=./model/mistral.nemo', 'model.data.train_ds.num_workers=0', 'model.data.validation_ds.num_workers=1', 'model.data.train_ds.file_names=[./pubmedqa/pubmedqa_train.jsonl]', 'model.data.train_ds.concat_sampling_probabilities=[1.0]', 'model.data.validation_ds.file_names=[./pubmedqa/pubmedqa_val.jsonl]', 'model.peft.peft_scheme=lora', 'exp_manager.checkpoint_callback_params.mode=min']
Traceback (most recent call last):
  File "/usr/src/app/./NeMo-main/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py", line 87, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 376, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 293, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 338, in validation_step
    return self.model(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1521, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1357, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/overrides/base.py", line 102, in forward
    return self._forward_module.validation_step(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 387, in validation_step
    return self.inference_step(dataloader_iter, batch_idx, 'validation', dataloader_idx)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 403, in inference_step
    loss = super().validation_step(itertools.chain([batch]), batch_idx)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 963, in validation_step
    loss = self.fwd_bwd_step(dataloader_iter, batch_idx, True)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 347, in fwd_bwd_step
    losses_reduced_per_micro_batch = fwd_bwd_function(
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 327, in forward_backward_no_pipelining
    output_tensor = forward_step(
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 183, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 863, in fwd_output_and_loss_func
    output_tensor = model(**forward_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/models/gpt/gpt_model.py", line 166, in forward
    hidden_states = self.decoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/transformer/transformer_block.py", line 311, in forward
    hidden_states, context = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/transformer/transformer_layer.py", line 153, in forward
    attention_output_with_bias = self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/transformer/attention.py", line 263, in forward
    core_attn_out = self.core_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/transformer/custom_layers/transformer_engine.py", line 427, in forward
    return super().forward(
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 1235, in forward
    qkv_layout = _get_qkv_layout(query_layer, key_layer, value_layer,
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 464, in _get_qkv_layout
    raise Exception("The provided qkv memory layout is not supported!")
Exception: The provided qkv memory layout is not supported!

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
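Since the failing check lives in Transformer Engine (called through Megatron-Core), I suspect a mismatch between the NeMo-main scripts and the libraries shipped in the 23.10 container. A quick diagnostic sketch to list the relevant versions inside the container (package names as printed by pip may differ slightly):

# Run inside the container
pip list 2>/dev/null | grep -iE "nemo|megatron|transformer"
python3 -c "import torch; print('torch', torch.__version__)"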

Additional context: GPU: 1x H100

thangdt277 commented 7 months ago

I also hit the problem that I can't pull nvcr.io/ea-bignlp/ga-participants/nemofw-training.

cyc00518 commented 7 months ago

@thangdt277 Were you able to start training with other images?

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.