NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Llama2 70B SFT with FSDP failing #9138

Closed satheeshkatipomu closed 2 months ago

satheeshkatipomu commented 4 months ago

Unable to fine-tune Llama2 70B with FSDP

I am trying to fine-tune the Llama2 70B model on a dataset. With TP=4, PP=8 it works fine, but with FSDP on 6 nodes it fails with the error below:

File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 670, in setup_environment
    for p in self.model.parameters():
AttributeError: 'NoneType' object has no attribute 'parameters'
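
The failure itself is the generic pattern of iterating parameters on a model handle that was never attached: in the FSDP path, self.model is apparently still None when setup_environment() runs. A minimal standalone illustration (not NeMo code; the stub class and the guard are hypothetical):

from typing import Optional

import torch.nn as nn


class StrategyStub:
    """Stand-in for the strategy object; only illustrates the failure mode."""

    def __init__(self, model: Optional[nn.Module]):
        self.model = model  # in the failing FSDP run this ends up as None

    def setup_environment(self):
        # A guard like this turns the AttributeError into a clearer message,
        # but the underlying question is why the model was never attached.
        if self.model is None:
            raise RuntimeError("model was not attached before setup_environment()")
        for p in self.model.parameters():
            _ = p.dtype


StrategyStub(nn.Linear(4, 4)).setup_environment()  # fine
StrategyStub(None).setup_environment()             # raises RuntimeError, not AttributeError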

Steps/Code to reproduce bug

  1. Converted the Llama2 70B base model checkpoint from Hugging Face to NeMo format.
  2. Started training on 6 nodes with the config below (a quick check of the batch/parallelism arithmetic follows the config).
    run:
      name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1_llama2_70b
      time_limit: 3-04:00:00
      dependency: singleton
      convert_name: convert_nemo
      model_train_name: llama2_70b
      convert_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/convert_nemo
      task_name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
      results_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
    trainer:
      devices: 8
      accelerator: gpu
      num_nodes: 6
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: null
      max_steps: 13000
      log_every_n_steps: 10
      val_check_interval: 300
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1/results
      exp_dir: null
      name: megatron_llama_llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: nemo_llama_llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
        name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1_llama2_70b
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_loss
        save_top_k: 5
        mode: min
        save_nemo_on_train_end: true
        filename: megatron_gpt_sft--{validation_loss:.3f}-{step}-{consumed_samples}
        model_parallel_size: 4
        save_best_model: true
    model:
      seed: 1234
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      global_batch_size: 528
      micro_batch_size: 1
      restore_from_path: /workspace/llama2_models
      resume_from_checkpoint: null
      save_nemo_on_validation_end: false
      sync_batch_comm: false
      megatron_amp_O2: false
      sequence_parallel: true
      activations_checkpoint_granularity: selective
      activations_checkpoint_method: uniform
      activations_checkpoint_num_layers: null
      answer_only_loss: true
      gradient_as_bucket_view: false
      seq_len_interpolation_factor: null
      use_flash_attention: true
      hidden_dropout: 0.1
      attention_dropout: 0.1
      ffn_dropout: 0.1
      fsdp: true
      fsdp_sharding_strategy: full
      fsdp_grad_reduce_dtype: bf16
      fsdp_sharded_checkpoint: false
      fsdp_use_orig_params: false
      peft:
        peft_scheme: null
        restore_from_path: null
        adapter_tuning:
          type: parallel_adapter
          adapter_dim: 32
          adapter_dropout: 0.0
          norm_position: pre
          column_init_method: xavier
          row_init_method: zero
          norm_type: mixedfusedlayernorm
          layer_selection: null
          weight_tying: false
          position_embedding_strategy: null
        lora_tuning:
          adapter_dim: 32
          adapter_dropout: 0.0
          column_init_method: xavier
          row_init_method: zero
          layer_selection: null
          weight_tying: false
          position_embedding_strategy: null
        p_tuning:
          virtual_tokens: 10
          bottleneck_dim: 1024
          embedding_dim: 1024
          init_std: 0.023
        ia3_tuning:
          layer_selection: null
      data:
        chat: false
        train_ds:
          file_names:
          - ~/Projects/data/training.jsonl
          global_batch_size: 528
          micro_batch_size: 1
          shuffle: false
          num_workers: 4
          pin_memory: true
          max_seq_length: 4096
          min_seq_length: 1
          drop_last: true
          concat_sampling_probabilities:
          - 1.0
          context_key: input
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: true
          separate_prompt_and_response_with_newline: true
          truncation_field: context
          index_mapping_dir: null
          prompt_template: '{input} {output}'
        validation_ds:
          file_names:
          - ~/Projects/data/validation.jsonl
          names:
          - llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
          global_batch_size: 528
          micro_batch_size: 1
          shuffle: false
          num_workers: 4
          pin_memory: true
          max_seq_length: 4096
          min_seq_length: 1
          drop_last: true
          context_key: input
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: true
          separate_prompt_and_response_with_newline: true
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: context
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          metric:
            name: loss
            average: null
            num_classes: null
        test_ds:
          file_names:
          - ~/Projects/data/test.jsonl
          names: null
          global_batch_size: 528
          micro_batch_size: 1
          shuffle: false
          num_workers: 4
          pin_memory: true
          max_seq_length: 4096
          min_seq_length: 1
          drop_last: true
          context_key: input
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: true
          separate_prompt_and_response_with_newline: true
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: context
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          metric:
            name: loss
            average: null
            num_classes: null
      optim:
        name: fused_adam
        lr: 1.0e-06
        weight_decay: 0.1
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          monitor: validation_loss
          min_lr: 1.0e-08
          warmup_steps: 1000
          last_epoch: -1
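
For reference, the batch and parallelism sizes above are at least internally consistent, assuming the usual Megatron accounting where the FSDP/data-parallel size is world_size / (TP x PP). A quick check:

# Sanity-check of the sizes in the config above (assumes dp = world / (tp * pp)).
nodes, gpus_per_node = 6, 8
tp, pp = 4, 1
gbs, mbs = 528, 1

world = nodes * gpus_per_node            # 48 ranks
dp = world // (tp * pp)                  # 12 data-parallel (FSDP) shards
assert world % (tp * pp) == 0
assert gbs % (dp * mbs) == 0             # global batch must divide evenly
print("gradient accumulation steps:", gbs // (dp * mbs))  # 44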

Expected behavior

Llama2 70B SFT with FSDP runs to completion without error.

Environment details
Image: nvcr.io/nvidia/nemo:24.03.01.framework
Running on a Slurm cluster.

xjohnxjohn commented 4 months ago

@satheeshkatipomu Which tool did you use to convert the Llama2 70B checkpoint from Hugging Face to NeMo format?

satheeshkatipomu commented 4 months ago

I used the convert_llama_hf_to_nemo.py script to convert the Llama2 70B model from Hugging Face format to NeMo format. Here is the exact command:

python3 -u /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=/workspace/llama2_models --output_path=/workspace/llama2_models/llama2-70b-base.nemo
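
In case it helps rule out a bad conversion: loading only the config from the resulting archive is a cheap way to confirm the .nemo file is readable. A sketch (not re-verified in this exact container image; it relies on the generic restore_from(..., return_config=True) path):

# Sketch: read just the model config (no weights) from the converted .nemo file.
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel

cfg = MegatronGPTModel.restore_from(
    restore_path="/workspace/llama2_models/llama2-70b-base.nemo",
    return_config=True,  # unpack the stored model config only
)
print(cfg.num_layers, cfg.hidden_size, cfg.num_attention_heads)
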
github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

zirui commented 2 months ago

Is there any solution to this issue?