NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Llama2 70B SFT with FSDP failing #9138

Closed satheeshkatipomu closed 2 months ago

satheeshkatipomu commented 4 months ago

Unable to fine-tune Llama2 70B with FSDP

I am trying to fine-tune the Llama2 70B model on a dataset. With TP=4, PP=8 it works fine, but with FSDP on 6 nodes it fails with the error below:

File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 670, in setup_environment
    for p in self.model.parameters():
AttributeError: 'NoneType' object has no attribute 'parameters'
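
The failure itself is the generic pattern of iterating parameters on a model handle that was never attached: in the FSDP path, self.model is apparently still None when setup_environment() runs. A minimal standalone illustration (not NeMo code; the stub class and the guard are hypothetical):

from typing import Optional

import torch.nn as nn


class StrategyStub:
    """Stand-in for the strategy object; only illustrates the failure mode."""

    def __init__(self, model: Optional[nn.Module]):
        self.model = model  # in the failing FSDP run this ends up as None

    def setup_environment(self):
        # A guard like this turns the AttributeError into a clearer message,
        # but the underlying question is why the model was never attached.
        if self.model is None:
            raise RuntimeError("model was not attached before setup_environment()")
        for p in self.model.parameters():
            _ = p.dtype


StrategyStub(nn.Linear(4, 4)).setup_environment()  # fine
StrategyStub(None).setup_environment()             # raises RuntimeError, not AttributeError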

Steps/Code to reproduce bug

  1. Converted the Llama2 70B base model checkpoint from Hugging Face to NeMo format.
  2. Started training on 6 nodes with the config below (a quick check of the batch/parallelism arithmetic follows the config).
    run:
      name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1_llama2_70b
      time_limit: 3-04:00:00
      dependency: singleton
      convert_name: convert_nemo
      model_train_name: llama2_70b
      convert_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/convert_nemo
      task_name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
      results_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
    trainer:
      devices: 8
      accelerator: gpu
      num_nodes: 6
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: null
      max_steps: 13000
      log_every_n_steps: 10
      val_check_interval: 300
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1/results
      exp_dir: null
      name: megatron_llama_llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: nemo_llama_llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
        name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1_llama2_70b
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_loss
        save_top_k: 5
        mode: min
        save_nemo_on_train_end: true
        filename: megatron_gpt_sft--{validation_loss:.3f}-{step}-{consumed_samples}
        model_parallel_size: 4
        save_best_model: true
    model:
      seed: 1234
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 1
      global_batch_size: 528
      micro_batch_size: 1
      restore_from_path: /workspace/llama2_models
      resume_from_checkpoint: null
      save_nemo_on_validation_end: false
      sync_batch_comm: false
      megatron_amp_O2: false
      sequence_parallel: true
      activations_checkpoint_granularity: selective
      activations_checkpoint_method: uniform
      activations_checkpoint_num_layers: null
      answer_only_loss: true
      gradient_as_bucket_view: false
      seq_len_interpolation_factor: null
      use_flash_attention: true
      hidden_dropout: 0.1
      attention_dropout: 0.1
      ffn_dropout: 0.1
      fsdp: true
      fsdp_sharding_strategy: full
      fsdp_grad_reduce_dtype: bf16
      fsdp_sharded_checkpoint: false
      fsdp_use_orig_params: false
      peft:
        peft_scheme: null
        restore_from_path: null
        adapter_tuning:
          type: parallel_adapter
          adapter_dim: 32
          adapter_dropout: 0.0
          norm_position: pre
          column_init_method: xavier
          row_init_method: zero
          norm_type: mixedfusedlayernorm
          layer_selection: null
          weight_tying: false
          position_embedding_strategy: null
        lora_tuning:
          adapter_dim: 32
          adapter_dropout: 0.0
          column_init_method: xavier
          row_init_method: zero
          layer_selection: null
          weight_tying: false
          position_embedding_strategy: null
        p_tuning:
          virtual_tokens: 10
          bottleneck_dim: 1024
          embedding_dim: 1024
          init_std: 0.023
        ia3_tuning:
          layer_selection: null
      data:
        chat: false
        train_ds:
          file_names:
          - ~/Projects/data/training.jsonl
          global_batch_size: 528
          micro_batch_size: 1
          shuffle: false
          num_workers: 4
          pin_memory: true
          max_seq_length: 4096
          min_seq_length: 1
          drop_last: true
          concat_sampling_probabilities:
          - 1.0
          context_key: input
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: true
          separate_prompt_and_response_with_newline: true
          truncation_field: context
          index_mapping_dir: null
          prompt_template: '{input} {output}'
        validation_ds:
          file_names:
          - ~/Projects/data/validation.jsonl
          names:
          - llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
          global_batch_size: 528
          micro_batch_size: 1
          shuffle: false
          num_workers: 4
          pin_memory: true
          max_seq_length: 4096
          min_seq_length: 1
          drop_last: true
          context_key: input
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: true
          separate_prompt_and_response_with_newline: true
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: context
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          metric:
            name: loss
            average: null
            num_classes: null
        test_ds:
          file_names:
          - ~/Projects/data/test.jsonl
          names: null
          global_batch_size: 528
          micro_batch_size: 1
          shuffle: false
          num_workers: 4
          pin_memory: true
          max_seq_length: 4096
          min_seq_length: 1
          drop_last: true
          context_key: input
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: true
          separate_prompt_and_response_with_newline: true
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: context
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          metric:
            name: loss
            average: null
            num_classes: null
      optim:
        name: fused_adam
        lr: 1.0e-06
        weight_decay: 0.1
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          monitor: validation_loss
          min_lr: 1.0e-08
          warmup_steps: 1000
          last_epoch: -1
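
For reference, the batch and parallelism sizes above are at least internally consistent, assuming the usual Megatron accounting where the FSDP/data-parallel size is world_size / (TP x PP). A quick check:

# Sanity-check of the sizes in the config above (assumes dp = world / (tp * pp)).
nodes, gpus_per_node = 6, 8
tp, pp = 4, 1
gbs, mbs = 528, 1

world = nodes * gpus_per_node            # 48 ranks
dp = world // (tp * pp)                  # 12 data-parallel (FSDP) shards
assert world % (tp * pp) == 0
assert gbs % (dp * mbs) == 0             # global batch must divide evenly
print("gradient accumulation steps:", gbs // (dp * mbs))  # 44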

Expected behavior

Llama2 70B SFT with FSDP runs to completion without error.

Environment details
Image: nvcr.io/nvidia/nemo:24.03.01.framework
Running on a Slurm cluster.

xjohnxjohn commented 4 months ago

@satheeshkatipomu Which tool did you use to convert the Llama2 70B checkpoint from Hugging Face to NeMo format?

satheeshkatipomu commented 4 months ago

I used the convert_llama_hf_to_nemo.py script to convert the Llama2 70B model from Hugging Face format to NeMo format. Here is the exact command:

python3 -u /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=/workspace/llama2_models --output_path=/workspace/llama2_models/llama2-70b-base.nemo
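
In case it helps rule out a bad conversion: loading only the config from the resulting archive is a cheap way to confirm the .nemo file is readable. A sketch (not re-verified in this exact container image; it relies on the generic restore_from(..., return_config=True) path):

# Sketch: read just the model config (no weights) from the converted .nemo file.
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel

cfg = MegatronGPTModel.restore_from(
    restore_path="/workspace/llama2_models/llama2-70b-base.nemo",
    return_config=True,  # unpack the stored model config only
)
print(cfg.num_layers, cfg.hidden_size, cfg.num_attention_heads)
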
github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

zirui commented 2 months ago

Is there any solution to this issue?