NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Context parallel does not work in some cases that work well when using megatron-lm directly #8992

Closed: XLzed closed this issue 1 week ago

XLzed commented 3 months ago

Describe the bug

Context parallel does not work in some cases, such as pretraining llama-34b on 64 A800 GPUs with seqlen >= 32768, while the same configuration runs fine when using megatron-lm directly. I want to use NeMo's SFT features such as sequence packing, so I hope this can be resolved soon.

Environment details

Test cases: (image attachment)

Error message details

1. Kernel assertion
    ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
    /opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

2. NCCL timeout (tp=2, pp=4, cp=8)

    terminate called after throwing an instance of 'c10::DistBackendError'
    what():  [PG 5 Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=910, OpType=_ALLGATHER_BASE, NumelIn=16777216, NumelOut=33554432, Timeout(ms)=600000) ran for 600021 milliseconds before timing out.

3. NCCL timeout (tp=8, pp=1, cp=8)

[rank7]:[E ProcessGroupNCCL.cpp:574] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:574] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21161, OpType=SEND, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:574] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15334, OpType=_ALLGATHER_BASE, NumelIn=8388608, NumelOut=67108864, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.

pretrain_llama34b_config.yaml

name: megatron_llama_34b

trainer:
  devices: 8
  num_nodes: 4
  accelerator: gpu
  precision: bf16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  use_distributed_sampler: False
  max_epochs: -1 # PTL default. In practice, max_steps will be reached first. 
  max_steps: 10 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 1
  val_check_interval: 10
  limit_val_batches: 1
  limit_test_batches: 1
  accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models
  gradient_clip_val: 1.0
  benchmark: False
  enable_model_summary: False # default PTL callback for this does not support model parallelism, instead we log manually
  num_sanity_val_steps: 0

exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: megatron_llama_34b
  create_wandb_logger: False
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: False
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 1
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits
    filename: megatron_gpt--{val_loss:.2f}-{step}-{consumed_samples}
    model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
    save_last: false

model:
  mcore_gpt: True
  # specify micro_batch_size, global_batch_size, and model parallelism
  # gradient accumulation will be done automatically based on data_parallel_size
  micro_batch_size: 1 # limited by GPU memory
  global_batch_size: 1 # will use more micro batches to reach global batch size
  tensor_model_parallel_size: 8 # intra-layer model parallelism
  pipeline_model_parallel_size: 1 # inter-layer model parallelism
  context_parallel_size: 4
  virtual_pipeline_model_parallel_size: null # interleaved pipeline
  use_flash_attention: True

  encoder_seq_length: 32768
  max_position_embeddings: ${.encoder_seq_length}
  num_layers: 48 # 7b: 32 | 13b: 40 | 70b: 80
  hidden_size: 8192 # 7b: 4096 | 13b: 5120 | 70b: 8192
  ffn_hidden_size: 22016 # Transformer FFN hidden size. Usually 4 * hidden_size. | 7b: 11008 | 13b: 13824 | 70b: 28672
  num_attention_heads: 64 # 7b: 32 | 13b: 40 | 70b: 64
  use_scaled_init_method: True # use scaled residuals initialization
  hidden_dropout: 0.0 # Dropout probability for hidden state transformer.
  attention_dropout: 0.0 # Dropout probability for attention
  ffn_dropout: 0.0 # Dropout probability in the feed-forward layer.
  kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null
  normalization: 'rmsnorm' # Normalization layer to use. Options are 'layernorm', 'rmsnorm'
  layernorm_epsilon: 1e-5
  do_layer_norm_weight_decay: False # True means weight decay on all params
  make_vocab_size_divisible_by: 256 # Pad the vocab size to be divisible by this value for computation efficiency.
  pre_process: True # add embedding
  post_process: True # add pooler
  persist_layer_norm: True # Use of persistent fused layer norm kernel.
  bias: False # Whether to use bias terms in all weight matrices.
  activation: 'fast-swiglu' # Options ['gelu', 'geglu', 'swiglu', 'reglu', 'squared-relu', 'fast-geglu', 'fast-swiglu', 'fast-reglu']
  headscale: False # Whether to learn extra parameters that scale the output of each self-attention head.
  transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
  openai_gelu: False # Use OpenAI's GELU instead of the default GeLU
  normalize_attention_scores: True # Whether to scale the output Q * K^T by 1 / sqrt(hidden_size_per_head). This arg is provided as a configuration option mostly for compatibility with models that have been weight-converted from HF. You almost always want to set this to True.
  attention_type: 'multihead' # Attention type. Options ['multihead']
  share_embeddings_and_output_weights: False # Share embedding and output layer weights.
  overlap_p2p_comm: False # Overlap p2p communication with computes. This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1
  batch_p2p_comm: True # Batch consecutive inter-peer send/recv operations. This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1
  num_query_groups: 8 # Number of query groups for group query attention. If None, normal attention is used. | 7b: 32 | 13b: 40 | 70b: 8

  position_embedding_type: 'rope' # Position embedding type. Options ['learned_absolute', 'rope']
  rotary_percentage: 1.0 # If using position_embedding_type=rope, then the per head dim is multiplied by this.
  init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.
  apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.

  tokenizer:
    library: 'sentencepiece'
    type: null
    model: /workspace/Models/Original/CodeLlama-34b-hf/tokenizer.model
    vocab_file: null
    merge_file: null 
    delimiter: null # only used for tabular tokenizer
    sentencepiece_legacy: False # Legacy=True allows you to add special tokens to sentencepiece tokenizers.

  # Mixed precision
  # native_amp_init_scale: 4294967296 # 2 ** 32
  # native_amp_growth_interval: 1000
  # hysteresis: 2 # Gradient scale hysteresis
  # fp32_residual_connection: False # Move residual connections to fp32
  # fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16

  # Megatron O2-style half-precision
  megatron_amp_O2: True # Enable O2-level automatic mixed precision using main parameters
  grad_allreduce_chunk_size_mb: 125

  # Fusion
  grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce. Only used with O2 and no pipeline parallelism.
  gradient_accumulation_fusion: True # Fuse weight gradient accumulation to GEMMs. Only used with pipeline parallelism and O2.
  bias_activation_fusion: False # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function.
  bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition.
  masked_softmax_fusion: False # Use a kernel that fuses the attention softmax with its mask.
  get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages.
  apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope

  # Miscellaneous
  seed: 1234
  resume_from_checkpoint: null # manually set the checkpoint file to load from
  use_cpu_initialization: False # Init weights on the CPU (slow for large models)
  onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
  apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this
  gradient_as_bucket_view: True # PyTorch DDP argument. Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory)
  sync_batch_comm: False # Enable stream synchronization after each p2p communication between pipeline stages

  ## Activation Checkpointing
  activations_checkpoint_granularity: null # 'selective' or 'full' 
  activations_checkpoint_method: null # 'uniform', 'block'
  activations_checkpoint_num_layers: null
  num_micro_batches_with_partial_activation_checkpoints: null
  activations_checkpoint_layers_per_pipeline: null

  sequence_parallel: True

  ## Transformer Engine
  transformer_engine: True
  fp8: False # enables fp8 in TransformerLayer forward
  fp8_e4m3: False # sets fp8_format = recipe.Format.E4M3 
  fp8_hybrid: True # sets fp8_format = recipe.Format.HYBRID
  fp8_margin: 0 # scaling margin 
  fp8_interval: 1 # scaling update interval
  fp8_amax_history_len: 1024 # Number of steps for which amax history is recorded per tensor
  fp8_amax_compute_algo: max # 'most_recent' or 'max'. Algorithm for computing amax from history
  reduce_amax: True # Perform reduction to sync amax tensors across GPUs after every iteration
  use_emha: False # Use fused multi-head attention for large sequence-length. Note this is not yet supported. Please set to False.

  data:
    data_prefix: 
      - 1.0
      - /workspace/Jianzhou/Datasets/Nemo/pretrain/EVAL_C4_text_document
    index_mapping_dir: null # path to save index mapping .npy files, by default will save in the same location as data_prefix
    data_impl: mock
    splits_string: 900,50,50
    seq_length: ${model.encoder_seq_length}
    skip_warmup: True
    num_workers: 0
    dataloader_type: single # cyclic
    reset_position_ids: False # Reset position ids after end-of-document token
    reset_attention_mask: False # Reset attention mask after end-of-document token
    eod_mask_loss: False # Mask loss for the end of document tokens
    validation_drop_last: True # Set to False if the last partial validation samples are to be consumed
    no_seqlen_plus_one_input_tokens: False # Set to True to disable fetching (sequence length + 1) input tokens, instead get (sequence length) input tokens and mask the last token
    pad_samples_to_global_batch_size: False # Set to True if you want to pad the last partial batch with -1's to equal global batch size
    shuffle_documents: True # Set to False to disable documents shuffling. Sample index will still be shuffled

  # Nsys profiling options
  nsys_profile:
    enabled: False
    start_step: 5  # Global batch to start profiling
    end_step: 6 # Global batch to end profiling
    ranks: [0] # Global rank IDs to profile
    gen_shape: False # Generate model and kernel details including input shapes

  optim:
    name: distributed_fused_adam
    # overlap_grad_sync: True
    # overlap_param_sync: True
    # contiguous_grad_buffer: True
    lr: 2e-4
    weight_decay: 0.01 
    betas: 
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 500
      constant_steps: 50000
      min_lr: 2e-5
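
For reference, here is a quick sanity check of the parallel layout implied by the config above (a minimal sketch in Python; it only assumes the standard Megatron decomposition world_size = TP * PP * CP * DP):

# Parallel layout implied by pretrain_llama34b_config.yaml
devices = 8    # trainer.devices
num_nodes = 4  # trainer.num_nodes
tp = 8         # model.tensor_model_parallel_size
pp = 1         # model.pipeline_model_parallel_size
cp = 4         # model.context_parallel_size

world_size = devices * num_nodes       # 32 GPUs for this config
assert world_size % (tp * pp * cp) == 0, "TP*PP*CP must divide the world size"
dp = world_size // (tp * pp * cp)      # data_parallel_size = 1
print(f"world_size={world_size}, data_parallel_size={dp}")
# The failing runs in the error messages above (tp=2, pp=4, cp=8 and tp=8, pp=1, cp=8)
# correspond to the 64-GPU setup mentioned in the bug description (dp=1 there as well).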
czhang99 commented 1 month ago

I met the same timeout issue. It comes from the NCCL default timeout value not being set effectively when NeMo calls Megatron. With the NCCL backend, the default timeout is 10 minutes, which is why the NCCL timeout appears as Timeout(ms)=600000 (600000 milliseconds = 10 minutes).

You need to make code changes to pass a larger timeout value (e.g. 30 minutes), as done in https://github.com/NVIDIA/Megatron-LM/commit/e69187bc3679ea5841030a165d587bb48b56ee77, and also when calling parallel_state.initialize_model_parallel from NeMo: https://github.com/NVIDIA/NeMo/blob/v1.23.0/nemo/collections/nlp/parts/nlp_overrides.py#L125
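
For illustration, here is a minimal sketch of the underlying mechanism using plain torch.distributed (the timeout keyword on init_process_group/new_group is standard PyTorch; the exact argument name used by the Megatron-LM commit and the NeMo call site may differ):

from datetime import timedelta
import torch.distributed as dist

# Raise the NCCL collective timeout above the observed 10-minute default.
# The linked Megatron-LM commit threads a similar timeout through its own
# process-group creation; NeMo would need to forward it when calling
# parallel_state.initialize_model_parallel (argument name may differ).
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))

# Additional process groups (e.g. the TP/PP/CP groups Megatron creates)
# accept the same keyword:
# group = dist.new_group(ranks=list_of_ranks, timeout=timedelta(minutes=30))  # list_of_ranks is hypothetical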

Alternatively, upgrade to a later version of Megatron-LM that includes the commit above.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 7 days since being marked as stale.