NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Timeout DDP training using BioNeMo default configs #8441

Status: Closed (pengzhangzhi closed this issue 7 months ago)

pengzhangzhi commented 8 months ago

I am using the default configs, code, and data to train a model within the BioNeMo framework. The timeout occurs in the middle of training.


Epoch 0:   6%|██                               | 32040/500150 [6:28:43<94:39:17,  1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7][E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.
^C
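
For context, the Timeout(ms)=1800000 in the watchdog messages is the default 30-minute NCCL collective timeout. The stalled rank still needs to be found, but the window can be widened while debugging. The sketch below assumes a plain PyTorch Lightning DDPStrategy; NeMo/BioNeMo builds its own NLPDDPStrategy internally, so treat this as illustrative rather than the framework's documented knob.

# Minimal sketch: raise the default 30-minute NCCL collective timeout while debugging.
# Assumes plain PyTorch Lightning; BioNeMo constructs its own NLPDDPStrategy internally,
# so this is illustrative only, not the framework's documented configuration path.
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

trainer = pl.Trainer(
    devices=8,
    accelerator="gpu",
    strategy=DDPStrategy(timeout=timedelta(hours=2)),  # default is timedelta(minutes=30)
)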

The configs are:

name: esm2nv
do_training: True # set to false if data preprocessing steps must be completed
do_testing: False # set to true to run evaluation on test data after training, requires test_dataset section
restore_from_path: null # used when starting from a .nemo file

trainer:
  devices: 8 # number of GPUs or CPUs
  num_nodes: 1 
  accelerator: gpu #gpu or cpu
  precision: 16 #16 or 32
  logger: False # logger is provided by NeMo exp_manager
  enable_checkpointing: False # checkpointing is done by NeMo exp_manager
  replace_sampler_ddp: False # use NeMo Megatron samplers
  max_epochs: null # use max_steps instead with NeMo Megatron model
  log_every_n_steps: 10  # number of iterations between logging
  val_check_interval: 15e4
  limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
  limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False
  max_steps: 500000

exp_manager:
  name: ${name}
  exp_dir: ${oc.env:BIONEMO_HOME}/results/nemo_experiments/${.name}/${.wandb_logger_kwargs.name}
  explicit_log_dir: ${.exp_dir}
  create_wandb_logger: True 
  create_tensorboard_logger: True
  wandb_logger_kwargs:
    project: ${name}_pretraining
    name: ${name}_pretraining
    group: ${name}
    job_type: Localhost_nodes_${trainer.num_nodes}_gpus_${trainer.devices}
    notes: "date: ${now:%y%m%d-%H%M%S}"
    tags:
      - ${name}
    offline: False # set to True if there are issues uploading to WandB during training
  resume_if_exists: True # automatically resume if checkpoint exists
  resume_ignore_no_checkpoint: True # leave as True, will start new training if resume_if_exists is True but no checkpoint exists
  create_checkpoint_callback: True # leave as True, use exp_manager for checkpoints
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10 # number of checkpoints to save
    mode: min  # use min or max of monitored metric to select best checkpoints
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    filename: 'megatron_bert--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}

model:
  precision: 16
  # ESM2-specific parameters
  micro_batch_size: 100
  seq_length: 1024
  num_layers: 6
  hidden_size: 320
  ffn_hidden_size: ${multiply:${model.hidden_size}, 4} # Transformer FFN hidden size. Usually 4 * hidden_size.
  num_attention_heads: 20
  megatron_legacy: false 
  position_embedding_type: rope # ESM2 uses rotary position embeddings (RoPE) to extrapolate to longer sequences unseen during training
  hidden_dropout: 0 # ESM2 removes dropout from hidden layers and attention
  embedding_use_attention_mask: True # ESM2 uses attention masking on the embeddings
  embedding_token_dropout: True # ESM2 rescales embeddings based on masked tokens
  mask_token_id: ${.tokenizer.mask_id} # Needed for token dropout rescaling
  attention_dropout: 0.0 # ESM2 does not use attention dropout
  normalize_attention_scores: False # ESM2 does not use normalized attention scores
  tensor_model_parallel_size: 1 # model parallelism
  pipeline_model_parallel_size: 1 # model parallelism. If enabled, you need to set data.dynamic_padding to False as pipeline parallelism requires fixed-length padding.
  bias_gelu_fusion: False
  # NOTE: these are compatibility features
  use_esm_attention: True # Use specific attention modifications for ESM2
  esm_gelu: True # ESM2 uses a custom GELU in the MLP layer
  use_pt_layernorm: False # Use pytorch implementation of layernorm instead of fused nemo layernorm. Important for equivalency of results with ESM2.
  use_pt_mlp_out: False # Use pytorch implementation of attention output mlp instead of the nemo version. Important for equivalency of results with ESM2.

  # Not specified in ESM2 models:
  # model architecture
  max_position_embeddings: ${.seq_length}
  encoder_seq_length: ${.seq_length}

  optim:
    name: fused_adam # fused optimizers used by Megatron model
    lr: 4e-4
    weight_decay: 0.01
    betas:
      - 0.9
      - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 2000
      constant_steps: 50000
      min_lr: 4e-5

  init_method_std: 0.02 # Standard deviation of the zero-mean normal distribution used for weight initialization.
  kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null
  apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.
  layernorm_epsilon: 1e-5
  make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency.
  pre_process: True # add embedding
  post_process: True # add pooler
  bert_binary_head: False # BERT binary head
  resume_from_checkpoint: null # manually set the checkpoint file to load from

  # NOTE: is this one of the new fields?
  masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
  tokenizer: 
    # Use ESM2 tokenizers from HF
    library: 'huggingface'
    type: 'BertWordPieceLowerCase'
    model_name: "???"
    mask_id: 32
    model: null
    vocab_file: null
    merge_file: null

  # precision
  native_amp_init_scale: 4294967296 # 2 ** 32
  native_amp_growth_interval: 1000
  fp32_residual_connection: False # Move residual connections to fp32
  fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16

  # miscellaneous
  seed: 1234
  use_cpu_initialization: False # Init weights on the CPU (slow for large model)
  onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.

  # not implemented in NeMo yet
  activations_checkpoint_method: null # 'uniform', 'block'
  activations_checkpoint_num_layers: 1 

  data:
    ngc_registry_target: uniref50_2022_05
    ngc_registry_version: v23.06
    data_prefix: "" # must be null or ""
    num_workers: 8
    dataloader_type: single # cyclic
    reset_position_ids: False # Reset position ids after end-of-document token
    reset_attention_mask: False # Reset attention mask after end-of-document token
    eod_mask_loss: False # Mask loss for the end of document tokens
    masked_lm_prob: 0.15 # Probability of replacing a token with mask.
    short_seq_prob: 0.1 # Probability of producing a short sequence.
    skip_lines: 0
    drop_last: False
    pin_memory: False
    dynamic_padding: False # If True, each batch is padded to the maximum sequence length within that batch.
                           # Set to False when model.pipeline_model_parallel_size > 1, as pipeline parallelism requires fixed-length padding.

    # Below is the configuration for UF90 resampling.
    force_regen_sample_mapping: false # When true, a new uf90 sample mapping is always generated; otherwise, change the seed to regenerate it.
    data_impl: "csv_mmap"
    # Supported kwargs (with default values): 
    #     text_mmap (newline_int=10, header_lines=0, workers=None, sort_dataset_paths=True)
    #     csv_mmap (newline_int=10, header_lines=0, workers=None, sort_dataset_paths=True, data_col=1, data_sep=",")
    dataset: # inclusive range of data files to load, e.g. x[000..049], or a single file, e.g. x000
      train: x[000..049]
      test: x[000..049]
      val: x[000..049]
    data_impl_kwargs:
      csv_mmap:
        data_col: 1 # 0-based
        header_lines: 1
        # Rest of these are inherited

    uf50_datapath: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uniref50_train_filt.fasta
    uf90_datapath: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uniref90membersandreps_ur50trainfiltreps.fasta
    cluster_mapping_tsv: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/mapping.tsv
    # TODO These need to be updated to values we actually like
    dataset_path: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uf50 # parent directory for data, contains train / val / test folders. Needs to be writeable for index creation.
    # NOTE test_size and val_size must be smaller than the total dataset size.
    # you can check this with grep -c '>' <uf50_datapath.fasta>
    val_size: 5000
    test_size: 1000000
    sort_fastas: false # If true, assumes the input files are not sorted, and sorts them before creating the cluster mapping. Unsorted fastas will break the cluster map.
    uf90:
      uniref90_path: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/uf90/ # created and populated by preprocessing
      dataset: 
        uf90_csvs: x[000..049] # created and populated by preprocessing, this key is a directory inside uniref90_path
      data_impl: 'csv_fields_mmap'
      data_impl_kwargs:
        csv_fields_mmap:
          header_lines: 1
          newline_int: 10 # byte-value of newline
          workers: ${model.data.num_workers} # number of workers when creating missing index files (null defaults to cpu_num / 2)
          sort_dataset_paths: True # if True datasets will be sorted by name
          data_sep: ',' # string to split text into columns
          data_fields:
            sequence: 3
            sequence_id: 1
      index_mapping_dir: ${model.data.index_mapping_dir}

    use_upsampling: True # whether the data should be upsampled to the maximum number of training steps
    seed: ${model.seed} # Random seed
    max_seq_length: ${model.seq_length} # Maximum input sequence length. Longer sequences are truncated
    modify_percent: 0.1 # Percentage of characters in a protein sequence to modify. (Modification means replacing with another amino acid or with a mask token)
    perturb_percent: 0.5 # Of the modify_percent, what percentage of characters are to be replaced with another amino acid.
    index_mapping_dir: ${oc.env:BIONEMO_HOME}/data/uniref50_90_202104_esm2nv_v1.0/

  dwnstr_task_validation:
    enabled: False 
    dataset:
      class: bionemo.model.core.dwnstr_task_callbacks.PerTokenPredictionCallback
      task_type: token-level-classification
      infer_target: bionemo.model.protein.esm1nv.infer.ESM1nvInference
      max_seq_length: ${model.seq_length}
      emb_batch_size: 128
      batch_size: 128
      num_epochs: 10
      shuffle: True
      num_workers: 8
      task_name: secondary_structure
      dataset_path: ${oc.env:BIONEMO_HOME}/data/FLIP/${model.dwnstr_task_validation.dataset.task_name}
      dataset:
        train: x000
        test: x000
      sequence_column: "sequence" # name of column with protein sequence in csv file
      target_column: [ "3state", "resolved" ] # names of label columns in csv file
      target_sizes: [ 3, 2 ] # number of classes in each label
      mask_column: [ "resolved", null ] # names of mask columns in csv file, masks must be 0 or 1
      random_seed: 1234
      optim:
        name: adam
        lr: 0.0001
        betas:
          - 0.9
          - 0.999
        eps: 1e-8
        weight_decay: 0.01
        sched:
          name: WarmupAnnealing
          min_lr: 0.00001
          last_epoch: -1
          warmup_ratio: 0.01
          max_steps: 1000
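
As a sanity check, the counters in the progress bar line up with this config (a back-of-the-envelope sketch; it assumes the 8 devices are pure data parallel, which matches tensor_model_parallel_size and pipeline_model_parallel_size both being 1):

# Back-of-the-envelope check that the progress-bar counters match the config above.
micro_batch_size = 100          # model.micro_batch_size
devices = 8                     # trainer.devices (all data parallel here)
accumulate_grad_batches = 1     # trainer.accumulate_grad_batches

global_batch_size = micro_batch_size * devices * accumulate_grad_batches
print(global_batch_size)                 # 800 samples per optimizer step

global_step = 32040                      # from the progress bar at the time of the hang
print(global_batch_size * global_step)   # 25_632_000, i.e. consumed_samples ~= 2.56e+7
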
lakshmi-speak commented 8 months ago

Running into a similar issue training the conformer large model in a docker container with the latest nvcr.io/nvidia/nemo:23.10 image, on p2.16xlarge (V100 instances). What is your training environment like?

pengzhangzhi commented 8 months ago

This is my docker container

nvcr.io/nvidia/clara/bionemo-framework:latest   "/workspace/bionemo/…"   4 days ago   Up 13 hours                        bionemo
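
To make the environments easier to compare, a minimal sketch for capturing the relevant stack versions inside the container (plain PyTorch calls; nothing BioNeMo-specific is assumed):

# Minimal sketch: record the CUDA/NCCL stack inside the container for comparison.
import torch

print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())
print("nccl  :", torch.cuda.nccl.version())
print("gpus  :", torch.cuda.device_count())
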
github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

ROZBEH commented 5 months ago

I am facing the same issue on a multi-node, multi-GPU setup without Docker. I am using Slurm to run the job.