epfLLM / Megatron-LLM

distributed trainer for LLMs

Assert in verify_correctness for mistral-7B #107

Open · abgoswam opened 2 weeks ago

abgoswam commented 2 weeks ago

I am following the getting-started guide with the Mistral-7B model.

It looks like an issue when building the dataset iterators before training. Any pointers?

Here are my steps:

1. Convert Mistral-7B-v0.1 (works)

python hf_to_megatron.py --size 7 --out out_mistral_7b --model-path mistralai/Mistral-7B-v0.1 --cache-dir cache_mistral_7b mistral

2. Preprocess the data (works)

python tools/preprocess_data.py \
        --input=./my_long_corpus_mistral/my_long_corpus_4096.jsonl \
        --output_prefix=my_long_corpus_4096_mistral \
        --vocab_file=./weights_conversion/out_mistral_7b/tokenizer.model \
        --tokenizer_type=SentencePieceTokenizer \
        --workers=2 \
        --dataset_impl=mmap \
        --chunk_size=32
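
For reference, the input file here is JSON-lines with one document per line. Assuming the default --json_keys (a single text field per document, as in upstream Megatron), a minimal sketch for producing such a file looks like the following; the file name and contents are made up:

import json

# Toy corpus in the JSON-lines layout that preprocess_data.py consumes,
# assuming the default --json_keys, i.e. one "text" field per document.
docs = [
    {"text": "First toy document."},
    {"text": "Second toy document, a bit longer so it tokenizes into several tokens."},
]

with open("toy_corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")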

3. Verify correctness of model conversion (FAILED)

DISTRIBUTED_ARGS="--nproc_per_node 1 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
LLAMA_ARGS="--use_rms_norm --glu_activation swiglu --no_tie_embed_logits --no_new_tokens --layernorm_epsilon 1e-5"
COMMON_ARGS="--hidden_dropout 0.0 --attention_dropout 0.0 --no_bias_gelu_fusion"

torchrun $DISTRIBUTED_ARGS verify_correctness.py \
    --model_name=mistral \
    --model_size=7 \
    --load=./weights_conversion/out_mistral_7b \
    --data_path=./my_long_corpus_mistral/my_long_corpus_4096_mistral_text_document \
    --tokenizer_type=SentencePieceTokenizer \
    --vocab_file=./weights_conversion/out_mistral_7b/tokenizer.model \
    --huggingface_cache=./weights_conversion/cache_mistral_7b/ \
    --huggingface_device=cuda:1 \
    $COMMON_ARGS $LLAMA_ARGS 

I am hitting the following assert:

 > loading shuffle-idx mapping from ./my_long_corpus_mistral/my_long_corpus_4096_mistral_text_document_valid_indexmap_100ns_32768sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of tokens: 1229100
    total number of samples: 113
    total number of epochs: 3
 > WARNING: could not find index map files, building the indices on rank 0 ...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 217, in <module>
[rank0]:     main()
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 179, in main
[rank0]:     data_iterator, _, _ = build_train_valid_test_data_iterators(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/training.py", line 911, in build_train_valid_test_data_iterators
[rank0]:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/finetune.py", line 179, in data_provider
[rank0]:     train_ds, valid_ds, test_ds = builder(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 35, in build_train_valid_test_datasets
[rank0]:     return _build_train_valid_test_datasets(data_prefix[0],
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 201, in _build_train_valid_test_datasets
[rank0]:     test_dataset = _f(2, 'test')
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 193, in _f
[rank0]:     dataset = GPTDataset(name, data_prefix,
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 234, in __init__
[rank0]:     self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 324, in _build_index_mappings
[rank0]:     assert last_epoch_num_samples < (num_samples_per_epoch + 1), \
[rank0]: AssertionError: last epoch number of samples exceeded max value.
E0901 18:13:33.273000 139940874703296 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1963981) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
verify_correctness.py FAILED
------------------------------------------------------------
abgoswam commented 2 weeks ago

cc @martinjaggi

abgoswam commented 1 week ago

I can also reproduce the same error with meta-llama/Llama-2-7b-hf:

DISTRIBUTED_ARGS="--nproc_per_node 1 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
LLAMA_ARGS="--use_rms_norm --glu_activation swiglu --no_tie_embed_logits --no_new_tokens --layernorm_epsilon 1e-5"
COMMON_ARGS="--hidden_dropout 0.0 --attention_dropout 0.0 --no_bias_gelu_fusion"

torchrun $DISTRIBUTED_ARGS verify_correctness.py \
    --model_name=llama2 \
    --model_size=7 \
    --load=./weights_conversion/out_llama2_7b \
    --data_path=./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document \
    --tokenizer_type=SentencePieceTokenizer \
    --vocab_file=./weights_conversion/out_llama2_7b/tokenizer.model \
    --huggingface_cache=./weights_conversion/cache_llama2_7b/ \
    --huggingface_device=cuda:1 \
    $COMMON_ARGS $LLAMA_ARGS 

Pasting the full output:

/home/aiscuser/.local/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting seq_length to 4096 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to rotary from checkpoint
Setting bias_droput_fusion to False from checkpoint
Setting parallel_attn to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for use_checkpoint_args:True                        with use_checkpoint_args:False
setting global batch size to 1
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  baseline_device ................................. cuda:1
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_droput_fusion .............................. False
  bias_gelu_fusion ................................ False
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  cache_dir ....................................... weights_conversion/cache_llama2_7b
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_impl ....................................... infer
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  data_type ....................................... gpt
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 4096
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  eval_only ....................................... False
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 11008
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_e4m3 ........................................ False
  fp8_hybrid ...................................... False
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 1
  glu_activation .................................. swiglu
  gradient_accumulation_fusion .................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.0
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  iteration ....................................... release
  kv_channels ..................................... 128
  layernorm_epsilon ............................... 1e-05
  lima_dropout .................................... False
  load ............................................ ./weights_conversion/out_llama2_7b
  load_iters ...................................... None
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 1.0
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 4096
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  metrics ......................................... []
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mmap_warmup ..................................... False
  model_name ...................................... llama2
  model_size ...................................... 7
  model_type ...................................... encoder_or_decoder
  new_tokens ...................................... False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 32
  num_attention_heads_kv .......................... 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  optimizer ....................................... adam
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 32000
  parallel_attn ................................... False
  parallel_layernorm .............................. False
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... PositionEmbeddingType.rotary
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ 1
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  rope_scaling_factor ............................. 1.0
  rope_theta ...................................... 10000.0
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  scalar_loss_mask ................................ 0.0
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 4096
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_iters ...................................... []
  sliding_window_size ............................. None
  split ........................................... 969, 30, 1
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  tie_embed_logits ................................ False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. SentencePieceTokenizer
  train_data_path ................................. None
  train_iters ..................................... 10
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  use_bias ........................................ False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_contiguous_buffers_in_local_ddp ............. True
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_one_sent_docs ............................... False
  use_post_ln ..................................... False
  use_ring_exchange_p2p ........................... False
  use_rms_norm .................................... True
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_extra_ids_list ............................ None
  vocab_file ...................................... ./weights_conversion/out_llama2_7b/tokenizer.model
  wandb_api_key ................................... None
  wandb_entity .................................... meditron
  wandb_id ........................................ None
  wandb_logger .................................... False
  wandb_name ...................................... None
  wandb_project ................................... None
  wandb_resume .................................... allow
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> building SentencePieceTokenizer tokenizer ...
Special tokens: {'<s>': 1, '</s>': 2}
 > padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
Starting megatron vs huggingface verification
[rank0]:[W901 18:43:30.548352164 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Loading our model!
Getting megatron model...
Building model ...
/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/model/llama_model.py:36: UserWarning: Llama is not intended to use bias_dropout_fusion
  warnings.warn("Llama is not intended to use bias_dropout_fusion")
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6738415616
/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/optimizer/optimizer.py:711: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:78.)
  self._scale = torch.cuda.FloatTensor([1.0])
> learning rate decay style: linear
node-0:2046259:2046259 [0] NCCL INFO Bootstrap : Using eth0:10.8.37.80<0>
node-0:2046259:2046259 [0] NCCL INFO cudaDriverVersion 12050
NCCL version 2.21.5+cuda12.5
node-0:2046259:2047044 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
node-0:2046259:2047044 [0] NCCL INFO P2P plugin IBext_v8
node-0:2046259:2047044 [0] NCCL INFO NET/IB : No device found.
node-0:2046259:2047044 [0] NCCL INFO NET/IB : No device found.
node-0:2046259:2047044 [0] NCCL INFO NET/Socket : Using [0]eth0:10.8.37.80<0>
node-0:2046259:2047044 [0] NCCL INFO Using non-device net plugin version 0
node-0:2046259:2047044 [0] NCCL INFO Using network Socket
node-0:2046259:2047044 [0] NCCL INFO ncclCommInitRank comm 0x55bd768d3e40 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0xd549aec3ffbca01b - Init START
node-0:2046259:2047044 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
node-0:2046259:2047044 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_width, ignoring
node-0:2046259:2047044 [0] NCCL INFO === System : maxBw 5000.0 totalBw 5000.0 ===
node-0:2046259:2047044 [0] NCCL INFO CPU/0-0 (1/2/-1)
node-0:2046259:2047044 [0] NCCL INFO + PCI[5000.0] - NIC/0-0
node-0:2046259:2047044 [0] NCCL INFO + PCI[24.0] - GPU/0-100000 (0)
node-0:2046259:2047044 [0] NCCL INFO ==========================================
node-0:2046259:2047044 [0] NCCL INFO GPU/100000 :GPU/0-100000 (0/5000.0/LOC) CPU/0-0 (1/24.0/PHB) 
node-0:2046259:2047044 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
node-0:2046259:2047044 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 40.000000/40.000000, type LOC/PIX, sameChannels 1
node-0:2046259:2047044 [0] NCCL INFO  0 : GPU/0
node-0:2046259:2047044 [0] NCCL INFO  1 : GPU/0
....
....
node-0:2046259:2047044 [0] NCCL INFO P2P Chunksize set to 131072
node-0:2046259:2047044 [0] NCCL INFO Connected all rings
node-0:2046259:2047044 [0] NCCL INFO Connected all trees
node-0:2046259:2047044 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
node-0:2046259:2047044 [0] NCCL INFO ncclCommInitRank comm 0x55bd768d3e40 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0xd549aec3ffbca01b - Init COMPLETE
 loading release checkpoint from ./weights_conversion/out_llama2_7b
 checkpoint version 3.0
  successfully loaded checkpoint from ./weights_conversion/out_llama2_7b at iteration 0
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:3199: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
(min, max) time across ranks (ms):
    load-checkpoint ................................: (5776.25, 5776.25)
Loading baseline model!
Getting huggingface model...

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:01<00:01,  1.47s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.16s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.21s/it]
Loading dataset!
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      10
    validation: 100
    test:       100
> building train, validation, and test datasets ...
Single data path provided for train, valid & test
 > building dataset index ...
    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
 > finished creating indexed dataset in 0.000472 seconds
    number of documents: 10000
    number of tokens: 1280000
 > dataset split:
    train:
     document indices in [0, 9690) total of 9690 documents
    validation:
     document indices in [9690, 9990) total of 300 documents
    test:
     document indices in [9990, 10000) total of 10 documents
node-0:2046259:2049109 [0] NCCL INFO Using non-device net plugin version 0
node-0:2046259:2049109 [0] NCCL INFO Using network Socket
node-0:2046259:2049109 [0] NCCL INFO bootstrapSplit: comm 0x55bd7bb635c0 parent 0x55bd768d3e40 rank 0 nranks 1 color -1091263299 key 0 prev 0 next 0 - DONE
node-0:2046259:2049109 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb635c0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init START
node-0:2046259:2049109 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
node-0:2046259:2049109 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_width, ignoring
node-0:2046259:2049109 [0] NCCL INFO === System : maxBw 5000.0 totalBw 5000.0 ===
node-0:2046259:2049109 [0] NCCL INFO CPU/0-0 (1/2/-1)
node-0:2046259:2049109 [0] NCCL INFO + PCI[5000.0] - NIC/0-0
node-0:2046259:2049109 [0] NCCL INFO + PCI[24.0] - GPU/0-100000 (0)
node-0:2046259:2049109 [0] NCCL INFO ==========================================
node-0:2046259:2049109 [0] NCCL INFO GPU/100000 :GPU/0-100000 (0/5000.0/LOC) CPU/0-0 (1/24.0/PHB) 
node-0:2046259:2049109 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
node-0:2046259:2049109 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 40.000000/40.000000, type LOC/PIX, sameChannels 1
...
...
node-0:2046259:2049109 [0] NCCL INFO P2P Chunksize set to 131072
node-0:2046259:2049109 [0] NCCL INFO Connected all rings
node-0:2046259:2049109 [0] NCCL INFO Connected all trees
node-0:2046259:2049109 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
node-0:2046259:2049109 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb635c0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init COMPLETE
node-0:2046259:2049114 [0] NCCL INFO Using non-device net plugin version 0
node-0:2046259:2049114 [0] NCCL INFO Using network Socket
node-0:2046259:2049114 [0] NCCL INFO bootstrapSplit: comm 0x55bd7bb6bec0 parent 0x55bd768d3e40 rank 0 nranks 1 color -1091263299 key 0 prev 0 next 0 - DONE
node-0:2046259:2049114 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb6bec0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init START
node-0:2046259:2049114 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
node-0:2046259:2049114 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_width, ignoring
node-0:2046259:2049114 [0] NCCL INFO === System : maxBw 5000.0 totalBw 5000.0 ===
node-0:2046259:2049114 [0] NCCL INFO CPU/0-0 (1/2/-1)
node-0:2046259:2049114 [0] NCCL INFO + PCI[5000.0] - NIC/0-0
node-0:2046259:2049114 [0] NCCL INFO + PCI[24.0] - GPU/0-100000 (0)
node-0:2046259:2049114 [0] NCCL INFO ==========================================
node-0:2046259:2049114 [0] NCCL INFO GPU/100000 :GPU/0-100000 (0/5000.0/LOC) CPU/0-0 (1/24.0/PHB) 
node-0:2046259:2049114 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
node-0:2046259:2049114 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 40.000000/40.000000, type LOC/PIX, sameChannels 1
...
...
node-0:2046259:2049114 [0] NCCL INFO P2P Chunksize set to 131072
node-0:2046259:2049114 [0] NCCL INFO Connected all rings
node-0:2046259:2049114 [0] NCCL INFO Connected all trees
node-0:2046259:2049114 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
node-0:2046259:2049114 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb6bec0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init COMPLETE
 > loading doc-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_train_indexmap_10ns_4096sl_1234s_doc_idx.npy
 > loading sample-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_train_indexmap_10ns_4096sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_train_indexmap_10ns_4096sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of tokens: 1240320
    total number of samples: 303
    total number of epochs: 1
 > loading doc-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_valid_indexmap_100ns_4096sl_1234s_doc_idx.npy
 > loading sample-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_valid_indexmap_100ns_4096sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_valid_indexmap_100ns_4096sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of tokens: 38400
    total number of samples: 104
    total number of epochs: 11
 > WARNING: could not find index map files, building the indices on rank 0 ...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 217, in <module>
[rank0]:     main()
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 179, in main
[rank0]:     data_iterator, _, _ = build_train_valid_test_data_iterators(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/training.py", line 911, in build_train_valid_test_data_iterators
[rank0]:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/finetune.py", line 179, in data_provider
[rank0]:     train_ds, valid_ds, test_ds = builder(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 35, in build_train_valid_test_datasets
[rank0]:     return _build_train_valid_test_datasets(data_prefix[0],
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 201, in _build_train_valid_test_datasets
[rank0]:     test_dataset = _f(2, 'test')
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 193, in _f
[rank0]:     dataset = GPTDataset(name, data_prefix,
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 234, in __init__
[rank0]:     self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 324, in _build_index_mappings
[rank0]:     assert last_epoch_num_samples < (num_samples_per_epoch + 1), \
[rank0]: AssertionError: last epoch number of samples exceeded max value.
E0901 18:44:13.294000 140017019376064 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2046259) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
verify_correctness.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-01_18:44:13
  host      : node-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2046259)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
abgoswam commented 1 week ago

It seems the error happens when building the index for the "test" split.

Setting the test split to 0 gets me past the assert:

--split 950,50,0

With this, the verification for llama2 works fine.
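
For what it's worth, here is a small, self-contained sketch of the arithmetic I believe sits behind that assert in megatron/data/gpt_dataset.py (_build_index_mappings). The variable names follow the traceback, but the derivation is my reading of upstream Megatron-LM, so treat it as an illustration rather than the exact code in this fork. Plugging in the numbers from the llama2 log above (a test split of 10 documents of ~128 tokens, seq_length 4096, and a target of 100 test samples), the split holds fewer tokens than a single sample, so num_samples_per_epoch works out to 0 and the check cannot hold; with --split 950,50,0 the test index is never built, which is why the assert goes away.

# Standalone sketch of the index-mapping check that fails above; names follow
# the traceback, logic follows my reading of upstream Megatron-LM.
def check_split(tokens_in_split: int, seq_length: int, num_samples: int) -> None:
    # Epochs needed over the split so that num_samples full sequences exist.
    num_epochs = 0
    total_tokens = 0
    while total_tokens < num_samples * seq_length + 1:
        num_epochs += 1
        total_tokens += tokens_in_split

    # Full samples obtainable from a single pass over the split.
    num_samples_per_epoch = (tokens_in_split - 1) // seq_length
    # Samples that must come out of the final epoch.
    last_epoch_num_samples = num_samples - (num_epochs - 1) * num_samples_per_epoch

    print(f"epochs={num_epochs}, samples/epoch={num_samples_per_epoch}, "
          f"last epoch={last_epoch_num_samples}")
    assert last_epoch_num_samples < (num_samples_per_epoch + 1), \
        "last epoch number of samples exceeded max value."

# Test split from the llama2 run: 10 docs x ~128 tokens, seq_length 4096,
# 100 requested test samples -> num_samples_per_epoch == 0, assert fires.
check_split(tokens_in_split=1280, seq_length=4096, num_samples=100)

Running this raises an AssertionError with the same message as in the traceback; any split whose token count is smaller than one full sequence will hit it.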