Closed: PurvangL closed this issue 4 months ago.
@maanug-nv, could you take a look at this one?
@ericharper, @maanug-nv: I also tried running on a Slurm cluster; please find the logs below.
[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:176]
name: megatron_gpt_sft
trainer:
  devices: 8
  accelerator: gpu
  num_nodes: 2
  precision: bf16
  logger: false
  enable_checkpointing: false
  use_distributed_sampler: false
  max_epochs: 9999
  max_steps: 50
  log_every_n_steps: 10
  val_check_interval: 1.0
  gradient_clip_val: 1.0
exp_manager:
  explicit_log_dir: /workspace/result
  exp_dir: null
  name: ${name}
  create_wandb_logger: false
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: validation_loss
    save_top_k: 2
    mode: max
    save_nemo_on_train_end: true
    filename: megatron_gpt_sft--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{consumed_samples}
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: false
model:
  seed: 1234
  tensor_model_parallel_size: 4
  pipeline_model_parallel_size: 4
  global_batch_size: 128
  micro_batch_size: 1
  restore_from_path: /workspace/llama27b.nemo
  resume_from_checkpoint: null
  save_nemo_on_validation_end: true
  sync_batch_comm: false
  megatron_amp_O2: true
  sequence_parallel: true
  activations_checkpoint_granularity: selective
  activations_checkpoint_method: uniform
  activations_checkpoint_num_layers: null
  activations_checkpoint_layers_per_pipeline: null
  answer_only_loss: true
  gradient_as_bucket_view: false
  seq_len_interpolation_factor: null
  use_flash_attention: null
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  data:
    chat: false
    chat_prompt_tokens:
      system_turn_start: <extra_id_0>
      turn_start: <extra_id_1>
      label_start: <extra_id_2>
      end_of_turn: "\n"
      end_of_name: "\n"
    train_ds:
      file_names:
      - /workspace/self_instruct_data/training.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: true
      num_workers: 0
      memmap_workers: null
      pin_memory: true
      max_seq_length: 512
      min_seq_length: 1
      drop_last: true
      concat_sampling_probabilities:
      - 1
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: false
      truncation_field: input
      index_mapping_dir: null
      prompt_template: '{input} {output}'
      hf_dataset: false
      truncation_method: right
    validation_ds:
      file_names:
      - /workspace/self_instruct_data/validation.jsonl
      names: null
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: false
      num_workers: 0
      memmap_workers: ${model.data.train_ds.memmap_workers}
      pin_memory: true
      max_seq_length: 512
      min_seq_length: 1
      drop_last: false
      label_key: ${model.data.train_ds.label_key}
      add_eos: ${model.data.train_ds.add_eos}
      add_sep: ${model.data.train_ds.add_sep}
      add_bos: ${model.data.train_ds.add_bos}
      write_predictions_to_file: false
      output_file_path_prefix: null
      truncation_field: ${model.data.train_ds.truncation_field}
      index_mapping_dir: null
      prompt_template: ${model.data.train_ds.prompt_template}
      tokens_to_generate: 32
      hf_dataset: false
      truncation_method: right
      metric:
        name: loss
        average: null
        num_classes: null
    test_ds:
      file_names:
      - /workspace/self_instruct_data/test.jsonl
      names: null
      global_batch_size: 256
      micro_batch_size: 1
      shuffle: false
      num_workers: 0
      memmap_workers: ${model.data.train_ds.memmap_workers}
      pin_memory: true
      max_seq_length: ${model.data.train_ds.max_seq_length}
      min_seq_length: 1
      drop_last: false
      label_key: ${model.data.train_ds.label_key}
      add_eos: ${model.data.train_ds.add_eos}
      add_sep: ${model.data.train_ds.add_sep}
      add_bos: ${model.data.train_ds.add_bos}
      write_predictions_to_file: false
      output_file_path_prefix: null
      truncation_field: ${model.data.train_ds.truncation_field}
      index_mapping_dir: null
      prompt_template: ${model.data.train_ds.prompt_template}
      tokens_to_generate: 32
      hf_dataset: false
      truncation_method: right
      metric:
        name: loss
        average: null
        num_classes: null
  optim:
    name: distributed_fused_adam
    lr: 5.0e-06
    weight_decay: 0.01
    betas:
    - 0.9
    - 0.98
inference:
  greedy: true
  top_k: 0
  top_p: 0.9
  temperature: 1.0
  all_probs: false
  repetition_penalty: 1.2
  min_tokens_to_generate: 0
  compute_logprob: false
  compute_attention_mask: true
cluster_type: BCP
[NeMo W 2024-05-23 11:06:21 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:554: UserWarning: bf16 is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo E 2024-05-23 11:06:21 exp_manager:556] You are running multi-node training without SLURM handling the processes. Please note that this is not tested in NeMo and could result in errors.
[NeMo W 2024-05-23 11:06:21 exp_manager:708] Exp_manager is logging to /workspace/result, but it already exists.
[NeMo W 2024-05-23 11:06:21 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/result/checkpoints. Training from scratch.
[NeMo I 2024-05-23 11:06:21 exp_manager:396] Experiments will be logged at /workspace/result
[NeMo I 2024-05-23 11:06:21 exp_manager:856] TensorboardLogger has been set up
[NeMo W 2024-05-23 11:06:21 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:213] Resuming training from checkpoint: None
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[NeMo I 2024-05-23 11:06:28 megatron_init:253] Rank 0 has data parallel group : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:259] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:264] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:267] Ranks 0 has data parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:284] Rank 0 has context parallel group: [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:287] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:288] Ranks 0 has context parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:299] Rank 0 has model parallel group: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[NeMo I 2024-05-23 11:06:28 megatron_init:300] All model parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:310] Rank 0 has tensor model parallel group: [0, 1, 2, 3]
[NeMo I 2024-05-23 11:06:28 megatron_init:314] All tensor model parallel group ranks: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:315] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:344] Rank 0 has pipeline model parallel group: [0, 4, 8, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:356] Rank 0 has embedding group: [0, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:362] All pipeline model parallel group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:363] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-05-23 11:06:28 megatron_init:364] All embedding group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:365] Rank 0 has embedding rank: 0
24-05-23 11:06:28 - PID:154683 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 128
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-05-23 11:06:28 tokenizer_utils:185] Getting SentencePiece with model: /tmp/tmpyuitwp3o/a290efe8ded54b8da6a27eb8ecea4895_tokenizer.model
[NeMo I 2024-05-23 11:06:28 megatron_base_model:574] Padded vocab_size: 32256, original vocab_size: 32000, dummy tokens: 256.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:489] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: add_qkv_bias in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: rotary_interleaved in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:12312.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:12312.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 12312.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/16
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/16
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/16
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
Matplotlib created a temporary cache directory at /tmp/matplotlib-518pojrm because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-qze86_xq because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-rrzai_1n because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-449boyjo because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-9w_wgl4h because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-d5pwia0k because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-w0euwkph because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-wgywjpl6 because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
… the log keeps retrying like this indefinitely and training never starts.
@ericharper @maanug-nv, following up on the issue posted above. Please let me know if any other information is needed.
Hi @PurvangL, I see you've closed this issue; were you able to resolve it? I haven't had time to reproduce this issue with SFT, but I've encountered long init times with pretraining that can look like hangs before training eventually starts. Sorry for the lack of response; if I can get around to reproducing this specific case, I'll let you know. We are also looking into these long init times.
Hi @maanug-nv, removing the NCCL_P2P_LEVEL=NVL (or PIX) environment variable and increasing the per-process memory limit to unlimited helped.
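In case it helps others hitting the same hang, here is a minimal sketch of those two changes, assuming a Docker-based multi-node launch; the exact flags and paths are illustrative, not the commands used in this issue:

```bash
# 1) Don't force a particular NCCL peer-to-peer level; let NCCL auto-detect.
unset NCCL_P2P_LEVEL          # i.e. do not export NCCL_P2P_LEVEL=NVL or =PIX

# 2) Start the NeMo container with unlimited locked memory and host IPC,
#    which is the usual way to lift per-process memory limits inside Docker.
docker run --rm -it --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /workspace:/workspace \
  nvcr.io/nvidia/nemo:24.03.01.framework bash
```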
Describe the bug
I am following the guide to fine-tune the llama2-7B model on 2 nodes (H100).
My training hangs at the dataloader sanity check.
Steps/Code to reproduce bug
Docker image: nvcr.io/nvidia/nemo:24.03.01.framework. I follow the guide to run llama2-7B and launch the same training command on each node.
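For reference, a hypothetical per-node launch that would produce the parallelism layout dumped in the log above (2 nodes x 8 GPUs, TP=4 x PP=4 = 16 model-parallel ranks, hence data-parallel size 1 and 128 micro-batches per global batch) might look like the sketch below; the script path, rendezvous variables, and override list are assumptions for illustration, not the exact command used here:

```bash
# Illustrative sketch only; adjust paths and overrides to your setup.
export MASTER_ADDR=10.0.0.1   # placeholder: address of the first node
export NODE_RANK=0            # 0 on the first node, 1 on the second

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=12312 \
  /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
  trainer.devices=8 trainer.num_nodes=2 trainer.precision=bf16 trainer.max_steps=50 \
  model.restore_from_path=/workspace/llama27b.nemo \
  model.tensor_model_parallel_size=4 model.pipeline_model_parallel_size=4 \
  model.global_batch_size=128 model.micro_batch_size=1 \
  'model.data.train_ds.file_names=[/workspace/self_instruct_data/training.jsonl]' \
  'model.data.validation_ds.file_names=[/workspace/self_instruct_data/validation.jsonl]' \
  exp_manager.explicit_log_dir=/workspace/result
```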
Expected behavior
Training should proceed past the dataloader sanity check and start fine-tuning instead of hanging.
Environment overview
docker pull & docker run commands used: docker pull nvcr.io/nvidia/nemo:24.03.01.framework
Environment details
The NVIDIA Docker image above is used, so no further environment details should be needed.
Additional context
GPU model: 16x H100 (2 nodes with 8 GPUs each).
Please let me know if any other information is needed. Thank you.