epfLLM / Megatron-LLM

distributed trainer for LLMs

Getting started "shard" model not working #70

Closed philschmid closed 9 months ago

philschmid commented 10 months ago

First of all, thank you for creating this project! It looks very exciting and interesting due to its close Hugging Face integration. I am very curious and wanted to give it a try, following the Getting Started guide in the documentation, but I ran into an error during the "Model Sharding" step, resulting in a Bus error (core dumped).

I am running on a single node with 8x A100 80GB and 1TB of memory. I followed the exact same steps as in the guide and used the container.

Below is the full error stack in case it's helpful. It includes quite a lot of weird C errors/warnings at the beginning. I installed the package with

cd Megatron-LLM
pip install -r requirements.txt
cd megatron/data/
make
cd ../../

in the container.

Error Stack ```ba /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::Half]’: /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:95: required from here /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 768 | cuComputeGradInput<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::BFloat16]’: /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:103: required from here /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. 
[-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 768 | cuComputeGradInput<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::Half; U = float; V = c10::Half]’: /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:127: required from here /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. 
[-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 768 | cuComputeGradInput<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::BFloat16; U = float; V = c10::BFloat16]’: /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:138: required from here /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. 
[-Wdeprecated-declarations] 737 | cuComputePartGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 750 | cuComputeGradGammaBeta<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ /epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 768 | cuComputeGradInput<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here 245 | T * data() const { | ^ ~~ [3/3] c++ layer_norm_cuda.o layer_norm_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_mix_prec_layer_norm_cuda.so Loading extension module fused_mix_prec_layer_norm_cuda... Detected CUDA files, patching ldflags Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja... Building extension module fused_dense_cuda... Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N) [1/3] c++ -MMD -MF fused_weight_gradient_dense.o.d -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cpp -o fused_weight_gradient_dense.o [2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -gencode arch=compute_80,code=sm_80 -std=c++17 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cu -o fused_weight_gradient_dense.cuda.o [3/3] c++ fused_weight_gradient_dense.o fused_weight_gradient_dense.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_dense_cuda.so Loading extension module fused_dense_cuda... Building model ... /epfllm/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout warnings.warn( "Llama is not intended to use dropout") /epfllm/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout warnings.warn( "Llama is not intended to use dropout") loading release checkpoint from ./model checkpoint version 3.0 successfully loaded checkpoint from ./model at iteration 0 using world size: 4, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 1 setting global batch size to 1 accumulate and all-reduce gradients in fp32 for bfloat16 data type. using torch.bfloat16 for parameters ... ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. True adam_beta1 ...................................... 0.9 adam_beta2 ...................................... 0.999 adam_eps ........................................ 1e-08 adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 apply_query_key_layer_scaling ................... True apply_residual_connection_post_layernorm ........ False async_tensor_model_parallel_allreduce ........... True attention_dropout ............................... 
0.1 attention_softmax_in_fp32 ....................... False barrier_with_L1_time ............................ True bert_load ....................................... None bf16 ............................................ True bias_dropout_fusion ............................. False bias_gelu_fusion ................................ False biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None classes_fraction ................................ 1.0 clip_grad ....................................... 1.0 consumed_train_samples .......................... 0 consumed_valid_samples .......................... 0 data_impl ....................................... infer data_parallel_random_init ....................... False data_parallel_size .............................. 1 data_path ....................................... None data_per_class_fraction ......................... 1.0 data_sharding ................................... True dataloader_type ................................. single DDP_impl ........................................ local decoder_num_layers .............................. None decoder_seq_length .............................. None dino_bottleneck_size ............................ 256 dino_freeze_last_layer .......................... 1 dino_head_hidden_size ........................... 2048 dino_local_crops_number ......................... 10 dino_local_img_size ............................. 96 dino_norm_last_layer ............................ False dino_teacher_temp ............................... 0.07 dino_warmup_teacher_temp ........................ 0.04 dino_warmup_teacher_temp_epochs ................. 30 distribute_saved_activations .................... False distributed_backend ............................. nccl embedding_path .................................. None empty_unused_memory_level ....................... 0 encoder_num_layers .............................. 32 encoder_seq_length .............................. 4096 end_weight_decay ................................ 0.01 eod_mask_loss ................................... False eval_interval ................................... 1000 eval_iters ...................................... 100 evidence_data_path .............................. None exit_duration_in_mins ........................... None exit_interval ................................... None exit_signal_handler ............................. False ffn_hidden_size ................................. 11008 finetune ........................................ False fp16 ............................................ False fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False fp8_amax_compute_algo ........................... most_recent fp8_amax_history_len ............................ 1 fp8_e4m3 ........................................ False fp8_hybrid ...................................... False fp8_interval .................................... 1 fp8_margin ...................................... 0 fp8_wgrad ....................................... True global_batch_size ............................... 1 glu_activation .................................. swiglu gradient_accumulation_fusion .................... True head_lr_mult .................................... 1.0 hidden_dropout .................................. 0.1 hidden_size ..................................... 
4096 hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_h ........................................... 224 img_w ........................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 inference_batch_times_seqlen_threshold .......... 512 init_method_std ................................. 0.02 init_method_xavier_uniform ...................... False initial_loss_scale .............................. 4294967296 iter_per_epoch .................................. 1250 kv_channels ..................................... 128 layernorm_epsilon ............................... 1e-05 lima_dropout .................................... False load ............................................ None local_rank ...................................... None log_batch_size_to_tensorboard ................... False log_interval .................................... 100 log_memory_to_tensorboard ....................... False log_num_zeros_in_grad ........................... False log_params_norm ................................. False log_timers_to_tensorboard ....................... False log_validation_ppl_to_tensorboard ............... False log_world_size_to_tensorboard ................... False loss_scale ...................................... None loss_scale_window ............................... 1000 lr .............................................. None lr_decay_iters .................................. None lr_decay_samples ................................ None lr_decay_style .................................. linear lr_warmup_fraction .............................. None lr_warmup_iters ................................. 0 lr_warmup_samples ............................... 0 make_vocab_size_divisible_by .................... 128 mask_prob ....................................... 0.15 masked_softmax_fusion ........................... False max_position_embeddings ......................... 4096 max_tokens_to_oom ............................... 12000 merge_file ...................................... None metrics ......................................... [] micro_batch_size ................................ 1 min_loss_scale .................................. 1.0 min_lr .......................................... 0.0 mmap_warmup ..................................... False new_tokens ...................................... True no_load_optim ................................... True no_load_rng ..................................... True no_persist_layer_norm ........................... False no_save_optim ................................... True no_save_rng ..................................... True num_attention_heads ............................. 32 num_attention_heads_kv .......................... 32 num_channels .................................... 3 num_classes ..................................... 1000 num_layers ...................................... 32 num_layers_per_virtual_pipeline_stage ........... None num_workers ..................................... 2 onnx_safe ....................................... None optimizer ....................................... adam override_opt_param_scheduler .................... False parallel_attn ................................... False parallel_layernorm .............................. False params_dtype .................................... 
torch.bfloat16 patch_dim ....................................... 16 perform_initialization .......................... False pipeline_model_parallel_size .................... 1 pipeline_model_parallel_split_rank .............. None position_embedding_type ......................... PositionEmbeddingType.rotary query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None rank ............................................ 0 recompute_granularity ........................... None recompute_method ................................ None recompute_num_layers ............................ 1 reset_attention_mask ............................ False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... False retriever_seq_length ............................ 256 rope_scaling_factor ............................. 1.0 rope_theta ...................................... 10000.0 sample_rate ..................................... 1.0 save ............................................ ./model_sharded save_interval ................................... 1 scalar_loss_mask ................................ 0.0 scatter_gather_tensors_in_pipeline .............. True seed ............................................ 1234 seq_length ...................................... 4096 sequence_parallel ............................... False sgd_momentum .................................... 0.9 short_seq_prob .................................. 0.1 skip_iters ...................................... [] split ........................................... 969, 30, 1 standalone_embedding_stage ...................... False start_weight_decay .............................. 0.01 tensor_model_parallel_size ...................... 4 tensorboard_dir ................................. None tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 test_data_path .................................. None tie_embed_logits ................................ False timing_log_level ................................ 0 timing_log_option ............................... minmax titles_data_path ................................ None tokenizer_model ................................. None tokenizer_type .................................. SentencePieceTokenizer train_data_path ................................. None train_iters ..................................... None train_samples ................................... None transformer_impl ................................ local transformer_pipeline_model_parallel_size ........ 1 use_bias ........................................ False use_checkpoint_args ............................. False use_checkpoint_opt_param_scheduler .............. False use_contiguous_buffers_in_local_ddp ............. True use_cpu_initialization .......................... True use_distributed_optimizer ....................... False use_flash_attn .................................. False use_one_sent_docs ............................... False use_post_ln ..................................... False use_ring_exchange_p2p ........................... False use_rms_norm .................................... True valid_data_path ................................. None variable_seq_lengths ............................ False virtual_pipeline_model_parallel_size ............ None vocab_extra_ids ................................. 
0 vocab_extra_ids_list ............................ None vocab_file ...................................... None wandb_api_key ................................... None wandb_entity .................................... meditron wandb_id ........................................ None wandb_logger .................................... False wandb_project ................................... None wandb_resume .................................... False weight_decay .................................... 0.01 weight_decay_incr_style ......................... constant world_size ...................................... 4 -------------------- end of arguments --------------------- setting number of micro-batches to constant 1 Setting consumed_train_samples to 0 and consumed_valid_samples to 0 sending embeddings sending lm_head Detected CUDA files, patching ldflags Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja... sending transformer layer 0 Building extension module fused_mix_prec_layer_norm_cuda... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_mix_prec_layer_norm_cuda... Detected CUDA files, patching ldflags Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja... Building extension module fused_dense_cuda... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_dense_cuda... Bus error (core dumped) ```
kylematoba commented 10 months ago

Thank you for your interest in our project.

The Apex compilation warnings are expected; I have seen these since the beginning.

The `warnings.warn("Llama is not intended to use dropout")` warnings are also fine. We should probably turn these off.

I can replicate your problem when following the docs as written (also using a single node with 8x A100 80GB).

When I invoke docker with the additional arguments

--shm-size=128gb \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--memory 480G

however, it runs as expected. Please try something like this.
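For reference, a minimal sketch of a full invocation with those flags; the image name and host mount path below are placeholders, not the exact values from the guide:

```bash
# Sketch only: substitute the image and paths used in the Getting Started guide.
docker run --gpus all -it --rm \
    --shm-size=128gb \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --memory 480G \
    -v /path/to/Megatron-LLM:/epfllm/Megatron-LLM \
    <megatron-llm-image> \
    bash
```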

kylematoba commented 10 months ago

@AleHD please can you add this, or at least a mention that nontrivial memory is needed to shard the weights, to the "Getting Started" section? Thanks!

philschmid commented 10 months ago

Thank you @kylematoba, that solved it for me. I managed to shard the model but ran into a different issue during training.

Traceback (most recent call last):
  File "/epfllm/./Megatron-LLM/finetune.py", line 249, in <module>
    pretrain(args, data_provider, model_provider,  ModelType.encoder_or_decoder,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 138, in pretrain
    iteration = _train(args,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 678, in _train
    train_step(forward_step_func,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 411, in train_step
    losses_reduced = forward_backward_func(
  File "/epfllm/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator,
  File "/epfllm/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/epfllm/./Megatron-LLM/finetune.py", line 213, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/module.py", line 186, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
    lm_output = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
    mlp_output, mlp_bias = self.mlp(layernorm_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
    bias_gelu_impl(intermediate_parallel, bias_parallel)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
    return bias_gelu(bias, input)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in fallback_function
@torch.jit.script
def bias_gelu(bias, y):
    x = bias + y
        ~~~~~~~~ <--- HERE
    return  x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
RuntimeError: Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'
kylematoba commented 10 months ago

Hi, I'm guessing that it's an OOM that's obfuscated by the JIT-ing. In cases like this I usually recommend commenting out the `@torch.jit.script` decorator to get a more helpful stack trace.
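One quick way to do that inside the container, assuming the checkout lives at /epfllm/Megatron-LLM as in your trace (editing the file by hand works just as well):

```bash
# Comment out the @torch.jit.script decorators in the fused bias-GeLU module so the
# plain Python functions run and the failure surfaces with an ordinary traceback.
sed -i 's/^@torch.jit.script/# @torch.jit.script/' \
    /epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py
```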

As far as I can see, you've not reported what sort of model you are trying to train. Did you look at https://epfllm.github.io/Megatron-LLM/guide/faq.html#what-are-the-basic-hardware-requirements? Only the smallest models can fit into 8x A100 80GB.

philschmid commented 10 months ago

Let me try commenting out the scripting.

I am following the getting started guide, so it's Llama 2 7B, and I have 8x A100 80GB.

That's my command:

LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 500 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --load ${MODEL_PATH}_sharded \
    --save ${MODEL_PATH}_sharded \
    --tensorboard_dir ${MODEL_PATH}_sharded \
    --data_path ${DATASET_PATH}/megatron_text_document \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file=${MODEL_PATH}/tokenizer.model \
    --bf16 \
    --use_flash_attn \
    --micro_batch_size 5 \
    --global_batch_size 1000 \
    --sequence_parallel \
    --recompute_granularity selective \
    --use_checkpoint_args \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS
philschmid commented 10 months ago

The error is not really more helpful...

TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
    mlp_output, mlp_bias = self.mlp(layernorm_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
    bias_gelu_impl(intermediate_parallel, bias_parallel)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
    return bias_gelu(bias, input)
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in bias_gelu
    x = bias + y
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'

Should the getting started guide: https://epfllm.github.io/Megatron-LLM/guide/getting_started.html work e2e?

kylematoba commented 10 months ago

Hi, thanks for that.

I'm pretty sure the problem is something that we overlooked early on: runs without `--no_bias_gelu_fusion` don't work. Please can you add that argument (as is done in the docs), and let me know how you get on?
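Concretely, that should look something like your command above with just the one extra flag (all variables and paths as in your own script):

```bash
# Same invocation as posted above, plus --no_bias_gelu_fusion to skip the fused
# bias-GeLU path (which trips over the missing bias, since Llama uses no biases).
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --load ${MODEL_PATH}_sharded \
    --save ${MODEL_PATH}_sharded \
    --tensorboard_dir ${MODEL_PATH}_sharded \
    --data_path ${DATASET_PATH}/megatron_text_document \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file=${MODEL_PATH}/tokenizer.model \
    --bf16 \
    --use_flash_attn \
    --no_bias_gelu_fusion \
    --micro_batch_size 5 \
    --global_batch_size 1000 \
    --sequence_parallel \
    --recompute_granularity selective \
    --use_checkpoint_args \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS
```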

I'll make sure this bug gets investigated in any case.

philschmid commented 9 months ago

Adding `--no_bias_gelu_fusion` solved the issue and it is training now. Thank you for your help! I will play with it more and hopefully publish a blog post on it!

kylematoba commented 9 months ago

Thanks @philschmid. I'll close this and we'll fix the bug I mention above shortly.