KOVVURISATYANARAYANAREDDY opened this issue 1 year ago
I am trying to run the StarCoder pretraining code (`/examples/pretrain_bigcode_model.slurm`). I created a custom `pretrain_starcoder.sh` file:
```bash
#!/bin/bash
GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# File path setup
CHECKPOINT_PATH=/home/jupyter/Satya/Megatron/Model_starcoder/
TOKENIZER_FILE=/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
#WEIGHTS_TRAIN=/fsx/loubna/code/bigcode-data-mix/data/train_data_paths.txt.tmp
#WEIGHTS_VALID=/fsx/loubna/code/bigcode-data-mix/data/valid_data_paths.txt.tmp

mkdir -p $CHECKPOINT_PATH/tensorboard

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

GPT_ARGS="\
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 6144 \
--num-attention-heads 48 \
--attention-head-type multiquery \
--init-method-std 0.01275 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 512 \
--lr 0.0003 \
--min-lr 0.00003 \
--train-iters 250000 \
--lr-decay-iters 250000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--fim-rate 0.5 \
--log-interval 10 \
--save-interval 2500 \
--eval-interval 2500 \
--eval-iters 2 \
--use-distributed-optimizer \
--valid-num-workers 0 \
"

TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"

export NCCL_DEBUG=INFO

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt.py \
    $GPT_ARGS \
    --tokenizer-type TokenizerFromFile \
    --tokenizer-file $TOKENIZER_FILE \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    #--train-weighted-split-paths-path $WEIGHTS_TRAIN \
    #--valid-weighted-split-paths-path $WEIGHTS_VALID \
    --structured-logs \
    --structured-logs-dir $CHECKPOINT_PATH/logs \
    $TENSORBOARD_ARGS \
    --wandb-entity-name loubnabnl \
    --wandb-project-name bigcode-pretraining \
```
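A side note on the launch command above (plain bash behaviour, nothing Megatron-specific): a line that begins with `#` inside a backslash-continued command ends the command there, because the comment also swallows its own trailing `\`. That is why the output below ends with `--structured-logs: command not found`, and why none of the options after the commented-out lines reach `pretrain_gpt.py` (the subprocess command in the traceback stops at `--load`). A minimal sketch, using a hypothetical `echo` stand-in for the real launcher:

```bash
#!/bin/bash
# Hypothetical stand-in for the real launcher: the commented-out option
# below terminates the backslash-continued command, so bash runs
# "echo my-launcher --load /tmp/ckpt" and then tries to execute
# "--structured-logs ..." as a separate command, which fails with
# "--structured-logs: command not found".
echo my-launcher \
    --load /tmp/ckpt \
    #--train-weighted-split-paths-path /tmp/train.txt \
    --structured-logs \
    --structured-logs-dir /tmp/logs
```

Deleting the commented-out lines (or moving them out of the continued command) keeps the continuation intact.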
I haven't set the data path yet.
My current versions are:

- CUDA: 11.0
- PyTorch: 1.7.0 (I only found 1.7.1 and 1.7.0 for CUDA 11.0)
- apex: 1.0
- gcc: 9.4.0 (Ubuntu 9.4.0-1ubuntu1~18.04)
- nvcc: Cuda compilation tools, release 11.0, V11.0.221 (Build cuda_11.0_bu.TC445_37.28845127_0)
- 2 AWS A100 GPUs

`nvidia-smi`:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:20:1C.0 Off |                    0 |
| N/A   24C    P0    53W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:A0:1D.0 Off |                    0 |
| N/A   25C    P0    50W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```
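For reference, the environment can be double-checked from inside the conda env with standard PyTorch/toolchain queries; a quick sketch (nothing Megatron-specific is assumed here):

```bash
#!/bin/bash
# Environment sanity check using only standard PyTorch attributes.
python -c "import torch; print('torch version:', torch.__version__)"
python -c "import torch; print('torch built for CUDA:', torch.version.cuda)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; print('GPU capability:', torch.cuda.get_device_capability(0))"
# Toolchain picked up by the JIT extension build:
nvcc --version | tail -n 2
gcc --version | head -n 1
```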
When I run `$ bash ./examples/pretrain_starcoder.sh`, I get the following output:

```
Wandb import failed Wandb import failed using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:TokenizerFromFile accumulate and all-reduce gradients in fp32 for bfloat16 data type. using torch.bfloat16 for parameters ... Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11). Defaulting to no_persist_layer_norm=True ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. True adam_beta1 ...................................... 0.9 adam_beta2 ...................................... 0.95 adam_eps ........................................ 1e-08 adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 apply_query_key_layer_scaling ................... True apply_residual_connection_post_layernorm ........ False async_tensor_model_parallel_allreduce ........... True attention_dropout ............................... 0.1 attention_head_type ............................. multiquery attention_softmax_in_fp32 ....................... False bert_binary_head ................................ True bert_load ....................................... None bf16 ............................................ True bias_dropout_fusion ............................. True bias_gelu_fusion ................................ True biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None classes_fraction ................................ 1.0 clip_grad ....................................... 1.0 consumed_train_samples .......................... 0 consumed_valid_samples .......................... 0 data_impl ....................................... infer data_parallel_random_init ....................... False data_parallel_size .............................. 1 data_path ....................................... None data_per_class_fraction ......................... 1.0 data_sharding ................................... True dataloader_type ................................. single DDP_impl ........................................ local decoder_seq_length .............................. None dino_bottleneck_size ............................ 256 dino_freeze_last_layer .......................... 1 dino_head_hidden_size ........................... 2048 dino_local_crops_number ......................... 10 dino_local_img_size ............................. 96 dino_norm_last_layer ............................ False dino_teacher_temp ............................... 0.07 dino_warmup_teacher_temp ........................ 0.04 dino_warmup_teacher_temp_epochs ................. 30 distribute_saved_activations .................... False distributed_backend ............................. nccl distributed_timeout ............................. 600 embedding_path .................................. None empty_unused_memory_level ....................... 0 encoder_seq_length .............................. 8192 end_weight_decay ................................ 0.1 eod_mask_loss ................................... False eval_interval ................................... 2500 eval_iters ...................................... 2 evidence_data_path .............................. None exit_duration_in_mins ........................... 
None exit_interval ................................... None exit_signal_handler ............................. False ffn_hidden_size ................................. 24576 fim_rate ........................................ 0.5 fim_spm_rate .................................... 0.5 finetune ........................................ False finetune_from ................................... None fp16 ............................................ False fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False global_batch_size ............................... 512 glu_activation .................................. None gradient_accumulation_fusion .................... True head_lr_mult .................................... 1.0 hidden_dropout .................................. 0.1 hidden_size ..................................... 6144 hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_h ........................................... 224 img_w ........................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 inference_batch_times_seqlen_threshold .......... 512 init_method_std ................................. 0.01275 init_method_xavier_uniform ...................... False initial_loss_scale .............................. 4294967296 iter_per_epoch .................................. 1250 kv_channels ..................................... 128 layernorm_epsilon ............................... 1e-05 lazy_mpu_init ................................... None load ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/ local_rank ...................................... 0 log_batch_size_to_tensorboard ................... False log_interval .................................... 10 log_learning_rate_to_tensorboard ................ True log_loss_scale_to_tensorboard ................... True log_memory_to_tensorboard ....................... False log_num_zeros_in_grad ........................... False log_params_norm ................................. False log_timers_to_tensorboard ....................... False log_validation_ppl_to_tensorboard ............... False log_world_size_to_tensorboard ................... False loss_scale ...................................... None loss_scale_window ............................... 1000 lr .............................................. 0.0003 lr_decay_iters .................................. 250000 lr_decay_samples ................................ None lr_decay_style .................................. cosine lr_warmup_fraction .............................. None lr_warmup_iters ................................. 2000 lr_warmup_samples ............................... 0 make_vocab_size_divisible_by .................... 128 mask_factor ..................................... 1.0 mask_prob ....................................... 0.15 mask_type ....................................... random masked_softmax_fusion ........................... True max_position_embeddings ......................... 8192 merge_file ...................................... None micro_batch_size ................................ 1 min_loss_scale .................................. 1.0 min_lr .......................................... 3e-05 mmap_warmup ..................................... 
False no_load_optim ................................... None no_load_rng ..................................... None no_persist_layer_norm ........................... True no_save_optim ................................... None no_save_rng ..................................... None num_attention_heads ............................. 48 num_channels .................................... 3 num_classes ..................................... 1000 num_experts ..................................... None num_layers ...................................... 40 num_layers_per_virtual_pipeline_stage ........... None num_workers ..................................... 2 onnx_safe ....................................... None openai_gelu ..................................... False optimizer ....................................... adam override_opt_param_scheduler .................... False params_dtype .................................... torch.bfloat16 patch_dim ....................................... 16 perform_initialization .......................... True pipeline_model_parallel_size .................... 1 pipeline_model_parallel_split_rank .............. None position_embedding_type ......................... PositionEmbeddingType.absolute query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None rank ............................................ 0 recompute_granularity ........................... None recompute_method ................................ None recompute_num_layers ............................ 1 reset_attention_mask ............................ False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... False retriever_seq_length ............................ 256 sample_rate ..................................... 1.0 save ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/ save_interval ................................... 2500 scatter_gather_tensors_in_pipeline .............. True seed ............................................ 1234 seq_length ...................................... 8192 sequence_parallel ............................... False sgd_momentum .................................... 0.9 short_seq_prob .................................. 0.1 split ........................................... None standalone_embedding_stage ...................... False start_weight_decay .............................. 0.1 structured_logs ................................. False structured_logs_dir ............................. None swin_backbone_type .............................. tiny tensor_model_parallel_size ...................... 1 tensorboard_dir ................................. None tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 test_weighted_split_paths ....................... None test_weighted_split_paths_path .................. None titles_data_path ................................ None tokenizer_file .................................. /home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json tokenizer_type .................................. TokenizerFromFile train_iters ..................................... 250000 train_samples ................................... None train_weighted_split_paths ...................... None train_weighted_split_paths_path ................. None transformer_pipeline_model_parallel_size ........ 
1 transformer_timers .............................. False use_checkpoint_args ............................. False use_checkpoint_opt_param_scheduler .............. False use_contiguous_buffers_in_local_ddp ............. True use_cpu_initialization .......................... None use_distributed_optimizer ....................... True use_flash_attn .................................. True use_one_sent_docs ............................... False valid_num_workers ............................... 0 valid_weighted_split_paths ...................... None valid_weighted_split_paths_path ................. None virtual_pipeline_model_parallel_size ............ None vision_backbone_type ............................ vit vision_pretraining .............................. False vision_pretraining_type ......................... classify vocab_extra_ids ................................. 0 vocab_file ...................................... None wandb_entity_name ............................... None wandb_project_name .............................. None weight_decay .................................... 0.1 weight_decay_incr_style ......................... constant world_size ...................................... 1 -------------------- end of arguments --------------------- setting number of micro-batches to constant 512 > building TokenizerFromFile tokenizer ... > padded vocab (size: 49152) with 0 dummy tokens (new size: 49152) 05:15:56.69 >>> Call to _initialize_distributed in File "/tmp/Megatron/megatron/initialize.py", line 220 05:15:56.69 220 | def _initialize_distributed(): 05:15:56.69 222 | args = get_args() 05:15:56.69 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1) 05:15:56.69 224 | device_count = torch.cuda.device_count() 05:15:56.69 .......... device_count = 2 05:15:56.69 225 | if torch.distributed.is_initialized(): 05:15:56.69 235 | if args.rank == 0: 05:15:56.69 236 | print('> initializing torch distributed ...', flush=True) > initializing torch distributed ... 05:15:56.69 238 | if device_count > 0: 05:15:56.69 239 | device = args.rank % device_count 05:15:56.69 .................. device = 0 05:15:56.69 240 | if args.local_rank is not None: 05:15:56.69 241 | assert args.local_rank == device, \ 05:15:56.69 245 | torch.cuda.set_device(device) 05:15:56.70 249 | torch.distributed.init_process_group( 05:15:56.70 250 | backend="gloo",#args.distributed_backend, 05:15:56.70 251 | world_size=args.world_size, rank=args.rank, 05:15:56.70 252 | timeout=timedelta(seconds=args.distributed_timeout)) 05:15:56.70 249 | torch.distributed.init_process_group( 05:15:56.70 256 | if device_count > 0: 05:15:56.70 257 | if mpu.model_parallel_is_initialized(): 05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size, 05:15:56.70 261 | args.pipeline_model_parallel_size, 05:15:56.70 262 | args.virtual_pipeline_model_parallel_size, 05:15:56.70 263 | args.pipeline_model_parallel_split_rank) 05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size, > initializing tensor model parallel with size 1 > initializing pipeline model parallel with size 1 05:15:56.70 <<< Return value from _initialize_distributed: None > setting random seeds to 1234 ... 
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 05:15:56.70 >>> Call to _compile_dependencies in File "/tmp/Megatron/megatron/initialize.py", line 160 05:15:56.70 160 | def _compile_dependencies(): 05:15:56.70 162 | args = get_args() 05:15:56.73 >>> Call to get_args in File "/tmp/Megatron/megatron/global_vars.py", line 38 05:15:56.73 38 | def get_args(): 05:15:56.73 40 | _ensure_var_is_initialized(_GLOBAL_ARGS, 'args') 05:15:56.73 41 | return _GLOBAL_ARGS 05:15:56.73 <<< Return value from get_args: Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1) 05:15:56.73 162 | args = get_args() 05:15:56.73 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1) 05:15:56.73 168 | if torch.distributed.get_rank() == 0: 05:15:56.84 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584 05:15:56.84 ...... group = <object object at 0x7fe25503e6c0> 05:15:56.84 584 | def get_rank(group=group.WORLD): 05:15:56.84 600 | if _rank_not_in_group(group): 05:15:56.84 603 | _check_default_pg() 05:15:56.84 604 | if group == GroupMember.WORLD: 05:15:56.84 605 | return _default_pg.rank() 05:15:56.84 <<< Return value from get_rank: 0 05:15:56.84 168 | if torch.distributed.get_rank() == 0: 05:15:56.84 169 | start_time = time.time() 05:15:56.84 .............. start_time = 1686719756.846662 05:15:56.84 170 | print('> compiling dataset index builder ...') > compiling dataset index builder ... 05:15:56.84 171 | from megatron.data.dataset_utils import compile_helper 05:15:56.84 .............. compile_helper = <function compile_helper at 0x7fe24b749280> 05:15:56.84 172 | compile_helper() 05:15:56.92 >>> Call to compile_helper in File "/tmp/Megatron/megatron/data/dataset_utils.py", line 81 05:15:56.92 81 | def compile_helper(): 05:15:56.92 84 | import os 05:15:56.92 .......... os = <module 'os' from '/opt/conda/envs/starcoder/lib/python3.8/os.py'> 05:15:56.92 85 | import subprocess 05:15:56.92 .......... subprocess = <module 'subprocess' from '/opt/conda/envs/starcoder/lib/python3.8/subprocess.py'> 05:15:56.92 86 | path = os.path.abspath(os.path.dirname(__file__)) 05:15:56.92 .......... path = '/tmp/Megatron/megatron/data' 05:15:56.92 87 | ret = subprocess.run(['make', '-C', path]) make: Entering directory '/tmp/Megatron/megatron/data' make: Nothing to be done for 'default'. make: Leaving directory '/tmp/Megatron/megatron/data' 05:15:56.96 .......... ret = CompletedProcess(args=['make', '-C', '/tmp/Megatron/megatron/data'], returncode=0) 05:15:56.96 88 | if ret.returncode != 0: 05:15:56.96 <<< Return value from compile_helper: None 05:15:56.96 172 | compile_helper() 05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} ' 05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True) 05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} ' 05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True) 05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} ' >>> done with dataset index builder. Compilation time: 0.114 seconds 05:15:56.96 181 | seq_len = args.seq_length 05:15:56.96 .......... 
seq_len = 8192 05:15:56.96 182 | attn_batch_size = \ 05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \ 05:15:56.96 184 | args.micro_batch_size 05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \ 05:15:56.96 182 | attn_batch_size = \ 05:15:56.96 .......... attn_batch_size = 48.0 05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \ 05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0 05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \ 05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0 05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \ 05:15:56.96 .......... custom_kernel_constraint = True 05:15:56.96 190 | if not ((args.fp16 or args.bf16) and 05:15:56.96 191 | custom_kernel_constraint and 05:15:56.96 190 | if not ((args.fp16 or args.bf16) and 05:15:56.96 192 | args.masked_softmax_fusion): 05:15:56.96 190 | if not ((args.fp16 or args.bf16) and 05:15:56.96 199 | if torch.distributed.get_rank() == 0: 05:15:56.96 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584 05:15:56.96 ...... group = <object object at 0x7fe25503e6c0> 05:15:56.96 584 | def get_rank(group=group.WORLD): 05:15:56.96 600 | if _rank_not_in_group(group): 05:15:56.96 603 | _check_default_pg() 05:15:56.96 604 | if group == GroupMember.WORLD: 05:15:56.96 605 | return _default_pg.rank() 05:15:56.96 <<< Return value from get_rank: 0 05:15:56.96 199 | if torch.distributed.get_rank() == 0: 05:15:56.96 200 | start_time = time.time() 05:15:56.96 .............. start_time = 1686719756.9662645 05:15:56.96 201 | print('> compiling and loading fused kernels ...', flush=True) > compiling and loading fused kernels ... 05:15:56.96 202 | fused_kernels.load(args) 05:15:56.96 >>> Call to load in File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 4 05:15:56.96 ...... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1) 05:15:56.96 4 | def load(args): 05:15:56.96 5 | if torch.version.hip is None: 05:15:56.96 6 | print("running on CUDA devices") running on CUDA devices 05:15:56.96 7 | from megatron.fused_kernels.cuda import load as load_kernels 05:15:58.87 .............. load_kernels = <function load at 0x7fe2422201f0> 05:15:58.87 12 | load_kernels(args) Detected CUDA files, patching ldflags Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja... Building extension module scaled_upper_triang_masked_softmax_cuda... Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N) [1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (const char *const) detected during: instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<const char *const &>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(1375): here instantiation of "__nv_bool pybind11::detail::object_api<Derived>::contains(T &&) const [with Derived=pybind11::handle, T=const char *const &]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/detail/internals.h(176): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) 
const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(201): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::handle, pybind11::handle) detected during: instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle &, pybind11::handle &>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::handle &, pybind11::handle &>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(755): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::handle, pybind11::handle, pybind11::none, pybind11::str) detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle, pybind11::handle, pybind11::none, pybind11::str>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(971): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::object, const pybind11::handle) detected during: instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &, const pybind11::handle &>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::object &, const pybind11::handle &>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1401): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::cpp_function) detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) 
const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1407): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::cpp_function, pybind11::none, pybind11::none, const char [1]) detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function, pybind11::none, pybind11::none, const char (&)[1]>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1418): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::tuple) detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::tuple &>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1812): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::object) detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1830): here /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list argument types are: (pybind11::object) detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object>]" /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1831): here 10 errors detected in the compilation of "/tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu". ninja: build stopped: subcommand failed. 05:16:05.35 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda' 05:16:05.35 !!! When calling: load_kernels(args) 05:16:05.35 !!! Call ended by exception 05:16:05.35 202 | fused_kernels.load(args) 05:16:05.39 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda' 05:16:05.39 !!! When calling: fused_kernels.load(args) 05:16:05.39 !!! 
Call ended by exception Traceback (most recent call last): File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build subprocess.run( File "/opt/conda/envs/starcoder/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "pretrain_gpt.py", line 158, in <module> pretrain(train_valid_test_datasets_provider, model_provider, File "/tmp/Megatron/megatron/training.py", line 107, in pretrain initialize_megatron(extra_args_provider=extra_args_provider, File "/tmp/Megatron/megatron/initialize.py", line 106, in initialize_megatron _compile_dependencies() File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/snoop/tracer.py", line 173, in simple_wrapper return function(*args, **kwargs) File "/tmp/Megatron/megatron/initialize.py", line 202, in _compile_dependencies fused_kernels.load(args) File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 12, in load load_kernels(args) File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 70, in load scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper( File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 42, in _cpp_extention_load_helper return cpp_extension.load( File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load return _jit_compile( File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile _write_ninja_file_and_build_library( File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library _run_ninja_build( File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build raise RuntimeError(message) from e RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda' Traceback (most recent call last): File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module> main() File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main raise subprocess.CalledProcessError(returncode=process.returncode, subprocess.CalledProcessError: Command '['/opt/conda/envs/starcoder/bin/python', '-u', 'pretrain_gpt.py', '--local_rank=0', '--tensor-model-parallel-size', '1', '--pipeline-model-parallel-size', '1', '--num-layers', '40', '--hidden-size', '6144', '--num-attention-heads', '48', '--attention-head-type', 'multiquery', '--init-method-std', '0.01275', '--seq-length', '8192', '--max-position-embeddings', '8192', '--attention-dropout', '0.1', '--hidden-dropout', '0.1', '--micro-batch-size', '1', '--global-batch-size', '512', '--lr', '0.0003', '--min-lr', '0.00003', '--train-iters', '250000', '--lr-decay-iters', '250000', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '2000', '--weight-decay', '.1', '--adam-beta2', '.95', '--clip-grad', '1.0', '--bf16', '--use-flash-attn', '--fim-rate', '0.5', '--log-interval', '10', 
'--save-interval', '2500', '--eval-interval', '2500', '--eval-iters', '2', '--use-distributed-optimizer', '--valid-num-workers', '0', '--tokenizer-type', 'TokenizerFromFile', '--tokenizer-file', '/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json', '--save', '/home/jupyter/Satya/Megatron/Model_starcoder/', '--load', '/home/jupyter/Satya/Megatron/Model_starcoder/']' returned non-zero exit status 1. examples/pretrain_starcoder.sh: line 75: --structured-logs: command not found
```

In the run above I also traced the code with snoop. Below is the main error:
```
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers...
(overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
```
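Before digging into the compiler error itself, one low-risk step is to clear the cached ninja build directory shown in the log and force a clean rebuild (paths taken from the log above; `MAX_JOBS` is the knob the log itself mentions). This is only a sketch to rule out a stale or partial JIT build cache, not a confirmed fix:

```bash
#!/bin/bash
# Remove the cached JIT build directory reported in the log, then re-run
# so the fused kernels are rebuilt from scratch with fewer ninja workers.
rm -rf /tmp/Megatron/megatron/fused_kernels/cuda/build
MAX_JOBS=2 bash ./examples/pretrain_starcoder.sh
```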