bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda' #61

Open KOVVURISATYANARAYANAREDDY opened 1 year ago

KOVVURISATYANARAYANAREDDY commented 1 year ago

I am trying to run the StarCoder pretraining code (/examples/pretrain_bigcode_model.slurm). I created a custom pretrain_starcoder.sh file:

      #!/bin/bash

      GPUS_PER_NODE=2
      # Change for multinode config
      MASTER_ADDR=localhost
      MASTER_PORT=6000
      NNODES=1
      NODE_RANK=0
      WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

      # File path setup
      CHECKPOINT_PATH=/home/jupyter/Satya/Megatron/Model_starcoder/
      TOKENIZER_FILE=/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
      #WEIGHTS_TRAIN=/fsx/loubna/code/bigcode-data-mix/data/train_data_paths.txt.tmp
      #WEIGHTS_VALID=/fsx/loubna/code/bigcode-data-mix/data/valid_data_paths.txt.tmp

      mkdir -p $CHECKPOINT_PATH/tensorboard

      DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

      GPT_ARGS="\
             --tensor-model-parallel-size 1 \
             --pipeline-model-parallel-size 1 \
             --sequence-parallel \
             --num-layers 40 \
             --hidden-size 6144 \
             --num-attention-heads 48 \
             --attention-head-type multiquery \
             --init-method-std 0.01275 \
             --seq-length 8192 \
             --max-position-embeddings 8192 \
             --attention-dropout 0.1 \
             --hidden-dropout 0.1 \
             --micro-batch-size 1 \
             --global-batch-size 512 \
             --lr 0.0003 \
             --min-lr 0.00003 \
             --train-iters 250000 \
             --lr-decay-iters 250000 \
             --lr-decay-style cosine \
             --lr-warmup-iters 2000 \
             --weight-decay .1 \
             --adam-beta2 .95 \
             --clip-grad 1.0 \
             --bf16 \
             --use-flash-attn \
             --fim-rate 0.5 \
             --log-interval 10 \
             --save-interval 2500 \
             --eval-interval 2500 \
             --eval-iters 2 \
             --use-distributed-optimizer \
             --valid-num-workers 0 \
      "

      TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"

      export NCCL_DEBUG=INFO
      python -m torch.distributed.launch $DISTRIBUTED_ARGS \
              pretrain_gpt.py \
              $GPT_ARGS \
          --tokenizer-type TokenizerFromFile \
          --tokenizer-file $TOKENIZER_FILE \
          --save $CHECKPOINT_PATH \
          --load $CHECKPOINT_PATH \
          #--train-weighted-split-paths-path $WEIGHTS_TRAIN \
          #--valid-weighted-split-paths-path $WEIGHTS_VALID \
          --structured-logs \
          --structured-logs-dir $CHECKPOINT_PATH/logs \
          $TENSORBOARD_ARGS \
          --wandb-entity-name loubnabnl \
          --wandb-project-name bigcode-pretraining \

I haven't set the data path yet.

My current versions are listed below; a quick environment-check sketch follows the nvidia-smi output.

 CUDA - 11.0
 PyTorch - 1.7.0 (I only found 1.7.1 and 1.7.0 for CUDA 11.0).
 apex - 1.0
  gcc --version
      gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
      Copyright (C) 2019 Free Software Foundation, Inc.
 nvcc --version
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2020 NVIDIA Corporation
      Built on Wed_Jul_22_19:09:09_PDT_2020
      Cuda compilation tools, release 11.0, V11.0.221
      Build cuda_11.0_bu.TC445_37.28845127_0
  2 AWS A100 GPUs.
 nvidia-smi
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  A100-SXM4-40GB      On   | 00000000:20:1C.0 Off |                    0 |
      | N/A   24C    P0    53W / 400W |      3MiB / 40537MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+
      |   1  A100-SXM4-40GB      On   | 00000000:A0:1D.0 Off |                    0 |
      | N/A   25C    P0    50W / 400W |      3MiB / 40537MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+
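As a quick check of the toolchain, here is a minimal sanity-check sketch (assuming the starcoder conda environment from the logs is active) to confirm that the CUDA version PyTorch was built against matches the local nvcc, since a mismatch there commonly breaks JIT-built fused kernels:

      # Sanity check: does PyTorch's CUDA build match the local toolkit (11.0 here)?
      # Assumes the 'starcoder' conda environment shown in the logs is active.
      import torch

      print("torch:", torch.__version__)                   # expected 1.7.0
      print("torch built with CUDA:", torch.version.cuda)  # should match `nvcc --version` (11.0)
      print("visible GPUs:", torch.cuda.device_count())
      for i in range(torch.cuda.device_count()):
          print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))

If torch.version.cuda and nvcc disagree, the fused-kernel extensions can fail to compile.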

When I run $ bash ./examples/pretrain_starcoder.sh, I get:

          Wandb import failed
          Wandb import failed
          using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
          WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:TokenizerFromFile
          accumulate and all-reduce gradients in fp32 for bfloat16 data type.
          using torch.bfloat16 for parameters ...
          Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11). Defaulting to no_persist_layer_norm=True
          ------------------------ arguments ------------------------
            accumulate_allreduce_grads_in_fp32 .............. True
            adam_beta1 ...................................... 0.9
            adam_beta2 ...................................... 0.95
            adam_eps ........................................ 1e-08
            adlr_autoresume ................................. False
            adlr_autoresume_interval ........................ 1000
            apply_query_key_layer_scaling ................... True
            apply_residual_connection_post_layernorm ........ False
            async_tensor_model_parallel_allreduce ........... True
            attention_dropout ............................... 0.1
            attention_head_type ............................. multiquery
            attention_softmax_in_fp32 ....................... False
            bert_binary_head ................................ True
            bert_load ....................................... None
            bf16 ............................................ True
            bias_dropout_fusion ............................. True
            bias_gelu_fusion ................................ True
            biencoder_projection_dim ........................ 0
            biencoder_shared_query_context_model ............ False
            block_data_path ................................. None
            classes_fraction ................................ 1.0
            clip_grad ....................................... 1.0
            consumed_train_samples .......................... 0
            consumed_valid_samples .......................... 0
            data_impl ....................................... infer
            data_parallel_random_init ....................... False
            data_parallel_size .............................. 1
            data_path ....................................... None
            data_per_class_fraction ......................... 1.0
            data_sharding ................................... True
            dataloader_type ................................. single
            DDP_impl ........................................ local
            decoder_seq_length .............................. None
            dino_bottleneck_size ............................ 256
            dino_freeze_last_layer .......................... 1
            dino_head_hidden_size ........................... 2048
            dino_local_crops_number ......................... 10
            dino_local_img_size ............................. 96
            dino_norm_last_layer ............................ False
            dino_teacher_temp ............................... 0.07
            dino_warmup_teacher_temp ........................ 0.04
            dino_warmup_teacher_temp_epochs ................. 30
            distribute_saved_activations .................... False
            distributed_backend ............................. nccl
            distributed_timeout ............................. 600
            embedding_path .................................. None
            empty_unused_memory_level ....................... 0
            encoder_seq_length .............................. 8192
            end_weight_decay ................................ 0.1
            eod_mask_loss ................................... False
            eval_interval ................................... 2500
            eval_iters ...................................... 2
            evidence_data_path .............................. None
            exit_duration_in_mins ........................... None
            exit_interval ................................... None
            exit_signal_handler ............................. False
            ffn_hidden_size ................................. 24576
            fim_rate ........................................ 0.5
            fim_spm_rate .................................... 0.5
            finetune ........................................ False
            finetune_from ................................... None
            fp16 ............................................ False
            fp16_lm_cross_entropy ........................... False
            fp32_residual_connection ........................ False
            global_batch_size ............................... 512
            glu_activation .................................. None
            gradient_accumulation_fusion .................... True
            head_lr_mult .................................... 1.0
            hidden_dropout .................................. 0.1
            hidden_size ..................................... 6144
            hysteresis ...................................... 2
            ict_head_size ................................... None
            ict_load ........................................ None
            img_h ........................................... 224
            img_w ........................................... 224
            indexer_batch_size .............................. 128
            indexer_log_interval ............................ 1000
            inference_batch_times_seqlen_threshold .......... 512
            init_method_std ................................. 0.01275
            init_method_xavier_uniform ...................... False
            initial_loss_scale .............................. 4294967296
            iter_per_epoch .................................. 1250
            kv_channels ..................................... 128
            layernorm_epsilon ............................... 1e-05
            lazy_mpu_init ................................... None
            load ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
            local_rank ...................................... 0
            log_batch_size_to_tensorboard ................... False
            log_interval .................................... 10
            log_learning_rate_to_tensorboard ................ True
            log_loss_scale_to_tensorboard ................... True
            log_memory_to_tensorboard ....................... False
            log_num_zeros_in_grad ........................... False
            log_params_norm ................................. False
            log_timers_to_tensorboard ....................... False
            log_validation_ppl_to_tensorboard ............... False
            log_world_size_to_tensorboard ................... False
            loss_scale ...................................... None
            loss_scale_window ............................... 1000
            lr .............................................. 0.0003
            lr_decay_iters .................................. 250000
            lr_decay_samples ................................ None
            lr_decay_style .................................. cosine
            lr_warmup_fraction .............................. None
            lr_warmup_iters ................................. 2000
            lr_warmup_samples ............................... 0
            make_vocab_size_divisible_by .................... 128
            mask_factor ..................................... 1.0
            mask_prob ....................................... 0.15
            mask_type ....................................... random
            masked_softmax_fusion ........................... True
            max_position_embeddings ......................... 8192
            merge_file ...................................... None
            micro_batch_size ................................ 1
            min_loss_scale .................................. 1.0
            min_lr .......................................... 3e-05
            mmap_warmup ..................................... False
            no_load_optim ................................... None
            no_load_rng ..................................... None
            no_persist_layer_norm ........................... True
            no_save_optim ................................... None
            no_save_rng ..................................... None
            num_attention_heads ............................. 48
            num_channels .................................... 3
            num_classes ..................................... 1000
            num_experts ..................................... None
            num_layers ...................................... 40
            num_layers_per_virtual_pipeline_stage ........... None
            num_workers ..................................... 2
            onnx_safe ....................................... None
            openai_gelu ..................................... False
            optimizer ....................................... adam
            override_opt_param_scheduler .................... False
            params_dtype .................................... torch.bfloat16
            patch_dim ....................................... 16
            perform_initialization .......................... True
            pipeline_model_parallel_size .................... 1
            pipeline_model_parallel_split_rank .............. None
            position_embedding_type ......................... PositionEmbeddingType.absolute
            query_in_block_prob ............................. 0.1
            rampup_batch_size ............................... None
            rank ............................................ 0
            recompute_granularity ........................... None
            recompute_method ................................ None
            recompute_num_layers ............................ 1
            reset_attention_mask ............................ False
            reset_position_ids .............................. False
            retriever_report_topk_accuracies ................ []
            retriever_score_scaling ......................... False
            retriever_seq_length ............................ 256
            sample_rate ..................................... 1.0
            save ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
            save_interval ................................... 2500
            scatter_gather_tensors_in_pipeline .............. True
            seed ............................................ 1234
            seq_length ...................................... 8192
            sequence_parallel ............................... False
            sgd_momentum .................................... 0.9
            short_seq_prob .................................. 0.1
            split ........................................... None
            standalone_embedding_stage ...................... False
            start_weight_decay .............................. 0.1
            structured_logs ................................. False
            structured_logs_dir ............................. None
            swin_backbone_type .............................. tiny
            tensor_model_parallel_size ...................... 1
            tensorboard_dir ................................. None
            tensorboard_log_interval ........................ 1
            tensorboard_queue_size .......................... 1000
            test_weighted_split_paths ....................... None
            test_weighted_split_paths_path .................. None
            titles_data_path ................................ None
            tokenizer_file .................................. /home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
            tokenizer_type .................................. TokenizerFromFile
            train_iters ..................................... 250000
            train_samples ................................... None
            train_weighted_split_paths ...................... None
            train_weighted_split_paths_path ................. None
            transformer_pipeline_model_parallel_size ........ 1
            transformer_timers .............................. False
            use_checkpoint_args ............................. False
            use_checkpoint_opt_param_scheduler .............. False
            use_contiguous_buffers_in_local_ddp ............. True
            use_cpu_initialization .......................... None
            use_distributed_optimizer ....................... True
            use_flash_attn .................................. True
            use_one_sent_docs ............................... False
            valid_num_workers ............................... 0
            valid_weighted_split_paths ...................... None
            valid_weighted_split_paths_path ................. None
            virtual_pipeline_model_parallel_size ............ None
            vision_backbone_type ............................ vit
            vision_pretraining .............................. False
            vision_pretraining_type ......................... classify
            vocab_extra_ids ................................. 0
            vocab_file ...................................... None
            wandb_entity_name ............................... None
            wandb_project_name .............................. None
            weight_decay .................................... 0.1
            weight_decay_incr_style ......................... constant
            world_size ...................................... 1
          -------------------- end of arguments ---------------------
          setting number of micro-batches to constant 512
          > building TokenizerFromFile tokenizer ...
           > padded vocab (size: 49152) with 0 dummy tokens (new size: 49152)
          05:15:56.69 >>> Call to _initialize_distributed in File "/tmp/Megatron/megatron/initialize.py", line 220
          05:15:56.69  220 | def _initialize_distributed():
          05:15:56.69  222 |     args = get_args()
          05:15:56.69 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
          05:15:56.69  224 |     device_count = torch.cuda.device_count()
          05:15:56.69 .......... device_count = 2
          05:15:56.69  225 |     if torch.distributed.is_initialized():
          05:15:56.69  235 |         if args.rank == 0:
          05:15:56.69  236 |             print('> initializing torch distributed ...', flush=True)
          > initializing torch distributed ...
          05:15:56.69  238 |         if device_count > 0:
          05:15:56.69  239 |             device = args.rank % device_count
          05:15:56.69 .................. device = 0
          05:15:56.69  240 |             if args.local_rank is not None:
          05:15:56.69  241 |                 assert args.local_rank == device, \
          05:15:56.69  245 |             torch.cuda.set_device(device)
          05:15:56.70  249 |         torch.distributed.init_process_group(
          05:15:56.70  250 |             backend="gloo",#args.distributed_backend,
          05:15:56.70  251 |             world_size=args.world_size, rank=args.rank,
          05:15:56.70  252 |             timeout=timedelta(seconds=args.distributed_timeout))
          05:15:56.70  249 |         torch.distributed.init_process_group(
          05:15:56.70  256 |     if device_count > 0:
          05:15:56.70  257 |         if mpu.model_parallel_is_initialized():
          05:15:56.70  260 |             mpu.initialize_model_parallel(args.tensor_model_parallel_size,
          05:15:56.70  261 |                                           args.pipeline_model_parallel_size,
          05:15:56.70  262 |                                           args.virtual_pipeline_model_parallel_size,
          05:15:56.70  263 |                                           args.pipeline_model_parallel_split_rank)
          05:15:56.70  260 |             mpu.initialize_model_parallel(args.tensor_model_parallel_size,
          > initializing tensor model parallel with size 1
          > initializing pipeline model parallel with size 1
          05:15:56.70 <<< Return value from _initialize_distributed: None
          > setting random seeds to 1234 ...
          > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
          05:15:56.70 >>> Call to _compile_dependencies in File "/tmp/Megatron/megatron/initialize.py", line 160
          05:15:56.70  160 | def _compile_dependencies():
          05:15:56.70  162 |     args = get_args()
              05:15:56.73 >>> Call to get_args in File "/tmp/Megatron/megatron/global_vars.py", line 38
              05:15:56.73   38 | def get_args():
              05:15:56.73   40 |     _ensure_var_is_initialized(_GLOBAL_ARGS, 'args')
              05:15:56.73   41 |     return _GLOBAL_ARGS
              05:15:56.73 <<< Return value from get_args: Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
          05:15:56.73  162 |     args = get_args()
          05:15:56.73 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
          05:15:56.73  168 |     if torch.distributed.get_rank() == 0:
              05:15:56.84 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
              05:15:56.84 ...... group = <object object at 0x7fe25503e6c0>
              05:15:56.84  584 | def get_rank(group=group.WORLD):
              05:15:56.84  600 |     if _rank_not_in_group(group):
              05:15:56.84  603 |     _check_default_pg()
              05:15:56.84  604 |     if group == GroupMember.WORLD:
              05:15:56.84  605 |         return _default_pg.rank()
              05:15:56.84 <<< Return value from get_rank: 0
          05:15:56.84  168 |     if torch.distributed.get_rank() == 0:
          05:15:56.84  169 |         start_time = time.time()
          05:15:56.84 .............. start_time = 1686719756.846662
          05:15:56.84  170 |         print('> compiling dataset index builder ...')
          > compiling dataset index builder ...
          05:15:56.84  171 |         from megatron.data.dataset_utils import compile_helper
          05:15:56.84 .............. compile_helper = <function compile_helper at 0x7fe24b749280>
          05:15:56.84  172 |         compile_helper()
              05:15:56.92 >>> Call to compile_helper in File "/tmp/Megatron/megatron/data/dataset_utils.py", line 81
              05:15:56.92   81 | def compile_helper():
              05:15:56.92   84 |     import os
              05:15:56.92 .......... os = <module 'os' from '/opt/conda/envs/starcoder/lib/python3.8/os.py'>
              05:15:56.92   85 |     import subprocess
              05:15:56.92 .......... subprocess = <module 'subprocess' from '/opt/conda/envs/starcoder/lib/python3.8/subprocess.py'>
              05:15:56.92   86 |     path = os.path.abspath(os.path.dirname(__file__))
              05:15:56.92 .......... path = '/tmp/Megatron/megatron/data'
              05:15:56.92   87 |     ret = subprocess.run(['make', '-C', path])
          make: Entering directory '/tmp/Megatron/megatron/data'
          make: Nothing to be done for 'default'.
          make: Leaving directory '/tmp/Megatron/megatron/data'
              05:15:56.96 .......... ret = CompletedProcess(args=['make', '-C', '/tmp/Megatron/megatron/data'], returncode=0)
              05:15:56.96   88 |     if ret.returncode != 0:
              05:15:56.96 <<< Return value from compile_helper: None
          05:15:56.96  172 |         compile_helper()
          05:15:56.96  173 |         print('>>> done with dataset index builder. Compilation time: {:.3f} '
          05:15:56.96  174 |               'seconds'.format(time.time() - start_time), flush=True)
          05:15:56.96  173 |         print('>>> done with dataset index builder. Compilation time: {:.3f} '
          05:15:56.96  174 |               'seconds'.format(time.time() - start_time), flush=True)
          05:15:56.96  173 |         print('>>> done with dataset index builder. Compilation time: {:.3f} '
          >>> done with dataset index builder. Compilation time: 0.114 seconds
          05:15:56.96  181 |     seq_len = args.seq_length
          05:15:56.96 .......... seq_len = 8192
          05:15:56.96  182 |     attn_batch_size = \
          05:15:56.96  183 |         (args.num_attention_heads / args.tensor_model_parallel_size) * \
          05:15:56.96  184 |         args.micro_batch_size
          05:15:56.96  183 |         (args.num_attention_heads / args.tensor_model_parallel_size) * \
          05:15:56.96  182 |     attn_batch_size = \
          05:15:56.96 .......... attn_batch_size = 48.0
          05:15:56.96  187 |     custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
          05:15:56.96  188 |         seq_len % 4 == 0 and attn_batch_size % 4 == 0
          05:15:56.96  187 |     custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
          05:15:56.96  188 |         seq_len % 4 == 0 and attn_batch_size % 4 == 0
          05:15:56.96  187 |     custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
          05:15:56.96 .......... custom_kernel_constraint = True
          05:15:56.96  190 |     if not ((args.fp16 or args.bf16) and
          05:15:56.96  191 |             custom_kernel_constraint and
          05:15:56.96  190 |     if not ((args.fp16 or args.bf16) and
          05:15:56.96  192 |             args.masked_softmax_fusion):
          05:15:56.96  190 |     if not ((args.fp16 or args.bf16) and
          05:15:56.96  199 |     if torch.distributed.get_rank() == 0:
              05:15:56.96 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
              05:15:56.96 ...... group = <object object at 0x7fe25503e6c0>
              05:15:56.96  584 | def get_rank(group=group.WORLD):
              05:15:56.96  600 |     if _rank_not_in_group(group):
              05:15:56.96  603 |     _check_default_pg()
              05:15:56.96  604 |     if group == GroupMember.WORLD:
              05:15:56.96  605 |         return _default_pg.rank()
              05:15:56.96 <<< Return value from get_rank: 0
          05:15:56.96  199 |     if torch.distributed.get_rank() == 0:
          05:15:56.96  200 |         start_time = time.time()
          05:15:56.96 .............. start_time = 1686719756.9662645
          05:15:56.96  201 |         print('> compiling and loading fused kernels ...', flush=True)
          > compiling and loading fused kernels ...
          05:15:56.96  202 |         fused_kernels.load(args)
              05:15:56.96 >>> Call to load in File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 4
              05:15:56.96 ...... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
              05:15:56.96    4 | def load(args):
              05:15:56.96    5 |     if torch.version.hip is None:
              05:15:56.96    6 |         print("running on CUDA devices")
          running on CUDA devices
              05:15:56.96    7 |         from megatron.fused_kernels.cuda import load as load_kernels
              05:15:58.87 .............. load_kernels = <function load at 0x7fe2422201f0>
              05:15:58.87   12 |     load_kernels(args)
          Detected CUDA files, patching ldflags
          Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
          Building extension module scaled_upper_triang_masked_softmax_cuda...
          Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
          [1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o 
          FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o 
          /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (const char *const)
                    detected during:
                      instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<const char *const &>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(1375): here
                      instantiation of "__nv_bool pybind11::detail::object_api<Derived>::contains(T &&) const [with Derived=pybind11::handle, T=const char *const &]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/detail/internals.h(176): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                    detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(201): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::handle, pybind11::handle)
                    detected during:
                      instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle &, pybind11::handle &>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
                      instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::handle &, pybind11::handle &>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(755): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::handle, pybind11::handle, pybind11::none, pybind11::str)
                    detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle, pybind11::handle, pybind11::none, pybind11::str>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(971): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::object, const pybind11::handle)
                    detected during:
                      instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &, const pybind11::handle &>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
                      instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::object &, const pybind11::handle &>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1401): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::cpp_function)
                    detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1407): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::cpp_function, pybind11::none, pybind11::none, const char [1])
                    detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function, pybind11::none, pybind11::none, const char (&)[1]>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1418): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::tuple)
                    detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::tuple &>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1812): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::object)
                    detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1830): here

          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
                      argument types are: (pybind11::object)
                    detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object>]" 
          /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1831): here

          10 errors detected in the compilation of "/tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu".
          ninja: build stopped: subcommand failed.
              05:16:05.35 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
              05:16:05.35 !!! When calling: load_kernels(args)
              05:16:05.35 !!! Call ended by exception
          05:16:05.35  202 |         fused_kernels.load(args)
          05:16:05.39 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
          05:16:05.39 !!! When calling: fused_kernels.load(args)
          05:16:05.39 !!! Call ended by exception
          Traceback (most recent call last):
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build
              subprocess.run(
            File "/opt/conda/envs/starcoder/lib/python3.8/subprocess.py", line 516, in run
              raise CalledProcessError(retcode, process.args,
          subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

          The above exception was the direct cause of the following exception:

          Traceback (most recent call last):
            File "pretrain_gpt.py", line 158, in <module>
              pretrain(train_valid_test_datasets_provider, model_provider,
            File "/tmp/Megatron/megatron/training.py", line 107, in pretrain
              initialize_megatron(extra_args_provider=extra_args_provider,
            File "/tmp/Megatron/megatron/initialize.py", line 106, in initialize_megatron
              _compile_dependencies()
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/snoop/tracer.py", line 173, in simple_wrapper
              return function(*args, **kwargs)
            File "/tmp/Megatron/megatron/initialize.py", line 202, in _compile_dependencies
              fused_kernels.load(args)
            File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 12, in load
              load_kernels(args)
            File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 70, in load
              scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
            File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 42, in _cpp_extention_load_helper
              return cpp_extension.load(
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load
              return _jit_compile(
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile
              _write_ninja_file_and_build_library(
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library
              _run_ninja_build(
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build
              raise RuntimeError(message) from e
          RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
          Traceback (most recent call last):
            File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 194, in _run_module_as_main
              return _run_code(code, main_globals, None,
            File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 87, in _run_code
              exec(code, run_globals)
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
              main()
            File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
              raise subprocess.CalledProcessError(returncode=process.returncode,
          subprocess.CalledProcessError: Command '['/opt/conda/envs/starcoder/bin/python', '-u', 'pretrain_gpt.py', '--local_rank=0', '--tensor-model-parallel-size', '1', '--pipeline-model-parallel-size', '1', '--num-layers', '40', '--hidden-size', '6144', '--num-attention-heads', '48', '--attention-head-type', 'multiquery', '--init-method-std', '0.01275', '--seq-length', '8192', '--max-position-embeddings', '8192', '--attention-dropout', '0.1', '--hidden-dropout', '0.1', '--micro-batch-size', '1', '--global-batch-size', '512', '--lr', '0.0003', '--min-lr', '0.00003', '--train-iters', '250000', '--lr-decay-iters', '250000', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '2000', '--weight-decay', '.1', '--adam-beta2', '.95', '--clip-grad', '1.0', '--bf16', '--use-flash-attn', '--fim-rate', '0.5', '--log-interval', '10', '--save-interval', '2500', '--eval-interval', '2500', '--eval-iters', '2', '--use-distributed-optimizer', '--valid-num-workers', '0', '--tokenizer-type', 'TokenizerFromFile', '--tokenizer-file', '/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json', '--save', '/home/jupyter/Satya/Megatron/Model_starcoder/', '--load', '/home/jupyter/Satya/Megatron/Model_starcoder/']' returned non-zero exit status 1.
          examples/pretrain_starcoder.sh: line 75: --structured-logs: command not found

In the run above I also enabled a snoop trace. Below is the main error:

      Detected CUDA files, patching ldflags
      Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
      Building extension module scaled_upper_triang_masked_softmax_cuda...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      [1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o 
         FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
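For faster iteration, the failing kernel can also be built in isolation, outside the full pretraining launch. A rough sketch follows; the /tmp/Megatron checkout path comes from the log above, while the .cpp companion filename is an assumption and may differ in this fork:

      # Standalone rebuild of just the failing extension via torch.utils.cpp_extension.load.
      # Paths follow the log above (/tmp/Megatron); the .cpp source filename is assumed.
      from pathlib import Path
      from torch.utils import cpp_extension

      srcdir = Path("/tmp/Megatron/megatron/fused_kernels/cuda")
      cpp_extension.load(
          name="scaled_upper_triang_masked_softmax_cuda",
          sources=[
              str(srcdir / "scaled_upper_triang_masked_softmax.cpp"),   # assumed filename
              str(srcdir / "scaled_upper_triang_masked_softmax_cuda.cu"),
          ],
          extra_cuda_cflags=["-O3", "--use_fast_math"],
          verbose=True,
      )

Running this in the same environment reproduces the same pybind11 compile errors without launching pretrain_gpt.py, which makes it easier to test compiler or PyTorch version changes.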