bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Exception: cuda rng state model-parallel-rng is not added #369

Open 520jefferson opened 1 year ago

520jefferson commented 1 year ago

I started the job and hit the error below. Environment: CUDA 12.0, torch 1.14.

Launch command:

```shell
deepspeed --num_gpus 2 pretrain_gpt_v2.py \
  --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 \
  --distributed-backend nccl \
  --num-layers 2 --hidden-size 64 --num-attention-heads 2 \
  --seq-length 1024 --max-position-embeddings 1024 \
  --micro-batch-size 1 --rampup-batch-size 2 2 1_000 --global-batch-size 16 \
  --train-samples 100 \
  --optimizer adam --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 \
  --lr 1e-4 --lr-warmup-samples 5 --clip-grad 1.0 --weight-decay 1e-1 \
  --vocab-file /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-vocab.json \
  --merge-file /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-merges.txt \
  --fp16 \
  --log-interval 10 --save-interval 100 --eval-interval 100 --eval-iters 10 \
  --checkpoint-activations \
  --save alibi_test --load alibi_test \
  --data-path /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document \
  --tensorboard-dir output_dir_tensorboard --tensorboard-queue-size 5 \
  --log-timers-to-tensorboard --log-batch-size-to-tensorboard \
  --log-validation-ppl-to-tensorboard \
  --deepspeed --deepspeed_config ./ds_config.json --zero-stage 1 \
  --deepspeed-activation-checkpointing
```

Output:

```
[2023-03-06 04:10:57,331] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-06 04:10:57,454] [INFO] [runner.py:548:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain_gpt_v2.py ... (same arguments as above)
[2023-03-06 04:10:59,741] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.16.5
[2023-03-06 04:10:59,741] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-03-06 04:10:59,741] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-03-06 04:10:59,741] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-03-06 04:10:59,741] [INFO] [launch.py:162:main] dist_world_size=2
[2023-03-06 04:10:59,741] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
using world size: 2, data-parallel-size: 2, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  abort_on_unmet_fused_kernel_constraints ......... False
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  bert_binary_head ................................ True
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  checkpoint_activations .......................... True
  checkpoint_in_cpu ............................... False
  checkpoint_num_layers ........................... 1
  clip_grad ....................................... 1.0
  codecarbon_dir .................................. None
  consumed_train_samples .......................... 0
  consumed_train_tokens ........................... 0
  consumed_valid_samples .......................... 0
  contigious_checkpointing ........................ False
  cpu_optimizer ................................... False
  cpu_torch_adam .................................. False
  curriculum_learning ............................. False
  data_impl ....................................... infer
  data_parallel_size .............................. 2
  data_path ....................................... ['/mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document']
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_seq_length .............................. None
  deepscale ....................................... False
  deepscale_config ................................ None
  deepspeed ....................................... True
  deepspeed_activation_checkpointing .............. True
  deepspeed_config ................................ ./ds_config.json
  deepspeed_mpi ................................... False
  distribute_checkpointed_activations ............. False
  distributed_backend ............................. nccl
  embed_layernorm ................................. False
  embedding_path .................................. None
  encoder_seq_length .............................. 1024
  eod_mask_loss ................................... False
  eval_interval ................................... 100
  eval_iters ...................................... 10
  eval_only ....................................... None
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  ffn_hidden_size ................................. 256
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  gigaflos_no_embeds .............................. 0
  global_batch_size ............................... 16
  glu_activation .................................. None
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 64
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_dim ......................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference ....................................... False
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  kill_switch_path ................................ None
  kv_channels ..................................... 32
  layernorm_epsilon ............................... 1e-05
  lazy_mpu_init ................................... None
  load ............................................ alibi_test
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... True
  log_interval .................................... 10
  log_learning_rate_to_tensorboard ................ True
  log_level ....................................... None
  log_level_replica ............................... None
  log_loss_scale_to_tensorboard ................... True
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_path ........................................ None
  log_timers_to_tensorboard ....................... True
  log_validation_ppl_to_tensorboard ............... True
  loss_on_targets_only ............................ False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_decay_tokens ................................. None
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 5
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 1024
  mean_noise_span_length .......................... None
  memory_centric_tiled_linear ..................... False
  merge_file ...................................... /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-merges.txt
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mmap_warmup ..................................... False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_save_optim ................................... None
  no_save_rng ..................................... None
  noise_density ................................... None
  num_attention_heads ............................. 2
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_layers ...................................... 2
  num_layers_per_virtual_pipeline_stage ........... None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  override_lr_scheduler ........................... False
  pad_vocab_size_to ............................... None
  params_dtype .................................... torch.float16
  partition_activations ........................... False
  patch_dim ....................................... 16
  pipeline_model_parallel_size .................... 1
  position_embedding_type ......................... PositionEmbeddingType.absolute
  pp_partition_method ............................. None
  profile_backward ................................ False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... ['2', '2', '1_000']
  rank ............................................ 0
  remote_device ................................... none
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  reweight_loss_based_on_position_frequency ....... False
  sample_rate ..................................... 1.0
  save ............................................ alibi_test
  save_interval ................................... 100
  scatter_gather_tensors_in_pipeline .............. True
  scattered_embeddings ............................ False
  seed ............................................ 1234
  seq_length ...................................... 1024
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train_iteration_range ...................... None
  split ........................................... 969, 30, 1
  split_transformers .............................. False
  sync_tp_duplicated_parameters ................... False
  synchronize_each_layer .......................... False
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. output_dir_tensorboard
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 5
  test_weighted_split_names ....................... None
  test_weighted_split_paths ....................... None
  test_weighted_split_paths_path .................. None
  test_weighted_split_splits ...................... None
  test_weighted_split_weights ..................... None
  tile_factor ..................................... 1
  titles_data_path ................................ None
  tokenizer_name_or_path .......................... None
  tokenizer_type .................................. GPT2BPETokenizer
  train_iters ..................................... None
  train_samples ................................... 100
  train_tokens .................................... None
  train_weighted_split_paths ...................... None
  train_weighted_split_paths_path ................. None
  universal_checkpoint ............................ False
  use_bnb_optimizer ............................... False
  use_checkpoint_lr_scheduler ..................... False
  use_contiguous_buffers_in_ddp ................... False
  use_cpu_initialization .......................... None
  use_one_sent_docs ............................... False
  use_pin_memory .................................. False
  valid_num_workers ............................... 2
  valid_weighted_split_names ...................... None
  valid_weighted_split_paths ...................... None
  valid_weighted_split_paths_path ................. None
  valid_weighted_split_splits ..................... None
  valid_weighted_split_weights .................... None
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /mnt/dp_mega/Microsoft-Megatron-DeepSpeed/dataset/gpt2-vocab.json
  weight_decay .................................... 0.1
  world_size ...................................... 2
  zero_allgather_bucket_size ...................... 0.0
  zero_contigious_gradients ....................... False
  zero_reduce_bucket_size ......................... 0.0
  zero_reduce_scatter ............................. False
  zero_stage ...................................... 1
-------------------- end of arguments ---------------------
will use batch size rampup starting from global batch size 2 to global batch size 16 with batch size increments 2 over 1000 samples.
```
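For context, `--rampup-batch-size 2 2 1_000` is what produces that last rampup line: the global batch size starts at 2 and grows by 2 at a time up to `--global-batch-size 16`, with the increments spread over the first 1,000 consumed samples. A small sketch of that schedule (my reading of the log line, using a hypothetical `rampup_schedule` helper; not Megatron's exact code):

```python
def rampup_schedule(start=2, increment=2, rampup_samples=1_000, target=16):
    """Hypothetical helper: my reading of '--rampup-batch-size 2 2 1_000',
    not Megatron's exact implementation. The global batch size climbs from
    `start` to `target` in `increment`-sized steps spread evenly over
    `rampup_samples` consumed samples."""
    num_increments = (target - start) // increment      # (16 - 2) / 2 = 7 steps
    samples_per_step = rampup_samples / num_increments  # ~143 samples per step
    return [(round(i * samples_per_step), start + i * increment)
            for i in range(num_increments + 1)]

print(rampup_schedule())
# [(0, 2), (143, 4), (286, 6), (429, 8), (571, 10), (714, 12), (857, 14), (1000, 16)]
```

Note that with `--train-samples 100`, this toy run would stop long before the ramp completes.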

```
building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 12.0
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 1.14, cuda 12.0
setting tensorboard ...
Git info for Megatron: git_hash=e52bdab git_branch=main
initializing torch distributed ...
[2023-03-06 04:11:06,233] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/data'
```
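The "padded vocab" line is plain divisibility padding driven by `--make-vocab-size-divisible-by 128`: the GPT-2 vocab of 50,257 tokens is grown to the next multiple of 128 × tensor-model-parallel size. A quick sketch of the arithmetic (hypothetical `pad_vocab` helper, simplified from Megatron's `_vocab_size_with_padding`):

```python
def pad_vocab(vocab_size=50257, divisible_by=128, tp_size=1):
    """Hypothetical helper mirroring Megatron's _vocab_size_with_padding:
    grow the vocab to the next multiple of
    make-vocab-size-divisible-by * tensor-model-parallel-size."""
    multiple = divisible_by * tp_size
    padded = ((vocab_size + multiple - 1) // multiple) * multiple
    return padded, padded - vocab_size

print(pad_vocab())  # (50304, 47) -- matches the log line above
```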

```
done with dataset index builder. Compilation time: 0.107 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
done with compiling and loading fused kernels. Compilation time: 2.865 seconds
time to initialize megatron (seconds): -33.941
[after megatron is initialized] datetime: 2023-03-06 04:11:10
building GPT model ...
args.deepspeed: True
args.deepspeed_config: ./ds_config.json
args.deepspeed: goes deepspeed ................
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1}
[2023-03-06 04:11:10,122] [INFO] [module.py:370:_partition_layers] Partitioning pipeline stages with method type:transformer
stage=0 layers=9
     0: _to_float16
     1: EmbeddingPipe
     2: <lambda>
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: undo
     6: MixedFusedLayerNorm
     7: EmbeddingPipe
     8: float16_to_fp32
  loss: CrossEntropy
Traceback (most recent call last):
  File "pretrain_gpt_v2.py", line 243, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "pretrain_gpt_v2.py", line 238, in main
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/training.py", line 401, in setup_model_and_optimizer
    model = get_model(model_provider_func)
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/training.py", line 269, in get_model
    model = model_provider_func(
  File "pretrain_gpt_v2.py", line 63, in model_provider
    model = GPTModelPipe(
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/model/gpt_model.py", line 315, in __init__
    super().__init__(layers=self.specs,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 203, in __init__
    self._build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 238, in _build
    self.tied_modules[layer.key] = layer.build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 69, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/model/language_model.py", line 131, in __init__
    self.word_embeddings = mpu.VocabParallelEmbedding(
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/mpu/layers.py", line 213, in __init__
    _initialize_affine_weight_gpu(self.weight, init_method,
  File "/mnt/dp_mega/BS-Megatron-DeepSpeed/megatron/mpu/layers.py", line 95, in _initialize_affine_weight_gpu
    with get_cuda_rng_tracker().fork():
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 174, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added
[2023-03-06 04:11:11,771] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 866
[2023-03-06 04:11:11,771] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 867
[2023-03-06 04:11:11,772] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'pretrain_gpt_v2.py', '--local_rank=1', ... (same arguments as above)] exits with return code = 1
```
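For anyone hitting this: the exception comes out of DeepSpeed's activation-checkpointing RNG tracker, which keeps a dict of named CUDA RNG states and refuses to `fork()` into a name that was never registered with `add()`. A minimal sketch of the pattern (simplified from `megatron/mpu/random.py` and DeepSpeed's `activation_checkpointing/checkpointing.py`; not the exact upstream code):

```python
import contextlib
import torch

class CudaRNGStatesTracker:
    """Sketch of the tracker behind the exception; simplified, not the
    exact Megatron/DeepSpeed implementation."""

    def __init__(self):
        self.states_ = {}  # name -> saved CUDA RNG state

    def add(self, name, seed):
        # Register a named RNG stream: seed it once and snapshot the state.
        orig_state = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states_[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig_state)

    @contextlib.contextmanager
    def fork(self, name='model-parallel-rng'):
        # The line the traceback ends on: fork() refuses to enter a named
        # state that was never add()-ed to *this* tracker instance.
        if name not in self.states_:
            raise Exception('cuda rng state {} is not added'.format(name))
        orig_state = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states_[name])
        try:
            yield
        finally:
            # Save where the named stream left off, restore the default one.
            self.states_[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig_state)
```

So the usual cause is an ownership/ordering mismatch: with `--deepspeed-activation-checkpointing`, `get_cuda_rng_tracker()` is routed to DeepSpeed's tracker, and if the `model-parallel-rng` seed was registered on Megatron's own tracker (or seed registration never ran before the model build), the fork fails exactly as above.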

XaviLv commented 1 year ago

See here