phamkhactu closed this issue 1 year ago.
The error appears to be
fatal: detected dubious ownership in repository at '/workspace'
To add an exception for this directory, call:
git config --global --add safe.directory /workspace
The reason that this is an issue is detailed here. The quoted passage tells you how to bypass this check, but if you are using a shared computer (e.g., a university cluster) you should not do so without thinking about it very carefully. The most likely core explanation is that something in the permissions on your computer is misconfigured.
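For reference, a minimal sketch (assuming a POSIX shell and a recent git) of how to inspect the ownership mismatch and, only if you decide the directory is trustworthy, add the exception:

```bash
# See who owns the repository versus who you are running as;
# a mismatch between the two is what trips git's ownership check.
ls -ld /workspace
id -un

# If you trust the directory (e.g., a throwaway container), add the exception:
git config --global --add safe.directory /workspace

# Or trust every directory (only sensible inside a single-user container):
git config --global --add safe.directory '*'
```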
Hi @StellaAthena, I've set the config as you mentioned and the git error has disappeared, but I still get an error. Here are my logs:
---------------- end of arguments ----------------
NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1
[2023-05-18 02:58:36,147] [WARNING] [runner.py:193:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-18 02:58:36,147] [INFO] [runner.py:559:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed_config eyJ0cmFpbl9iYXRjaF9zaXplIjogMiwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDEsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk5OV0sICJlcHMiOiAxZS0wOH19LCAiZnAzMl9hbGxyZWR1Y2UiOiB0cnVlLCAiZnAxNiI6IHsiZW5hYmxlZCI6IHRydWUsICJ0eXBlIjogImJmbG9hdDE2IiwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMCwgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ== --megatron_config /workspace/megatron_config.json
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.12.10-1+cuda11.6
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10-1
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.12.10-1+cuda11.6
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-05-18 02:58:37,125] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-05-18 02:58:37,125] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-05-18 02:58:37,125] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-05-18 02:58:37,125] [INFO] [launch.py:162:main] dist_world_size=2
[2023-05-18 02:58:37,125] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
NeoXArgs.configure_distributed_args() using world size: 2 and model-parallel size: 1
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later and do you have tensorboard installed?), no TensorBoard logs will be written.
> initializing torch distributed ...
[2023-05-18 02:58:39,546] [INFO] [comm.py:661:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
> initializing model parallel with size 1
MPU DP: [0, 1]
MPU PP: [0]
MPU PP: [1]
MPU MP: [0]
MPU MP: [1]
> setting random seeds to 1234 ...
[2023-05-18 02:58:39,642] [INFO] [checkpointing.py:227:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/workspace/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1}
[2023-05-18 02:58:40,415] [INFO] [module.py:372:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
0: EmbeddingPipe
1: _pre_transformer_block
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: _post_transformer_block
15: NormPipe
16: ParallelLinearPipe
loss: partial
WARNING: APEX not installed - defaulting to deepspeed's fused adam
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
WARNING: APEX not installed - defaulting to deepspeed's fused adam
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.11797428131103516 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10116434097290039 seconds
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-05-18 02:58:40,576] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed info: version=0.8.3+5317ca6, git-hash=5317ca6, git-branch=main
[2023-05-18 02:58:40,933] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-05-18 02:58:40,933] [INFO] [logging.py:77:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-05-18 02:58:40,933] [INFO] [logging.py:77:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-05-18 02:58:40,936] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-18 02:58:40,936] [INFO] [logging.py:77:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7fb4615a02b0>
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2023-05-18 02:58:40,945] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print] amp_enabled .................. False
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print] amp_params ................... False
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print] bfloat16_enabled ............. False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] checkpoint_parallel_write_pipeline False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] checkpoint_tag_validation_enabled True
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] checkpoint_tag_validation_fail False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fb462199520>
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] communication_data_type ...... None
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] curriculum_enabled_legacy .... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] curriculum_params_legacy ..... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] data_efficiency_enabled ...... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] dataloader_drop_last ......... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] disable_allgather ............ False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] dump_state ................... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_enabled ........... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_gas_boundary_resolution 1
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_layer_num ......... 0
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_max_iter .......... 100
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_stability ......... 1e-06
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_tol ............... 0.01
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] eigenvalue_verbose ........... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] elasticity_enabled ........... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] fp16_auto_cast ............... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] fp16_enabled ................. True
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] fp16_master_weights_and_gradients False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] global_rank .................. 0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] grad_accum_dtype ............. None
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] gradient_accumulation_steps .. 1
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] gradient_clipping ............ 0.0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] gradient_predivide_factor .... 1.0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] initial_dynamic_scale ........ 65536
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] load_universal_checkpoint .... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] loss_scale ................... 0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] memory_breakdown ............. False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] optimizer_legacy_fusion ...... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] optimizer_name ............... adam
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] optimizer_params ............. {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] pld_enabled .................. False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] pld_params ................... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] prescale_gradients ........... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] scheduler_name ............... None
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print] scheduler_params ............. None
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] sparse_attention ............. None
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] sparse_gradients_enabled ..... False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] steps_per_print .............. 10
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] train_batch_size ............. 2
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] train_micro_batch_size_per_gpu 1
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] use_node_local_storage ....... False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] wall_clock_breakdown ......... True
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] world_size ................... 2
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] zero_allow_untested_optimizer False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] zero_enabled ................. False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] zero_force_ds_cpu_optimizer .. True
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print] zero_optimization_stage ...... 0
[2023-05-18 02:58:40,948] [INFO] [config.py:1007:print_user_config] json = {
"train_batch_size": 2,
"train_micro_batch_size_per_gpu": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1e-08
}
},
"fp32_allreduce": true,
"fp16": {
"enabled": true,
"type": "bfloat16",
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 0,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"contiguous_gradients": true
},
"wall_clock_breakdown": true
}
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.12190628051757812 seconds
Loading extension module utils...
Time to load utils op: 0.2017836570739746 seconds
[2023-05-18 02:58:41,150] [INFO] [engine.py:88:__init__] CONFIG: micro_batches=1 micro_batch_size=1
[2023-05-18 02:58:41,171] [INFO] [engine.py:144:__init__] RANK=0 STAGE=0 LAYERS=17 [0, 17) STAGE_PARAMS=162322944 (162.323M) TOTAL_PARAMS=162322944 (162.323M) UNIQUE_PARAMS=162322944 (162.323M)
> number of parameters on model parallel rank 0: 162322944
[2023-05-18 02:58:42,137] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 89
[2023-05-18 02:58:42,138] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90
[2023-05-18 02:58:42,140] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=1', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogMiwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDEsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk5OV0sICJlcHMiOiAxZS0wOH19LCAiZnAzMl9hbGxyZWR1Y2UiOiB0cnVlLCAiZnAxNiI6IHsiZW5hYmxlZCI6IHRydWUsICJ0eXBlIjogImJmbG9hdDE2IiwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMCwgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '/workspace/megatron_config.json'] exits with return code = -7
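(Aside: the launcher reports the negative of the signal number that killed the worker, so `return code = -7` means the process died with signal 7, SIGBUS, on Linux. During NCCL/dataloader startup that frequently indicates an exhausted `/dev/shm`, which is consistent with the shm fix suggested later in this thread. A quick check, assuming a Linux shell:)

```bash
# Map the signal number behind "return code = -7" to a name (prints BUS):
kill -l 7

# SIGBUS often means a memory-mapped file ran out of backing space;
# check how much shared memory this container actually has:
df -h /dev/shm
```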
@StellaAthena Thanks for your support. I have found my problem.
Hi @phamkhactu , would you be able to share how you fixed this issue? I'm running into the same problems.
It means that some packages are not compatible with your environment. You should build the Docker image, or pull the shared image; that will fix it.
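For example, a minimal sketch of building the image from the repo's own Dockerfile (assuming you are in a gpt-neox checkout; the tag name is a placeholder):

```bash
# From the root of a gpt-neox checkout, which ships a Dockerfile:
docker build -t gpt-neox .
```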
Hi, sorry for bumping, but I had a similar error with the same return code and no detailed explanation. I was running GPT-NeoX in a Docker container on local k8s.
My solution was to increase the shm size of the container, as noted in the README and NCCL's docs (see the sketch below). Cheers.
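Concretely, something like the following when starting the container; the image tag and the 1g size are placeholders to adapt, and `--ulimit memlock=-1` follows NCCL's container recommendations:

```bash
# Give the container enough shared memory for NCCL and the dataloader;
# Docker's default /dev/shm is only 64 MB, far too small for multi-GPU runs.
docker run --gpus all \
  --shm-size=1g --ulimit memlock=-1 \
  -v "$(pwd)":/workspace \
  -it gpt-neox
```

On k8s, the usual equivalent is mounting an `emptyDir` volume with `medium: Memory` at `/dev/shm`.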
Thanks for the excellent repo!
I followed the tutorial to train models, but I get an error.
My steps:
pip install -r requirements/requirements.txt
python ./megatron/fused_kernels/setup.py install
python prepare_data.py -d ./data
python ./deepy.py train.py -d configs bf16_125M.yml local_setup.yml
Environment:
Here are my full logs: