THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

[BUG/Help] <Training with deepspeed keeps getting killed> #725

Open younger-diao opened 1 year ago

younger-diao commented 1 year ago

Is there an existing issue for this?

Current Behavior

```
[2023-04-20 15:28:24,402] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-04-20 15:28:25,648] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-04-20 15:28:25,648] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/comleader/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.7159857749938965 seconds
Time to load cpu_adam op: 2.7485086917877197 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000100, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-04-20 15:28:30,114] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-04-20 15:28:30,129] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-04-20 15:28:30,129] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-04-20 15:28:30,129] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 200000000
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 200000000
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: True
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/comleader/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.35907936096191406 seconds
Loading extension module utils...
Time to load utils op: 0.40230345726013184 seconds
[2023-04-20 15:28:41,920] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1104052
[2023-04-20 15:28:43,576] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1104053
[2023-04-20 15:28:43,576] [ERROR] [launch.py:434:sigkill_handler] ['/home/comleader/anaconda3/envs/ChatGLM/bin/python3.9', '-u', 'main.py', '--local_rank=1', '--deepspeed', 'deepspeed1.json', '--do_train', '--train_file', 'AdvertiseGen/train.json', '--test_file', 'AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', 'THUDM/chatglm-6b', '--output_dir', './output/adgen-chatglm-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '5000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = -9
```
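
A return code of -9 means the launcher's subprocesses were terminated by SIGKILL rather than crashing on their own. With ZeRO stage 2 and `offload_optimizer` set to `cpu`, the fp32 AdamW states (roughly 12 bytes per parameter, i.e. on the order of 70 GB for a ~6B-parameter model) are kept in host RAM, so the Linux OOM killer is a plausible cause when the machine has less RAM than that. The log itself does not confirm this, so it would need checking against the kernel log, for example:

```bash
# Check whether the kernel OOM killer terminated the training processes
dmesg -T | grep -iE "out of memory|killed process"

# Alternatively, on systemd-based systems
journalctl -k | grep -iE "out of memory|oom"
```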

Expected Behavior

No response

Steps To Reproduce

1. Hardware: 2 × RTX 3090 GPUs.
2. DeepSpeed config (`deepspeed1.json` in the launch command):

   ```json
   {
     "train_micro_batch_size_per_gpu": "auto",
     "zero_allow_untested_optimizer": true,
     "fp16": {
       "enabled": "auto",
       "loss_scale": 0,
       "initial_scale_power": 16,
       "loss_scale_window": 1000,
       "hysteresis": 2,
       "min_loss_scale": 1
     },
     "optimizer": {
       "type": "AdamW",
       "params": {
         "lr": "auto",
         "betas": "auto",
         "eps": "auto",
         "weight_decay": "auto"
       }
     },
     "zero_optimization": {
       "stage": 2,
       "offload_optimizer": {
         "device": "cpu",
         "pin_memory": true
       },
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "overlap_comm": false,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "contiguous_gradients": true
     }
   }
   ```

3. Run `bash ds_train_finetune.sh`; the error log above is produced (a reconstructed launch command is shown below).
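
For reference, the failing run corresponds roughly to the following launch, reconstructed from the launcher's error line above; the actual contents of `ds_train_finetune.sh` may differ slightly, e.g. in how the master port is set:

```bash
deepspeed --num_gpus=2 main.py \
    --deepspeed deepspeed1.json \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --test_file AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path THUDM/chatglm-6b \
    --output_dir ./output/adgen-chatglm-6b-ft-1e-4 \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 5000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-4 \
    --fp16
```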

Environment

- OS: Ubuntu 20.04
- Python: 3.9
- Transformers: 4.27.1
- PyTorch: 2.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

LvShuaiChao commented 1 year ago

Probably not enough GPU memory. Try a server with more RAM or more GPU memory and it should work.
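
Before retrying, it may help to check how much headroom the machine actually has; the commands below are standard Linux/NVIDIA tools, not part of this repo. With optimizer offload enabled, host RAM is usually the tighter constraint, while the fp16 weights, partitioned gradients, and activations stay on the GPUs.

```bash
# Host RAM: with ZeRO-2 optimizer offload, the fp32 AdamW states live here
free -h

# Per-GPU memory: fp16 weights, partitioned gradients, and activations stay on the GPUs
nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv
```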

twosnowman commented 1 year ago

How much GPU memory does deepspeed training need?

hexiaojin1314 commented 1 year ago

I've run into this problem too.