Is there an existing issue for this?
Current Behavior
[2023-04-20 15:28:24,402] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-04-20 15:28:25,648] [WARNING] [cpu_adam.py:84:init] FP16 params for CPUAdam may not work on AMD CPUs
[2023-04-20 15:28:25,648] [WARNING] [cpu_adam.py:84:init] FP16 params for CPUAdam may not work on AMD CPUs
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/comleader/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.7159857749938965 seconds
Time to load cpu_adam op: 2.7485086917877197 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000100, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-04-20 15:28:30,114] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-04-20 15:28:30,129] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-04-20 15:28:30,129] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-04-20 15:28:30,129] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:133:init] Reduce bucket size 200000000
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:134:init] Allgather bucket size 200000000
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:135:init] CPU Offload: True
[2023-04-20 15:28:30,129] [INFO] [stage_1_and_2.py:136:init] Round robin gradient partitioning: False
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/comleader/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/comleader/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.35907936096191406 seconds
Loading extension module utils...
Time to load utils op: 0.40230345726013184 seconds
[2023-04-20 15:28:41,920] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1104052
[2023-04-20 15:28:43,576] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1104053
[2023-04-20 15:28:43,576] [ERROR] [launch.py:434:sigkill_handler] ['/home/comleader/anaconda3/envs/ChatGLM/bin/python3.9', '-u', 'main.py', '--local_rank=1', '--deepspeed', 'deepspeed1.json', '--do_train', '--train_file', 'AdvertiseGen/train.json', '--test_file', 'AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', 'THUDM/chatglm-6b', '--output_dir', './output/adgen-chatglm-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '5000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = -9
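A return code of -9 means the worker processes were terminated by SIGKILL, which on Linux is commonly the kernel OOM killer rather than a Python error. Since the log shows "CPU Offload: True", the Adam optimizer states for the full model live in host RAM. A rough back-of-envelope sketch of that footprint, assuming an approximate 6.2B parameter count for ChatGLM-6B and fp32 master weights plus two fp32 Adam states kept on the CPU (both are assumptions for illustration, not measured values):

```python
# Rough host-RAM estimate for CPU-offloaded Adam states (ZeRO stage 2).
# Assumptions: ~6.2e9 parameters, and the offloaded optimizer holds an
# fp32 master copy of the weights plus fp32 momentum and variance.
PARAMS = 6.2e9      # approximate ChatGLM-6B parameter count
BYTES_FP32 = 4

master_weights = PARAMS * BYTES_FP32   # fp32 copy of the weights
adam_momentum = PARAMS * BYTES_FP32    # exp_avg
adam_variance = PARAMS * BYTES_FP32    # exp_avg_sq

total_gib = (master_weights + adam_momentum + adam_variance) / 2**30
print(f"approx. CPU RAM for offloaded optimizer states: {total_gib:.1f} GiB")
```

Under these assumptions the optimizer states alone approach 70 GiB of host memory, on top of activations, buffers, and the fp16 model itself, so a machine with less free RAM than that would plausibly see its training processes killed with -9.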
Expected Behavior
No response
Steps To Reproduce
1. Use 2× RTX 3090 GPUs.
2. deepspeed1.json:
   {
     "train_micro_batch_size_per_gpu": "auto",
     "zero_allow_untested_optimizer": true,
     "fp16": {
       "enabled": "auto",
       "loss_scale": 0,
       "initial_scale_power": 16,
       "loss_scale_window": 1000,
       "hysteresis": 2,
       "min_loss_scale": 1
     },
     "optimizer": {
       "type": "AdamW",
       "params": {
         "lr": "auto",
         "betas": "auto",
         "eps": "auto",
         "weight_decay": "auto"
       }
     },
     "zero_optimization": {
       "stage": 2,
       "offload_optimizer": {
         "device": "cpu",
         "pin_memory": true
       },
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "overlap_comm": false,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "contiguous_gradients": true
     }
   }
3. Run bash ds_train_finetune.sh; the error log above then appears.
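Since the launcher reports the workers exiting with signal 9, one way to confirm whether the kernel OOM killer was responsible (assuming a Linux host; dmesg may require root) is to inspect the kernel log and current memory headroom:

```shell
# Look for OOM-killer entries around the time of the crash
# (may require root; prints nothing if no OOM kill was logged).
dmesg -T 2>/dev/null | grep -iE "out of memory|oom-killer|killed process" | tail -n 5

# Check free host RAM: CPU-offloaded optimizer states for a 6B model
# need tens of GiB of host memory in addition to GPU memory.
free -h
```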
Environment
Anything else?
No response