OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

Full parameter fine-tuning cannot be trained #842

Open orderer0001 opened 1 month ago

orderer0001 commented 1 month ago

```
(lmflow_train) root@duxact:/data/projects/lmflow/LMFlow# ./scripts/run_finetune.sh \
    --model_name_or_path /data/guihunmodel8.8B \
    --dataset_path /data/projects/lmflow/case_report_data \
    --output_model_path /data/projects/lmflow/guihun_fintune_model
[2024-05-22 15:23:02,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:05,346] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-22 15:23:05,346] [INFO] [runner.py:555:main] cmd = /root/anaconda3/envs/lmflow_train/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /data/guihunmodel8.8B --trust_remote_code 0 --dataset_path /data/projects/lmflow/case_report_data --output_dir /data/projects/lmflow/guihun_fintune_model --overwrite_output_dir --conversation_template llama2 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-05-22 15:23:07,178] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2024-05-22 15:23:08,889] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
[2024-05-22 15:23:08,889] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2024-05-22 15:23:08,889] [INFO] [launch.py:163:main] dist_world_size=3
[2024-05-22 15:23:08,889] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2024-05-22 15:23:12,326] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,845] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,878] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-05-22 15:23:15,313] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,313] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,317] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,318] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,368] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
  warnings.warn(
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
  warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead.
  warnings.warn(
[WARNING|logging.py:314] 2024-05-22 15:23:18,032 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,186 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,236 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-22 15:23:20,000] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 8.03B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00,  3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00,  3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00,  3.06s/it]
[WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu121/cpu_adam...
[WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
[WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/lmflow_train/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -march=native -fopenmp -DAVX512 -D__DISABLE_CUDA__ -c /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286750555038452 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286848306655884 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.370280504226685 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-05-22 15:36:23,345] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806929
[2024-05-22 15:36:23,707] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806930
[2024-05-22 15:36:28,465] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806931
[2024-05-22 15:36:33,281] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/lmflow_train/bin/python', '-u', 'examples/finetune.py', '--local_rank=2', '--model_name_or_path', '/data/guihunmodel8.8B', '--trust_remote_code', '0', '--dataset_path', '/data/projects/lmflow/case_report_data', '--output_dir', '/data/projects/lmflow/guihun_fintune_model', '--overwrite_output_dir', '--conversation_template', 'llama2', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9
```
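For context, the run passes `--deepspeed configs/ds_config_zero3.json`, and the cpu_adam extension build in the log is what DeepSpeed triggers when ZeRO-3 offloads the optimizer to CPU. The snippet below is a minimal, illustrative sketch of that kind of config written out as Python; the actual contents of `configs/ds_config_zero3.json` shipped with LMFlow are an assumption here and may differ.

```python
# Illustrative sketch only -- NOT the verbatim configs/ds_config_zero3.json from the repo.
# It shows the kind of ZeRO-3 settings (optimizer/param offload to CPU) that cause
# DeepSpeed to JIT-compile the cpu_adam extension seen in the log above.
import json

zero3_offload_sketch = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Write the sketch to a file in the same JSON form the launcher expects.
with open("ds_config_zero3_sketch.json", "w") as f:
    json.dump(zero3_offload_sketch, f, indent=2)
```

With a config like this, the optimizer states for an ~8B-parameter full fine-tune are held in host RAM rather than on the GPUs, which is why the CPU Adam extension gets built at startup.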

wheresmyhair commented 1 month ago

Thanks for your interest in LMFlow! It seems that the CUDA toolkit installed on your system and the CUDA version your torch build expects do not match. You may refer to: https://github.com/microsoft/DeepSpeed/issues/3613. Feel free to leave a comment if you need further help.
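A generic way to check for that mismatch (an illustrative sketch, not something from the original thread) is to compare the CUDA version the installed torch wheel was built with against the toolkit that `nvcc` reports in the same environment:

```python
# Generic environment check (illustrative; not part of the original thread).
# DeepSpeed compiles its CUDA ops against the system toolkit, which must be compatible
# with the CUDA version the torch wheel was built with; otherwise ops such as cpu_adam
# fall back to CPU-only builds, as in the log above.
import shutil
import subprocess
import torch

print("torch version:         ", torch.__version__)
print("torch built with CUDA: ", torch.version.cuda)   # e.g. '12.1' for a cu121 wheel
print("CUDA available:        ", torch.cuda.is_available())

if shutil.which("nvcc"):
    # nvcc reports the system CUDA toolkit version DeepSpeed will compile against.
    print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
else:
    print("nvcc not found on PATH -- no full CUDA toolkit visible to DeepSpeed")
```

Running DeepSpeed's `ds_report` command in the same environment also prints the torch and system CUDA versions side by side, along with the compatibility status of each op, which is usually the fastest way to confirm the mismatch.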