OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

multi-gpu full para train error #855

Open tankeui opened 2 weeks ago

tankeui commented 2 weeks ago

[2024-06-12 19:36:07,800] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:09,648] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-12 19:36:09,648] [INFO] [runner.py:568:main] cmd = anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None LMFlow/examples/finetune.py --model_name_or_path huggingface/hub/Meta-Llama-3-70B --trust_remote_code 0 --dataset_path LMFlow/data/alpaca/train_conversation --output_dir output_models/finetune --overwrite_output_dir --conversation_template llama3 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed LMFlow/configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-06-12 19:36:11,661] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:12,366] [INFO] [launch.py:138:main] 0 TORCH_NCCL_BLOCKING_WAIT=1
[2024-06-12 19:36:12,366] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-06-12 19:36:12,366] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-06-12 19:36:12,366] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-06-12 19:36:12,366] [INFO] [launch.py:163:main] dist_world_size=2
[2024-06-12 19:36:12,366] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-06-12 19:36:12,419] [INFO] [launch.py:253:main] process 40472 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:12,466] [INFO] [launch.py:253:main] process 40473 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-12 19:36:20,965] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
rank1: Traceback (most recent call last):
rank1: File "LMFlow/examples/finetune.py", line 61, in

rank1: File "LMFlow/examples/finetune.py", line 44, in main
rank1: model_args, data_args, pipeline_args = parser.parse_args_into_dataclasses()
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
rank1: obj = dtype(**inputs)
rank1: File "", line 135, in __init__
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 1641, in __post_init__
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2149, in device
rank1: return self._setup_devices
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/utils/generic.py", line 59, in __get__
rank1: cached = self.fget(obj)
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2077, in _setup_devices
rank1: self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 280, in __init__
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 790, in set_device
rank1: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
rank1: RuntimeError: CUDA error: invalid device ordinal
rank1: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

06/12/2024 19:36:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
[2024-06-12 19:36:22,477] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40472
[2024-06-12 19:36:22,531] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40473
[2024-06-12 19:36:22,531] [ERROR] [launch.py:322:sigkill_handler] ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
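
For anyone reading the log: the `--world_info` value in the launcher command is URL-safe base64 over a JSON object, so it can be decoded to see which GPU slots the launcher planned to use. A minimal sketch using only the Python standard library (the helper name `decode_world_info` is made up for illustration and is not part of DeepSpeed or LMFlow):

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a --world_info value (URL-safe base64 of a JSON object)."""
    return json.loads(base64.urlsafe_b64decode(encoded))


# Value copied from the "cmd = ..." line of the log above.
world_info = decode_world_info("eyJsb2NhbGhvc3QiOiBbMCwgMV19")
print(world_info)  # {'localhost': [0, 1]}
```

The decoded mapping agrees with the `WORLD INFO DICT` line, so the launcher did plan two local ranks on `localhost`. The failure comes later, when rank 1 tries to select its device.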

`#!/bin/bash
# Please run this script under ${project_id} in project directory of
#   https://github.com/shizhediao/llm-ft
#     COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1

# Parses arguments
model_name_or_path=huggingface/hub/Meta-Llama-3-70B
dataset_path=LMFlow/data/alpaca/train_conversation
output_dir=output_models/finetune
deepspeed_args="--num_gpus=2 --master_port=11000"
conversation_template=llama3

# Safety related arguments
trust_remote_code=0

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  LMFlow/examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --conversation_template ${conversation_template} \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 256 \
    --per_device_train_batch_size 1 \
    --deepspeed LMFlow/configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err`

How can I fix this problem?

wheresmyhair commented 2 weeks ago

It seems like a CUDA device mismatch issue.

[rank1]: RuntimeError: CUDA error: invalid device ordinal

I guess you've accidentally set CUDA_VISIBLE_DEVICES somewhere else, which leads to a mismatch. Maybe look at: https://stackoverflow.com/questions/64334033/how-to-solve-runtimeerror-cuda-error-invalid-device-ordinal Or try changing:

deepspeed_args="--num_gpus=2 --master_port=11000" 

to

deepspeed_args="--include localhost:x,x --master_port=11000"
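
Here `x,x` stands for the GPU indices to use on that host, for example `0,1`. As a rough illustration of how such a mismatch can produce `invalid device ordinal`, the sketch below models which local ranks are addressable for a given `CUDA_VISIBLE_DEVICES` value. It is a simplified model, not DeepSpeed or PyTorch code: the function names are hypothetical, the physical GPU count is passed in by hand, and only numeric indices are handled (UUID entries are not).

```python
from typing import List, Optional


def addressable_gpus(cuda_visible_devices: Optional[str], physical_gpus: int) -> int:
    """Count the GPUs a process can address under CUDA_VISIBLE_DEVICES.

    Simplified model: entries are numeric indices, and CUDA stops at the
    first invalid index, ignoring it and everything after it.
    """
    if cuda_visible_devices is None:
        return physical_gpus  # variable unset: every physical GPU is visible
    count = 0
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        if not entry.isdigit() or int(entry) >= physical_gpus:
            break
        count += 1
    return count


def rank_is_valid(local_rank: int, cuda_visible_devices: Optional[str],
                  physical_gpus: int) -> bool:
    """True if selecting device `local_rank` would name an addressable GPU."""
    return 0 <= local_rank < addressable_gpus(cuda_visible_devices, physical_gpus)


def include_arg(host: str, gpu_ids: List[int]) -> str:
    """Build the launcher argument in the form suggested above."""
    return f"--include {host}:{','.join(str(i) for i in gpu_ids)}"


# Hypothetical scenario: the launcher sets CUDA_VISIBLE_DEVICES=0,1,
# but only one GPU is actually available to the process.
print(rank_is_valid(1, "0,1", physical_gpus=1))  # False
print(rank_is_valid(1, "0,1", physical_gpus=2))  # True
print(include_arg("localhost", [0, 1]))          # --include localhost:0,1
```

In the hypothetical one-GPU scenario, rank 1 has no device to bind to, which is the kind of situation in which `set_device` raises this error. Whether that was the actual cause in this issue is not established by the log.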
tankeui commented 2 weeks ago

Thanks, it solves my problem.