OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

Data processing hangs on the last batch #177

Closed xiaotingyun closed 1 year ago

xiaotingyun commented 1 year ago

In finetuner.py, the `group_texts` step hangs while processing the last batch; the progress bar stays stuck above 90%:

    if not data_args.streaming:
        lm_datasets = tokenized_datasets.map(
            group_texts,
            batched=True,
            batch_size=group_batch_size,
            num_proc=data_args.preprocessing_num_workers,
            load_from_cache_file=not data_args.overwrite_cache,
            desc=f"Grouping texts in chunks of {block_size}",
        )

(A screenshot of the stalled progress bar was attached here.)
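To narrow down where the last batch gets stuck, the same `datasets.map` grouping pattern can be reproduced on toy data. The following is a minimal sketch with illustrative sizes, not LMFlow code; it only assumes the `datasets` library is installed:

```python
# Minimal sketch (not LMFlow code): run the same grouping step on a toy
# dataset to check whether the hang comes from multiprocessing workers.
from datasets import Dataset

block_size = 512  # illustrative value

def group_texts(examples):
    # Concatenate all token lists in the batch, then split them into
    # block_size chunks, mirroring the grouping logic used in preprocessing.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }

toy = Dataset.from_dict({"input_ids": [[1] * 100 for _ in range(2000)]})

# Try num_proc=None (single process) first; if that finishes but num_proc > 1
# hangs near the end, the issue is in the worker processes, not the data.
grouped = toy.map(group_texts, batched=True, batch_size=1000, num_proc=None)
print(grouped)
```

If this completes with `num_proc=None` but stalls with `num_proc > 1`, the problem is more likely in the worker processes (or in memory pressure killing one of them) than in the grouping logic itself.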

xiaotingyun commented 1 year ago

With run_finetune_with_lora.sh, a single GPU reaches the training stage but then errors out. With two GPUs, it gets stuck at the data-processing stage.

Below is the single-GPU log (A6000, 48 GB VRAM):

[2023-04-09 06:06:04,824] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 06:06:04,839] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None examples/finetune.py --model_name_or_path llama-7b-hf --dataset_path data/MedQA-USMLE/train --output_dir output_models/finetune_with_lora --overwrite_output_dir --num_train_epochs 1 --learning_rate 1e-4 --block_size 512 --per_device_train_batch_size 1 --use_lora 1 --lora_r 8 --save_aggregated_lora 0 --deepspeed configs/ds_config_zero2.json --bf16 --run_name finetune_with_lora --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-09 06:06:06,273] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-09 06:06:06,273] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-09 06:06:06,273] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-09 06:06:06,273] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-09 06:06:06,273] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-09 06:06:08,686] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/09/2023 06:06:12 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/09/2023 06:06:13 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
模型加载完成 (model loading finished)
04/09/2023 06:07:45 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-cf11370b80a41887.arrow
04/09/2023 06:07:45 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-6a1a7cb60074899f.arrow
数据处理完成 (data processing finished)
ninja: no work to do.
Time to load cpu_adam op: 2.736064910888672 seconds
ninja: no work to do.
Time to load utils op: 0.34513020515441895 seconds
[2023-04-09 06:07:59,406] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 56725
[2023-04-09 06:07:59,406] [ERROR] [launch.py:324:sigkill_handler] ['/data/anaconda3/envs/ljy_lmflow/bin/python3.9', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'llama-7b-hf', '--dataset_path', 'data/MedQA-USMLE/train', '--output_dir', 'output_models/finetune_with_lora', '--overwrite_output_dir', '--num_train_epochs', '1', '--learning_rate', '1e-4', '--block_size', '512', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', 'configs/ds_config_zero2.json', '--bf16', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -11

Below is the two-GPU log (2*A6000):

[2023-04-09 05:58:21,015] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 05:58:21,031] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None examples/finetune.py --model_name_or_path llama-7b-hf --dataset_path data/MedQA-USMLE/train --output_dir output_models/finetune_with_lora --overwrite_output_dir --num_train_epochs 1 --learning_rate 1e-4 --block_size 512 --per_device_train_batch_size 1 --use_lora 1 --lora_r 8 --save_aggregated_lora 0 --deepspeed configs/ds_config_zero2.json --bf16 --run_name finetune_with_lora --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-09 05:58:22,516] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-09 05:58:22,517] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-09 05:58:22,517] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-09 05:58:22,517] [INFO] [launch.py:162:main] dist_world_size=2
[2023-04-09 05:58:22,517] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-09 05:58:24,948] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/09/2023 06:00:38 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
04/09/2023 06:00:39 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
04/09/2023 06:00:40 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
模型加载完成 (model loading finished)
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
模型加载完成 (model loading finished)

xiaotingyun commented 1 year ago

With run_finetune.sh, single-GPU training errors out during training, while with two GPUs it gets stuck at the model-loading stage.

Below is the single-GPU log:

[2023-04-09 06:37:33,738] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 06:37:33,754] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None examples/finetune.py --model_name_or_path llama-7b-hf --dataset_path data/MedQA-USMLE/train --output_dir output_models --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 60 --save_steps 5000 --dataloader_num_workers 1
[2023-04-09 06:37:35,197] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-09 06:37:35,197] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-09 06:37:35,197] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-09 06:37:35,197] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-09 06:37:35,197] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-09 06:37:37,596] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/09/2023 06:37:38 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/09/2023 06:37:40 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
[2023-04-09 06:37:48,541] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 6.74B parameters
模型加载完成 (model loading finished)
04/09/2023 06:38:01 - WARNING - datasets.fingerprint - Parameter 'function'=<function HFDecoderModel.tokenize.<locals>.tokenize_function at 0x7f7204255160> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
04/09/2023 06:38:01 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1c80317fa3b1799d.arrow
04/09/2023 06:38:01 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-bbe2d282518ba636.arrow
数据处理完成 (data processing finished)
ninja: no work to do.
Time to load cpu_adam op: 2.7378013134002686 seconds
ninja: no work to do.
Time to load utils op: 0.3301382064819336 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-04-09 06:38:11,247] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 9549
[2023-04-09 06:38:11,248] [ERROR] [launch.py:324:sigkill_handler] ['/data/anaconda3/envs/ljy_lmflow/bin/python3.9', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'llama-7b-hf', '--dataset_path', 'data/MedQA-USMLE/train', '--output_dir', 'output_models', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '60', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -11

Below is the two-GPU log:

[2023-04-09 06:27:16,850] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-09 06:27:16,866] [INFO] [runner.py:550:main] cmd = /data/anaconda3/envs/ljy_lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path data/MedQA-USMLE/train --output_dir output_models --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 60 --save_steps 5000 --dataloader_num_workers 1
[2023-04-09 06:27:18,342] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-09 06:27:18,343] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-09 06:27:18,343] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-09 06:27:18,343] [INFO] [launch.py:162:main] dist_world_size=2
[2023-04-09 06:27:18,343] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-09 06:27:20,772] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/09/2023 06:27:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/09/2023 06:27:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
04/09/2023 06:27:23 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
04/09/2023 06:27:23 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-dab165c44cd11ccd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)

research4pan commented 1 year ago

Thanks for your interest in LMFlow! Could you please check log/finetune/train.err to see the detailed error message? Also, it would be nice if you could provide the hardware settings of your server, such as RAM, number of CPUs. My guess is that the problem was caused by hardware resource issues. Thanks 😄

xiaotingyun commented 1 year ago

> Thanks for your interest in LMFlow! Could you please check log/finetune/train.err to see the detailed error message? Also, it would be nice if you could provide the hardware settings of your server, such as RAM, number of CPUs. My guess is that the problem was caused by hardware resource issues. Thanks 😄

There is no information in train.err. I think the hardware should be sufficient; below is the output of a few commands.

cat /proc/meminfo
MemTotal:        131945300 kB
MemFree:         7331668 kB
MemAvailable:    117447888 kB
Buffers:         3970796 kB
Cached:          103898576 kB
SwapCached:      257892 kB
Active:          71319500 kB
Inactive:        48543188 kB
Active(anon):    11543416 kB
Inactive(anon):  483164 kB
Active(file):    59776084 kB
Inactive(file):  48060024 kB
Unevictable:     0 kB
Mlocked:         0 kB
SwapTotal:       31457276 kB
SwapFree:        20853968 kB
Dirty:           456 kB
Writeback:       0 kB
AnonPages:       11918664 kB
Mapped:          1486400 kB
Shmem:           33180 kB
Slab:            3942532 kB
SReclaimable:    3337628 kB
SUnreclaim:      604904 kB
KernelStack:     22688 kB
PageTables:      85272 kB
NFS_Unstable:    0 kB
Bounce:          0 kB
WritebackTmp:    0 kB
CommitLimit:     97429924 kB
Committed_AS:    30086144 kB
VmallocTotal:    34359738367 kB
VmallocUsed:     0 kB
VmallocChunk:    0 kB
HardwareCorrupted: 0 kB
AnonHugePages:   0 kB
ShmemHugePages:  0 kB
ShmemPmdMapped:  0 kB
CmaTotal:        0 kB
CmaFree:         0 kB
HugePages_Total: 0
HugePages_Free:  0
HugePages_Rsvd:  0
HugePages_Surp:  0
Hugepagesize:    2048 kB
DirectMap4k:     6308116 kB
DirectMap2M:     123576320 kB
DirectMap1G:     5242880 kB

cat /proc/cpuinfo
model name      : AMD EPYC 7282 16-Core Processor
stepping        : 0
microcode       : 0x830104d
cpu MHz         : 1499.763
cache size      : 512 KB
physical id     : 0
siblings        : 32
core id         : 3
cpu cores       : 16
apicid          : 7
initial apicid  : 7
fpu             : yes

research4pan commented 1 year ago

Thanks for providing the detailed information!

For the single-GPU issue, my guess is that it was caused by OOM (running out of RAM), since processing the dataset and loading the model normally require a lot of RAM. You may try the following actions to see if it works:
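As a quick, illustrative check (a minimal sketch using psutil, which is not an LMFlow utility), RAM and swap usage can be watched in a second terminal while finetune.py runs:

```python
# Minimal sketch (assumes psutil is installed; not an LMFlow utility):
# print system memory usage every few seconds while the finetuning script
# runs in another terminal, to see whether the crash coincides with the
# machine running out of memory.
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used: {mem.percent:5.1f}%  "
          f"available: {mem.available / 1e9:6.1f} GB  "
          f"swap used: {swap.percent:5.1f}%")
    time.sleep(5)
```

If available RAM collapses right before the process exits with return code -11, memory exhaustion is the likely cause; if memory stays flat, the crash probably has another cause.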

For the multi-GPU issue, we have experienced similar problems before. Normally the hang is caused by multiple processes competing for resources and resolves itself after several minutes. If there is not enough RAM, this competition can become severe and result in a much longer hang. In short, make sure the amount of RAM is sufficient, and waiting for several minutes should resolve the problem.
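For context on why the multi-GPU run can sit at the preprocessing stage: each rank may try to build the same tokenized/grouped cache at once. A common Hugging Face pattern for this, shown here as a generic sketch rather than a verbatim excerpt of LMFlow, is to let rank 0 build the cache first and have the other ranks reuse it:

```python
# Generic sketch of the "main process first" pattern from the transformers
# API (not LMFlow code): rank 0 runs the preprocessing and writes the cache,
# the other ranks wait at a barrier and then load the cached Arrow files.
from datasets import Dataset
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="output_models")  # illustrative

toy = Dataset.from_dict({"input_ids": [[1, 2, 3]] * 8})  # stand-in dataset

with training_args.main_process_first(desc="toy preprocessing"):
    processed = toy.map(lambda ex: {"n_tokens": len(ex["input_ids"])})

print(processed)
```

Under a multi-rank launch, the non-zero ranks wait while rank 0 writes the cache and then load it instead of recomputing; in a single-process run the context manager is effectively a no-op.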

Hope that solves your problem. Thanks 😄

shizhediao commented 1 year ago

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks

Nehe12 commented 10 months ago

I get this error in my code:

[2023-11-08 12:23:44,484] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - La dirección solicitada no es válida en este contexto.).

(The Spanish system error means "The requested address is not valid in this context.")

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 35, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\llama\generation.py", line 92, in build
    torch.cuda.set_device(local_rank)
  File "C:\Python311\Lib\site-packages\torch\cuda\__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
    ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2023-11-08 12:23:49,537] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22516) of binary: C:\Python311\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python311\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Python311\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_chat_completion.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-11-08_12:23:49
  host       : CC
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 22516)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
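The AttributeError above ("module 'torch._C' has no attribute '_cuda_setDevice'") usually means the installed PyTorch is a CPU-only build, so torch.cuda.set_device has nothing to call into. A quick sanity check using only standard PyTorch calls (nothing LMFlow-specific):

```python
# Quick environment check (standard PyTorch API): verify that the installed
# torch build actually ships CUDA support before launching torchrun.
import torch

print("torch version:      ", torch.__version__)
print("compiled with CUDA: ", torch.version.cuda)   # None on CPU-only builds
print("cuda available:     ", torch.cuda.is_available())

if torch.cuda.is_available():
    torch.cuda.set_device(0)  # the call that failed in the traceback above
    print("device name:        ", torch.cuda.get_device_name(0))
else:
    print("No usable CUDA device; install a CUDA-enabled torch wheel "
          "or run a CPU-only configuration.")
```

If `torch.version.cuda` prints None, reinstalling PyTorch from a CUDA-enabled wheel is the fix rather than changing the launch command.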