OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

`preprocessing_num_workers` cannot be used in `scripts/run_finetune.sh` #482

Open csyourui opened 1 year ago

csyourui commented 1 year ago

Describe the bug: In `hf_decoder_model`, running the tokenizer `map` with multiple `preprocessing_num_workers` raises `TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object`.
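For context, here is a minimal stand-in (not LMFlow code; a thread lock simulates the unpicklable `ProcessGroup` handle) showing why `datasets.map(..., num_proc=N)` fails when the mapping callable drags an unpicklable object into the worker processes:

```python
import pickle
import threading

# Stand-in for a model object whose state holds an unpicklable handle,
# analogous to torch._C._distributed_c10d.ProcessGroup under DeepSpeed.
class Model:
    def __init__(self):
        self.process_group = threading.Lock()  # unpicklable, like ProcessGroup
        self.vocab = {"hello": 0, "world": 1}

class TokenizeFn:
    """Mapping callable that captures the whole model."""
    def __init__(self, model):
        self.model = model
    def __call__(self, example):
        return {"ids": [self.model.vocab[t] for t in example["text"].split()]}

model = Model()
fn = TokenizeFn(model)
fn({"text": "hello world"})  # works in-process: {'ids': [0, 1]}

# map(num_proc=N) must ship the callable to worker processes, which
# requires pickling it -- and that fails on the captured handle:
try:
    pickle.dumps(fn)
except TypeError as e:
    print(e)  # cannot pickle '_thread.lock' object
```

With `num_proc` unset, the function runs in the main process and is never pickled, which is why the single-worker run succeeds.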

To Reproduce Steps to reproduce the behavior:

Add `--preprocessing_num_workers 20 \` to `scripts/run_finetune.sh`:

#!/bin/bash
# Please run this script under ${project_id} in project directory of
#   https://github.com/shizhediao/llm-ft
#     COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

deepspeed_args="--master_port=11000"      # Default argument
if [ $# -ge 1 ]; then
  deepspeed_args="$1"
fi

exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/alpaca/train

mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path gpt2 \
    --dataset_path ${dataset_path} \
    --preprocessing_num_workers 20 \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --block_size 512 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero3.json \
    --bf16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

Then start:

./scripts/run_finetune.sh

Screenshots

(lmflow) root@dev:/data/dev/gpt/LMFlow# ./scripts/run_finetune.sh
[2023-06-09 15:13:18,610] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-09 15:13:19,605] [INFO] [runner.py:550:main] cmd = /root/miniconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path /data/dev/gpt/LMFlow/data/alpaca/train --preprocessing_num_workers 20 --output_dir /data/dev/gpt/LMFlow/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.13.4-1
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-06-09 15:13:21,237] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-06-09 15:13:21,237] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-06-09 15:13:21,237] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-06-09 15:13:21,237] [INFO] [launch.py:162:main] dist_world_size=8
[2023-06-09 15:13:21,237] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-06-09 15:13:28,841] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
06/09/2023 15:13:29 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:29 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 4, device: cuda:4, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 5, device: cuda:5, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 7, device: cuda:7, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 6, device: cuda:6, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
[2023-06-09 15:13:49,650] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
Traceback (most recent call last):                                                                   
  File "/data/dev/gpt/LMFlow/examples/finetune.py", line 61, in <module>
    main()
  File "/data/dev/gpt/LMFlow/examples/finetune.py", line 57, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/data/dev/gpt/LMFlow/src/lmflow/pipeline/finetuner.py", line 210, in tune
    tokenized_dataset = model.tokenize(dataset)
  File "/data/dev/gpt/LMFlow/src/lmflow/models/hf_decoder_model.py", line 432, in tokenize
    tokenized_datasets = raw_datasets.map(
  File "/data/dev/gpt/LMFlow/src/lmflow/datasets/dataset.py", line 323, in map
    mapped_backend_dataset = self.backend_dataset.map(*args, **kwargs)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3046, in map
    for rank, done, content in iflatmap_unordered(
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in <listcomp>
    [async_result.get() for async_result in async_results]
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/connection.py", line 214, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 487, in dump
    self.save(obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 886, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 1493, in save_function
    pickler.save_reduce(_create_function, (obj.__code__,
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 692, in save_reduce
    save(args)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 1227, in save_cell
    pickler.save_reduce(_create_cell, (f,), obj=obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 692, in save_reduce
    save(args)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 886, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 713, in save_reduce
    self._batch_setitems(dictitems)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 713, in save_reduce
    self._batch_setitems(dictitems)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 713, in save_reduce
    self._batch_setitems(dictitems)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 1002, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 692, in save_reduce
    save(args)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
Running tokenizer on dataset (num_proc=20):   0%|                   | 0/52002 [00:00<?, ? examples/s][2023-06-09 15:14:13,505] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142597
[2023-06-09 15:14:13,505] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142598
[2023-06-09 15:14:14,680] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142599
[2023-06-09 15:14:15,076] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142600
[2023-06-09 15:14:15,394] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142727
[2023-06-09 15:14:15,821] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142728
[2023-06-09 15:14:16,254] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142795
[2023-06-09 15:14:17,570] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142856
[2023-06-09 15:14:18,084] [ERROR] [launch.py:324:sigkill_handler] ['/root/miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=7', '--model_name_or_path', 'gpt2', '--dataset_path', '/data/dev/gpt/LMFlow/data/alpaca/train', '--preprocessing_num_workers', '20', '--output_dir', '/data/dev/gpt/LMFlow/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
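The traceback shows dill descending through the map function's closure and module dicts until it reaches the `ProcessGroup` held by the model object. A hypothetical sketch of the usual fix pattern (an illustration, not the actual LMFlow patch): give `datasets.map` a callable that captures only picklable state, rather than an object holding the `ProcessGroup`.

```python
import pickle
import threading

# Hypothetical illustration only: the lock stands in for the unpicklable
# ProcessGroup, and the vocab stands in for the picklable tokenizer state.
class Model:
    def __init__(self):
        self.process_group = threading.Lock()  # stand-in for ProcessGroup
        self.vocab = {"hello": 0, "world": 1}

class TokenizeOnly:
    """Captures just the vocab (picklable), not the whole model."""
    def __init__(self, vocab):
        self.vocab = vocab
    def __call__(self, example):
        return {"ids": [self.vocab[t] for t in example["text"].split()]}

model = Model()
fn = TokenizeOnly(model.vocab)

# The callable now round-trips through pickle, so num_proc > 1 can ship
# it to worker processes:
clone = pickle.loads(pickle.dumps(fn))
print(clone({"text": "hello world"}))  # {'ids': [0, 1]}
```

The same logic runs either way; only the state attached to the callable changes.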

Package versions:

pip list
Package                  Version     Editable project location
------------------------ ----------- -----------------------------
absl-py                  1.4.0
accelerate               0.19.0
aiohttp                  3.8.4
aiosignal                1.3.1
antlr4-python3-runtime   4.9.3
appdirs                  1.4.4
async-timeout            4.0.2
attrs                    23.1.0
blinker                  1.6.2
certifi                  2023.5.7
chardet                  5.1.0
charset-normalizer       3.1.0
click                    8.1.3
cmake                    3.26.3
colorama                 0.4.6
cpm-kernels              1.0.11
DataProperty             0.55.1
datasets                 2.10.1
deepspeed                0.8.3
dill                     0.3.4
docker-pycreds           0.4.0
einops                   0.6.1
evaluate                 0.4.0
filelock                 3.12.0
flash-attn               1.0.4
Flask                    2.3.2
Flask-Cors               3.0.10
frozenlist               1.3.3
fsspec                   2023.5.0
gitdb                    4.0.10
GitPython                3.1.31
hjson                    3.1.0
huggingface-hub          0.14.1
icetk                    0.0.7
idna                     3.4
importlib-metadata       6.6.0
itsdangerous             2.1.2
Jinja2                   3.1.2
joblib                   1.2.0
jsonlines                3.1.0
lit                      16.0.3
lm-eval                  0.3.0
lmflow                   0.0.1       /data/dev/gpt/LMFlow/src
MarkupSafe               2.1.2
mbstrdecoder             1.1.2
mpi4py                   3.1.4
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.12.2
networkx                 3.1
ninja                    1.11.1
nltk                     3.8.1
numexpr                  2.8.4
numpy                    1.24.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
omegaconf                2.3.0
openai                   0.27.6
packaging                23.1
pandas                   2.0.1
pathtools                0.1.2
pathvalidate             2.5.2
peft                     0.3.0.dev0
Pillow                   9.5.0
pip                      23.0.1
portalocker              2.7.0
protobuf                 3.18.3
psutil                   5.9.5
py-cpuinfo               9.0.0
pyarrow                  12.0.0
pybind11                 2.10.4
pycountry                22.3.5
pydantic                 1.10.7
pytablewriter            0.64.2
python-dateutil          2.8.2
pytz                     2023.3
PyYAML                   6.0
regex                    2023.5.5
requests                 2.30.0
responses                0.18.0
rouge-score              0.1.2
sacrebleu                1.5.0
scikit-learn             1.2.2
scipy                    1.10.1
sentencepiece            0.1.99
sentry-sdk               1.22.2
setproctitle             1.3.2
setuptools               66.0.0
six                      1.16.0
smmap                    5.0.0
sqlitedict               2.1.0
sympy                    1.12
tabledata                1.3.1
tcolorpy                 0.1.3
threadpoolctl            3.1.0
tokenizers               0.13.3
torch                    2.0.0
torchvision              0.15.1
tqdm                     4.65.0
tqdm-multiprocess        0.0.11
transformers             4.28.0.dev0
triton                   2.0.0
trl                      0.4.2.dev0
typepy                   1.3.0
typing_extensions        4.5.0
tzdata                   2023.3
urllib3                  1.26.15
wandb                    0.14.0
Werkzeug                 2.3.4
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.9.2
zipp                     3.15.0
zstandard                0.21.0
shizhediao commented 1 year ago

Hi, does it work well without setting this parameter?

csyourui commented 1 year ago

Thank you for your reply. It works well if I do not set `preprocessing_num_workers`. However, I am curious why it does not work when this parameter is added. Can you reproduce this problem, or is it just an issue with my environment?

xiaozhu1106 commented 1 year ago

Same problem here: when loading the data, there is no way to process it in parallel.

iseesaw commented 6 months ago

same question

wheresmyhair commented 5 months ago

Same problem here: when loading the data, there is no way to process it in parallel.

FYI: We've located the bug; the dev team needs to perform a small-scale refactor to fix it. We will do so ASAP. Sorry for the inconvenience 🙏

wheresmyhair commented 5 months ago

Same problem here: when loading the data, there is no way to process it in parallel.

FYI: Bug fixed, please see https://github.com/OptimalScale/LMFlow/pull/845 🤗