OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

`preprocessing_num_workers` cannot be used in `scripts/run_finetune.sh` #482

Open csyourui opened 1 year ago

csyourui commented 1 year ago

Describe the bug: Running the tokenizer `map` in `hf_decoder_model` with multiple `preprocessing_num_workers` raises `TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object`.
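For context, a minimal sketch of why this class of error occurs (this is not LMFlow's actual code; `TokenizeFn` and `handle` are hypothetical names). When `datasets.map(num_proc>1)` ships the mapping function to worker processes it must serialize it (datasets uses dill via multiprocess, but stdlib pickle shows the same failure). Any unpicklable object reachable from the function, such as a distributed `ProcessGroup` held by the model, breaks serialization; a `threading.Lock` stands in for the `ProcessGroup` here:

```python
import pickle
import threading

class TokenizeFn:
    """Stand-in for a tokenize helper (hypothetical name).

    `handle` plays the role of the ProcessGroup that the real model
    object holds; a threading.Lock is likewise unpicklable.
    """
    def __init__(self, handle=None):
        self.handle = handle

    def __call__(self, text):
        return text.split()

# Without the unpicklable handle, the callable serializes fine...
pickle.dumps(TokenizeFn())

# ...but once it references an unpicklable object, serialization fails,
# which is what multi-process map needs to do to reach the workers.
try:
    pickle.dumps(TokenizeFn(handle=threading.Lock()))
except TypeError as err:
    print(err)  # cannot pickle '_thread.lock' object
```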

To Reproduce

Steps to reproduce the behavior:

Add `--preprocessing_num_workers 20 \` to `scripts/run_finetune.sh`:

#!/bin/bash
# Please run this script under ${project_id} in project directory of
#   https://github.com/shizhediao/llm-ft
#     COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

deepspeed_args="--master_port=11000"      # Default argument
if [ $# -ge 1 ]; then
  deepspeed_args="$1"
fi

exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/alpaca/train

mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path gpt2 \
    --dataset_path ${dataset_path} \
    --preprocessing_num_workers 20 \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --block_size 512 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero3.json \
    --bf16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

Then start training:

./scripts/run_finetune.sh

Screenshots

(lmflow) root@dev:/data/dev/gpt/LMFlow# ./scripts/run_finetune.sh
[2023-06-09 15:13:18,610] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-09 15:13:19,605] [INFO] [runner.py:550:main] cmd = /root/miniconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path /data/dev/gpt/LMFlow/data/alpaca/train --preprocessing_num_workers 20 --output_dir /data/dev/gpt/LMFlow/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.13.4-1
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-06-09 15:13:21,237] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-06-09 15:13:21,237] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-06-09 15:13:21,237] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-06-09 15:13:21,237] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-06-09 15:13:21,237] [INFO] [launch.py:162:main] dist_world_size=8
[2023-06-09 15:13:21,237] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-06-09 15:13:28,841] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
06/09/2023 15:13:29 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:29 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 4, device: cuda:4, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 5, device: cuda:5, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 7, device: cuda:7, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:30 - WARNING - lmflow.pipeline.finetuner - Process rank: 6, device: cuda:6, n_gpu: 1,distributed training: True, 16-bits training: False
06/09/2023 15:13:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-0dfe5723824151c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
[... the same "Found cached dataset json" warning repeated on each of the 8 ranks ...]
[2023-06-09 15:13:49,650] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
[... the same UserWarning repeated on each of the 8 ranks ...]
Traceback (most recent call last):                                                                   
  File "/data/dev/gpt/LMFlow/examples/finetune.py", line 61, in <module>
    main()
  File "/data/dev/gpt/LMFlow/examples/finetune.py", line 57, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/data/dev/gpt/LMFlow/src/lmflow/pipeline/finetuner.py", line 210, in tune
    tokenized_dataset = model.tokenize(dataset)
  File "/data/dev/gpt/LMFlow/src/lmflow/models/hf_decoder_model.py", line 432, in tokenize
    tokenized_datasets = raw_datasets.map(
  File "/data/dev/gpt/LMFlow/src/lmflow/datasets/dataset.py", line 323, in map
    mapped_backend_dataset = self.backend_dataset.map(*args, **kwargs)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3046, in map
    for rank, done, content in iflatmap_unordered(
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in <listcomp>
    [async_result.get() for async_result in async_results]
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/connection.py", line 214, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 487, in dump
    self.save(obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 886, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 1493, in save_function
    pickler.save_reduce(_create_function, (obj.__code__,
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 692, in save_reduce
    save(args)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 1227, in save_cell
    pickler.save_reduce(_create_cell, (f,), obj=obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 692, in save_reduce
    save(args)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 886, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 713, in save_reduce
    self._batch_setitems(dictitems)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 713, in save_reduce
    self._batch_setitems(dictitems)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 713, in save_reduce
    self._batch_setitems(dictitems)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 1002, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 692, in save_reduce
    save(args)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 901, in save_tuple
    save(element)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/root/miniconda3/envs/lmflow/lib/python3.9/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
Running tokenizer on dataset (num_proc=20):   0%|                   | 0/52002 [00:00<?, ? examples/s][2023-06-09 15:14:13,505] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142597
[2023-06-09 15:14:13,505] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142598
[2023-06-09 15:14:14,680] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142599
[2023-06-09 15:14:15,076] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142600
[2023-06-09 15:14:15,394] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142727
[2023-06-09 15:14:15,821] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142728
[2023-06-09 15:14:16,254] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142795
[2023-06-09 15:14:17,570] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 142856
[2023-06-09 15:14:18,084] [ERROR] [launch.py:324:sigkill_handler] ['/root/miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=7', '--model_name_or_path', 'gpt2', '--dataset_path', '/data/dev/gpt/LMFlow/data/alpaca/train', '--preprocessing_num_workers', '20', '--output_dir', '/data/dev/gpt/LMFlow/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1

package versions:

pip list
Package                  Version     Editable project location
------------------------ ----------- -----------------------------
absl-py                  1.4.0
accelerate               0.19.0
aiohttp                  3.8.4
aiosignal                1.3.1
antlr4-python3-runtime   4.9.3
appdirs                  1.4.4
async-timeout            4.0.2
attrs                    23.1.0
blinker                  1.6.2
certifi                  2023.5.7
chardet                  5.1.0
charset-normalizer       3.1.0
click                    8.1.3
cmake                    3.26.3
colorama                 0.4.6
cpm-kernels              1.0.11
DataProperty             0.55.1
datasets                 2.10.1
deepspeed                0.8.3
dill                     0.3.4
docker-pycreds           0.4.0
einops                   0.6.1
evaluate                 0.4.0
filelock                 3.12.0
flash-attn               1.0.4
Flask                    2.3.2
Flask-Cors               3.0.10
frozenlist               1.3.3
fsspec                   2023.5.0
gitdb                    4.0.10
GitPython                3.1.31
hjson                    3.1.0
huggingface-hub          0.14.1
icetk                    0.0.7
idna                     3.4
importlib-metadata       6.6.0
itsdangerous             2.1.2
Jinja2                   3.1.2
joblib                   1.2.0
jsonlines                3.1.0
lit                      16.0.3
lm-eval                  0.3.0
lmflow                   0.0.1       /data/dev/gpt/LMFlow/src
MarkupSafe               2.1.2
mbstrdecoder             1.1.2
mpi4py                   3.1.4
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.12.2
networkx                 3.1
ninja                    1.11.1
nltk                     3.8.1
numexpr                  2.8.4
numpy                    1.24.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
omegaconf                2.3.0
openai                   0.27.6
packaging                23.1
pandas                   2.0.1
pathtools                0.1.2
pathvalidate             2.5.2
peft                     0.3.0.dev0
Pillow                   9.5.0
pip                      23.0.1
portalocker              2.7.0
protobuf                 3.18.3
psutil                   5.9.5
py-cpuinfo               9.0.0
pyarrow                  12.0.0
pybind11                 2.10.4
pycountry                22.3.5
pydantic                 1.10.7
pytablewriter            0.64.2
python-dateutil          2.8.2
pytz                     2023.3
PyYAML                   6.0
regex                    2023.5.5
requests                 2.30.0
responses                0.18.0
rouge-score              0.1.2
sacrebleu                1.5.0
scikit-learn             1.2.2
scipy                    1.10.1
sentencepiece            0.1.99
sentry-sdk               1.22.2
setproctitle             1.3.2
setuptools               66.0.0
six                      1.16.0
smmap                    5.0.0
sqlitedict               2.1.0
sympy                    1.12
tabledata                1.3.1
tcolorpy                 0.1.3
threadpoolctl            3.1.0
tokenizers               0.13.3
torch                    2.0.0
torchvision              0.15.1
tqdm                     4.65.0
tqdm-multiprocess        0.0.11
transformers             4.28.0.dev0
triton                   2.0.0
trl                      0.4.2.dev0
typepy                   1.3.0
typing_extensions        4.5.0
tzdata                   2023.3
urllib3                  1.26.15
wandb                    0.14.0
Werkzeug                 2.3.4
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.9.2
zipp                     3.15.0
zstandard                0.21.0
shizhediao commented 1 year ago

Hi, does it work well without setting this parameter?

csyourui commented 1 year ago

Thank you for your reply. It works well if I do not set `preprocessing_num_workers`. However, I am curious why it fails when this parameter is added. Can you reproduce this problem, or is it just an issue with my environment?
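Until the library itself is fixed, one possible mitigation is to probe whether the mapping function can be serialized before requesting worker processes, and fall back to single-process mapping otherwise. This is a hypothetical sketch, not LMFlow's actual fix: `safe_num_proc` and `BoundTokenizer` are invented names, and since datasets serializes with dill a real probe would use `dill.dumps`, but stdlib pickle illustrates the same check:

```python
import pickle
import threading

def safe_num_proc(fn, requested):
    """Return `requested` if `fn` can be serialized for worker
    processes, else fall back to 1 (single-process map)."""
    try:
        pickle.dumps(fn)
        return requested
    except (TypeError, AttributeError, pickle.PicklingError):
        return 1

class BoundTokenizer:
    """Callable holding an unpicklable handle (stand-in for a
    ProcessGroup), which forces the single-process fallback."""
    def __init__(self):
        self.handle = threading.Lock()  # unpicklable, like a ProcessGroup

    def __call__(self, text):
        return text.split()

print(safe_num_proc(len, 20))               # 20
print(safe_num_proc(BoundTokenizer(), 20))  # 1
```

The returned value could then be passed as `num_proc` to `raw_datasets.map(...)`, so the run degrades gracefully instead of crashing.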

xiaozhu1106 commented 11 months ago

Hitting the same problem: when loading the data, it cannot be preprocessed in parallel.

iseesaw commented 1 month ago

same question

wheresmyhair commented 1 month ago

Hitting the same problem: when loading the data, it cannot be preprocessed in parallel.

FYI: We've located the bug; the dev team needs to perform a small-scale refactoring to fix it. We will do so ASAP. Sorry for the inconvenience 🙏

wheresmyhair commented 1 month ago

Hitting the same problem: when loading the data, it cannot be preprocessed in parallel.

FYI: Bug fixed, please see https://github.com/OptimalScale/LMFlow/pull/845 🤗