OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

[BUG] Mapping the dataset fails when num_proc is set to 2 or 4 #840

Open nicosouth opened 1 month ago

nicosouth commented 1 month ago

Running tokenizer on dataset (num_proc=2): 0%| | 0/666 00:00<?, ? examples/s
Traceback (most recent call last):
rank0:   File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 61, in <module>
rank0:   File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 57, in main
rank0:     tuned_model = finetuner.tune(model=model, dataset=dataset)
rank0:   File "/data/mnt/LMFlow-20240514/src/lmflow/pipeline/finetuner.py", line 237, in tune
rank0:     tokenized_dataset = model.tokenize(dataset)
rank0:   File "/data/mnt/LMFlow-20240514/src/lmflow/models/hf_decoder_model.py", line 622, in tokenize
rank0:     tokenized_datasets = raw_datasets.map(
rank0:   File "/data/mnt/LMFlow-20240514/src/lmflow/datasets/dataset.py", line 371, in map
rank0:     mapped_backend_dataset = self.backend_dataset.map(*args, **kwargs)
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
rank0:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
rank0:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3189, in map
rank0:     for rank, done, content in iflatmap_unordered(
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in iflatmap_unordered
rank0:     [async_result.get(timeout=0.05) for async_result in async_results]
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in <listcomp>
rank0:     [async_result.get(timeout=0.05) for async_result in async_results]
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
rank0:     raise self._value
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/connection.py", line 214, in send
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/reduction.py", line 54, in dumps
rank0:     cls(buf, protocol, *args, **kwds).dump(obj)
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 498, in dump
rank0:     StockPickler.dump(self, obj)
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 487, in dump
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
rank0:     f(self, obj) # Call unbound method with explicit self
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
rank0:     f(self, obj) # Call unbound method with explicit self
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 886, in save_tuple
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
rank0:     f(self, obj) # Call unbound method with explicit self
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
rank0:     StockPickler.save_dict(pickler, obj)
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 971, in save_dict
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 997, in _batch_setitems
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
rank0:     f(self, obj) # Call unbound method with explicit self
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1493, in save_function
rank0:     pickler.save_reduce(_create_function, (obj.__code__,
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 692, in save_reduce
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
rank0:     f(self, obj) # Call unbound method with explicit self
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
rank0:     f(self, obj) # Call unbound method with explicit self
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
rank0:     f(self, obj) # Call unbound method with explicit self
rank0:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1226, in save_cell
rank0:     f = obj.cell_contents
rank0: ValueError: Cell is empty
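A note on the last frame, for anyone hitting the same message: `ValueError: Cell is empty` is what Python raises when a closure cell is read before the variable it holds has been assigned. The sketch below reproduces only that behaviour in plain Python. It is not LMFlow code, `make_fn` and `prefix` are made-up names, and the thread does not say whether this is the exact pattern behind the bug.

```python
def make_fn(use_prefix=False):
    # 'prefix' is assigned on one branch only, so its closure cell may stay empty.
    if use_prefix:
        prefix = ">> "

    def fn(text):
        return (prefix + text) if use_prefix else text

    return fn


fn = make_fn(use_prefix=False)

# Calling the function works, because the empty cell is never read.
print(fn("hello"))

# Reading every closure cell, as a by-value pickler has to, hits the empty one.
errors = []
for name, cell in zip(fn.__code__.co_freevars, fn.__closure__):
    try:
        cell.cell_contents
    except ValueError as exc:
        errors.append((name, str(exc)))
print(errors)
```

This would explain why a function can run fine in-process yet fail as soon as something tries to serialize it.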

wheresmyhair commented 1 month ago

Thanks for your interest in LMFlow! Could you please provide your .sh script? Also, what kind of dataset are you using?

nicosouth commented 1 month ago

OK, this is my script. I just added "--preprocessing_num_workers 4".

"""""""""
model_name_or_path=/home/llm/model/Qwen1.5-1.8B
dataset_path=/home/llm/data/text_test/
output_dir=/home/llm/model/output_models/finetune
conversation_template=empty
trust_remote_code=True

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

deepspeed --include="localhost:5" --master_port=11999 \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} \
    --conversation_template ${conversation_template} \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero0.json \
    --bf16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    --preprocessing_num_workers 4 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
"""""""""

I use the ShuSheng dataset, converted into the format required by LMFlow.

thank you!

wheresmyhair commented 1 month ago

What's the type of that dataset: text_only, text2text, or conversation?

nicosouth commented 1 month ago

it's text_only.

wheresmyhair commented 1 month ago

We have reproduced this bug and are working on a fix. For now, please fine-tune with --preprocessing_num_workers 1. Sorry for the inconvenience 🙏 If you have any other questions, please feel free to leave a comment.
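Some context on why a single worker may sidestep the error: judging from the traceback, the failure happens while the mapped function is being pickled to send to worker processes, a step that is not needed when everything runs in one process. The sketch below is a toy model of that difference, not the `datasets` or LMFlow implementation. `toy_map`, `read_closure`, and `make_tokenize_fn` are hypothetical names, and the empty-cell function is an assumed stand-in for whatever triggers the bug.

```python
def read_closure(fn):
    """Toy stand-in for by-value serialization: read every closure cell."""
    return [cell.cell_contents for cell in (fn.__closure__ or ())]


def toy_map(fn, rows, num_proc=1):
    """Hypothetical map(): only the multi-worker path has to serialize fn."""
    if num_proc > 1:
        read_closure(fn)  # raises ValueError if any cell is empty
    return [fn(row) for row in rows]


def make_tokenize_fn(add_eos=False):
    if add_eos:
        eos = "</s>"  # left unassigned when add_eos is False

    def tokenize(text):
        return (text + eos) if add_eos else text

    return tokenize


tokenize = make_tokenize_fn()
single_worker = toy_map(tokenize, ["a", "b"], num_proc=1)  # runs in-process

try:
    toy_map(tokenize, ["a", "b"], num_proc=4)
    multi_worker_failed = False
except ValueError:
    multi_worker_failed = True
```

In this toy version the single-worker call returns normally and only the multi-worker call raises, which matches the behaviour reported in this issue.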

nicosouth commented 1 month ago

thank you for your contributions

wheresmyhair commented 1 month ago

FYI: We've located the bug, and the dev team needs to do a small-scale refactoring to fix it. We will do this ASAP. Sorry for the inconvenience 🙏

wheresmyhair commented 4 weeks ago

FYI: Bug fixed, please see https://github.com/OptimalScale/LMFlow/pull/845 🤗