QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by the Qwen team at Alibaba Cloud.

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected). #388

Closed: sjlmg closed this issue 6 months ago

sjlmg commented 6 months ago

I got this error after adding my own dataset. I checked the code, and these settings are already in place. How should I handle this?

jklj077 commented 6 months ago

Hi! Can you share a reproduction script? It appears that the preprocessing function does not work as expected.

sjlmg commented 6 months ago

I haven't changed the script; I'm using the project's finetune.py. I changed my dataset to the following format:

{"type": "chatml", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What locates organ or tissue function (The specific role or task performed by a distinct part of an organism's body (organ) or a group of similar cells working together (tissue) to maintain the overall functioning of the organism.)?"}, {"role": "assistant", "content": "gene or genome"}], "source": "unknown"}

This is the finetune.sh:

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training).
# Please set the options below according to the comments.
# For multi-gpu workers training, these options should be manually set for each worker.
# After setting the options, please run the script on each worker.

# Number of GPUs per GPU worker
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')

# Number of GPU workers, for single-worker training, please set to 1
NNODES=${NNODES:-1}

# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
NODE_RANK=${NODE_RANK:-0}

# The ip address of the rank-0 worker, for single-worker training, please set to localhost
MASTER_ADDR=${MASTER_ADDR:-localhost}

# The port for communication
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="Qwen/Qwen1.5-0.5B-Chat" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/openbayes/home/Qwen1.5/examples/sft/umls/umls_train.jsonl"
DS_CONFIG_PATH="ds_config_zero3.json"
USE_LORA=True
Q_LORA=False
...

The full error output:

(base) root@luojg530428-mrtxmioiq5xj-main:/openbayes/home/Qwen1.5/examples/sft# bash finetune.sh 
[2024-05-09 06:22:31,306] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
/usr/local/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-05-09 06:22:33,738] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-09 06:22:33,738] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/usr/local/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-09 06:22:36,902] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 0.62B
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 30,277,632 || all params: 494,265,344 || trainable%: 6.12578493870693
Loading data...
Formatting inputs...Skip in lazy mode
Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu118/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/site-packages/torch/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/TH -isystem /usr/local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/site-packages/torch/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/TH -isystem /usr/local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 23.607194900512695 seconds
Parameter Offload: Total persistent parameters: 123904 in 121 params
  0%|                                                                                                                                                                                          | 0/3260 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 759, in convert_to_tensors
    tensor = as_tensor(value)
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 721, in as_tensor
    return torch.tensor(value)
ValueError: expected sequence of length 55 at dim 1 (got 59)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "finetune.py", line 378, in <module>
    train()
  File "finetune.py", line 369, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2165, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.8/site-packages/accelerate/data_loader.py", line 454, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer_utils.py", line 808, in __call__
    return self.data_collator(features)
  File "/usr/local/lib/python3.8/site-packages/transformers/data/data_collator.py", line 271, in __call__
    batch = pad_without_fast_tokenizer_warning(
  File "/usr/local/lib/python3.8/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
    padded = tokenizer.pad(*pad_args, **pad_kwargs)
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3355, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 224, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 775, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
  0%|                                                                                                                                                                                          | 0/3260 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 760) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-09_06:23:09
  host      : luojg530428-mrtxmioiq5xj-main
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 760)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
jklj077 commented 6 months ago

Hi, thanks for the error logs!

ValueError: expected sequence of length 55 at dim 1 (got 59)

This is a similar issue to https://github.com/QwenLM/Qwen1.5/issues/362. For now, please try the workaround provided there.
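
For context, the underlying cause is that the collator cannot stack `labels` lists of different lengths (55 vs. 59 here) into one rectangular tensor. The exact workaround from #362 is not reproduced here, but a minimal sketch of the general fix is a collator that right-pads `input_ids` and `labels` per batch (class name and IGNORE_INDEX below are assumptions for illustration):

# Minimal illustrative collator (not the exact workaround from issue #362):
# right-pad input_ids and labels of every example in a batch to the same
# length so torch.tensor() can build a rectangular tensor.
from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import PreTrainedTokenizer

IGNORE_INDEX = -100  # assumed label id ignored by the loss on padded positions


@dataclass
class PadToLongestCollator:  # hypothetical name, for illustration only
    tokenizer: PreTrainedTokenizer

    def __call__(self, features: List[Dict[str, List[int]]]) -> Dict[str, torch.Tensor]:
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, labels, attention_mask = [], [], []
        for f in features:
            pad = max_len - len(f["input_ids"])
            input_ids.append(f["input_ids"] + [self.tokenizer.pad_token_id] * pad)
            labels.append(f["labels"] + [IGNORE_INDEX] * pad)
            attention_mask.append([1] * len(f["input_ids"]) + [0] * pad)
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
        }

# Usage sketch: Trainer(..., data_collator=PadToLongestCollator(tokenizer))

Alternatively, transformers.DataCollatorForSeq2Seq(tokenizer) pads labels as well and can serve the same purpose.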

sjlmg commented 6 months ago

Thank you very much for your reply; it works now!