THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型
Other
15.65k stars 1.85k forks source link

[BUG/Help] <title>按照官方给出的多轮问答数据集构建问答数据之后,运行脚本命令出现Traceback (most recent call last): File "/mnt/ChatGLM2-6B/ptuning/main.py", line 411, in <module> main() File "/mnt/ChatGLM2-6B/ptuning/main.py", line 229, in main train_dataset = train_dataset.map( File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3180, in map with Pool(len(kwargs_per_job)) as pool: #666

Open nevesaynever1 opened 5 months ago

nevesaynever1 commented 5 months ago

Is there an existing issue for this?

Current Behavior

数据集构建格式为:{"content": "长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "summaty": "用电脑能读数据流吗?水温多少", "history": []} {"content": "95", "summaty": "上下水管温差怎么样啊?空气是不是都排干净了呢?", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"]]} {"content": "是的。上下水管都好的", "summaty": "那就要检查线路了,一般风扇继电器是由电脑控制吸合的,如果电路存在断路,或者电脑坏了的话会出现继电器不吸合的情况!", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"], ["95", "上下水管温差怎么样啊?空气是不是都排干净了呢?"]]} {"content": "你好", "summaty": "你好", "history": []} {"content": "请问你是谁?", "summaty": "我是chatGLm2-6B", "history": [["你好", "你好"]]} {"content": "你是一个优秀的人工智能助手吗?", "summaty": "是的,我是。", "history": [["你好", "你好"], ["请问你是谁?","我是chatGLm2-6B"]]}

train.sh脚本文件为: PRE_SEQ_LEN=128 LR=1e-2

CUDA_VISIBLE_DEVICES=0 python main.py \ --do_train \ --train_file train.json \ --validation_file dev.json \ --preprocessing_num_workers 10 \ --prompt_column content \ --response_column summary \ --history_column history \ --overwrite_cache \ --model_name_or_path THUDM/chatglm2-6b \ --output_dir ../output \ --overwrite_output_dir \ --max_source_length 256 \ --max_target_length 256 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 16 \ --predict_with_generate \ --max_steps 3000 \ --logging_steps 10 \ --save_steps 1000 \ --learning_rate $LR \ --pre_seq_len $PRE_SEQ_LEN \

运行上述脚本出现下述错误: Traceback (most recent call last): File "/mnt/ChatGLM2-6B/ptuning/main.py", line 411, in main() File "/mnt/ChatGLM2-6B/ptuning/main.py", line 229, in main train_dataset = train_dataset.map( File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, *kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, args, **kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3180, in map with Pool(len(kwargs_per_job)) as pool: 时数据格式的原因吗,排查了一下,和官方构建的数据格式是一样的。

Expected Behavior

No response

Steps To Reproduce

数据集构建格式为:{"content": "长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "summaty": "用电脑能读数据流吗?水温多少", "history": []} {"content": "95", "summaty": "上下水管温差怎么样啊?空气是不是都排干净了呢?", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"]]} {"content": "是的。上下水管都好的", "summaty": "那就要检查线路了,一般风扇继电器是由电脑控制吸合的,如果电路存在断路,或者电脑坏了的话会出现继电器不吸合的情况!", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"], ["95", "上下水管温差怎么样啊?空气是不是都排干净了呢?"]]} {"content": "你好", "summaty": "你好", "history": []} {"content": "请问你是谁?", "summaty": "我是chatGLm2-6B", "history": [["你好", "你好"]]} {"content": "你是一个优秀的人工智能助手吗?", "summaty": "是的,我是。", "history": [["你好", "你好"], ["请问你是谁?","我是chatGLm2-6B"]]}

train.sh脚本文件为: PRE_SEQ_LEN=128 LR=1e-2

CUDA_VISIBLE_DEVICES=0 python main.py \ --do_train \ --train_file train.json \ --validation_file dev.json \ --preprocessing_num_workers 10 \ --prompt_column content \ --response_column summary \ --history_column history \ --overwrite_cache \ --model_name_or_path THUDM/chatglm2-6b \ --output_dir ../output \ --overwrite_output_dir \ --max_source_length 256 \ --max_target_length 256 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 16 \ --predict_with_generate \ --max_steps 3000 \ --logging_steps 10 \ --save_steps 1000 \ --learning_rate $LR \ --pre_seq_len $PRE_SEQ_LEN \

运行上述脚本出现下述错误: Traceback (most recent call last): File "/mnt/ChatGLM2-6B/ptuning/main.py", line 411, in main() File "/mnt/ChatGLM2-6B/ptuning/main.py", line 229, in main train_dataset = train_dataset.map( File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, *kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, args, **kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3180, in map with Pool(len(kwargs_per_job)) as pool: 时数据格式的原因吗,排查了一下,和官方构建的数据格式是一样的。

Environment

- OS: centos
- Python:3.9
- Transformers:4.30.2
- PyTorch:2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : CU118

Anything else?

No response