Open hongWin opened 7 months ago
I'm running into the same problem.
@tomorrow-zy In dbgpt_hub/llm_base/load_tokenizer.py, line 179, change `right` to `left`.
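For context, a minimal sketch of the change being suggested (the actual code around line 179 will differ; the model path below is a placeholder, not taken from the thread):

```python
# Sketch only: force left padding when the tokenizer is loaded
# (dbgpt_hub/llm_base/load_tokenizer.py, around line 179).
from transformers import AutoTokenizer

model_name_or_path = "THUDM/chatglm3-6b"  # placeholder; use your local glm3 checkpoint path

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
)
tokenizer.padding_side = "left"  # was "right"; ChatGLM3's custom _pad() asserts left padding
```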
After training with that change, inference produces `inf` values; I'm not sure whether this is related.
Looking at the code, the author left the comment `# training with left-padded tensors in fp16 precision may cause overflow`. If you have the resources, try fp32, or truncate overly long sequences at `max_length`.
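As an illustration of those two workarounds (not DB-GPT-Hub's actual configuration; model id, flags, and values below are assumptions):

```python
# Sketch only: disable fp16 (train in fp32) and truncate over-long sequences,
# per the suggestion above. All values are illustrative.
from transformers import AutoTokenizer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    fp16=False,          # fp32 avoids the left-padding fp16 overflow the comment warns about
    # bf16=True,         # alternative if the GPU supports bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
tokenizer.padding_side = "left"

encoded = tokenizer(
    "some overly long prompt ...",
    truncation=True,     # cut sequences that exceed max_length
    max_length=1024,     # illustrative cap; pick what your GPU memory allows
)
```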
OK, thanks.
```
04/12/2024 10:26:38 - INFO - dbgpt_hub.llm_base.adapter - Fine-tuning method: LoRA
04/12/2024 10:26:39 - INFO - dbgpt_hub.llm_base.load_tokenizer - trainable params: 15597568 || all params: 6259181568 || trainable%: 0.2492
Running tokenizer on dataset:   0%|          | 0/8659 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\run_sft.py", line 79, in <module>
    start_sft(train_args)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train_api.py", line 43, in start_sft
    sft_train.train(args)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train.py", line 144, in train
    run_sft(
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train.py", line 53, in run_sft
    dataset = preprocess_dataset(dataset, tokenizer, data_args, training_args, "sft")
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\data_process\data_utils.py", line 810, in preprocess_dataset
    dataset = dataset.map(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3105, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3482, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\data_process\data_utils.py", line 664, in preprocess_supervised_dataset
    for source_ids, target_ids in template.encode_multiturn(
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 270, in encode_multiturn
    encoded_pairs = self._encode(tokenizer, system, history)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 321, in _encode
    prefix_ids = self._convert_inputs_to_ids(
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 368, in _convert_inputs_to_ids
    token_ids = token_ids + tokenizer.encode(elem, **kwargs)
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 2600, in encode
    encoded_inputs = self.encode_plus(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3008, in encode_plus
    return self._encode_plus(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils.py", line 722, in _encode_plus
    return self.prepare_for_model(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3487, in prepare_for_model
    encoded_inputs = self.pad(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3292, in pad
    encoded_inputs = self._pad(
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\glm3_Parameter\tokenization_chatglm.py", line 271, in _pad
    assert self.padding_side == "left"
AssertionError
```
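The assertion at the bottom comes from ChatGLM3's custom tokenizer, whose `_pad()` only accepts left padding. A minimal sketch to confirm the root cause and the workaround discussed above (the model id and values are assumptions, not from the thread):

```python
# Illustrative repro of the AssertionError above: right padding trips ChatGLM3's
# _pad() assertion, left padding does not.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

tok.padding_side = "right"
try:
    tok("select 1", padding="max_length", max_length=16)
except AssertionError:
    print("AssertionError: ChatGLM3 requires padding_side == 'left'")

tok.padding_side = "left"  # the workaround suggested earlier in the thread
print(tok("select 1", padding="max_length", max_length=16))
```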