THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型

[BUG/Help] Training fails with AttributeError: 'ChatGLMTokenizer' object has no attribute 'build_prompt' #1384

Open ghost opened 11 months ago

ghost commented 11 months ago

Is there an existing issue for this?

Current Behavior

The following error is raised during training:

Traceback (most recent call last):
  File "E:\space_code\ChatGLM26Benv1\ptuning\main.py", line 411, in <module>
    main()
  File "E:\space_code\ChatGLM26Benv1\ptuning\main.py", line 229, in main
    train_dataset = train_dataset.map(
  File "D:\Users\Admin\miniconda3\envs\python39\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\Users\Admin\miniconda3\envs\python39\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\Users\Admin\miniconda3\envs\python39\lib\site-packages\datasets\arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "D:\Users\Admin\miniconda3\envs\python39\lib\site-packages\datasets\utils\py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "D:\Users\Admin\miniconda3\envs\python39\lib\site-packages\datasets\utils\py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "D:\Users\Admin\miniconda3\envs\python39\lib\site-packages\multiprocess\pool.py", line 771, in get
    raise self._value
AttributeError: 'ChatGLMTokenizer' object has no attribute 'build_prompt'

Any guidance would be appreciated.

Expected Behavior

No response

Steps To Reproduce

1. On Windows, I modified train.sh as follows:

python main.py ^
    --do_train ^
    --train_file data/train.json  ^
    --validation_file data/dev.json  ^
    --preprocessing_num_workers 10  ^
    --prompt_column content  ^
    --response_column summary  ^
    --overwrite_cache  ^
    --model_name_or_path E:\\space_code\\ChatGLM26Benv1\\chatglm-6b  ^
    --output_dir output/adgen-chatglm2-6b-pt-128-$LR  ^
    --overwrite_output_dir ^
    --max_source_length 64 ^
    --max_target_length 128 ^
    --per_device_train_batch_size 1 ^
    --per_device_eval_batch_size 1 ^
    --gradient_accumulation_steps 16 ^
    --predict_with_generate ^
    --max_steps 3000 ^
    --logging_steps 10 ^
    --save_steps 1000 ^
    --learning_rate 2e-2 ^
    --pre_seq_len 128 ^
    --quantization_bit 4

2. train.json and dev.json contain the following lines:

{"content": "今天新疆气温", "summary": "这是测试数据,新疆今天10度"}
{"content": "昨天新疆气温", "summary": "这是测试数据,新疆昨天20度"}

Environment

- OS: Windows
- Python: 3.9.17
- Transformers: 4.30.2
- PyTorch: 2.0.0+cu118
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True

Anything else?

No response

ng-fukgin commented 10 months ago

Has this worked for you before? If it has, try updating tokenization_chatglm.py. If this is your first run, it is very likely a problem with the downloaded model files; try re-downloading the model.
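
A quick way to narrow this down is to check which tokenization_chatglm.py file is actually being loaded and whether it defines the build_prompt method that the ptuning preprocessing calls (the traceback above reports it as missing). This is only a rough diagnostic sketch; the model path is taken from the report above, adjust it to your local checkpoint directory.

# Rough diagnostic sketch: which tokenization_chatglm.py is loaded,
# and does it expose build_prompt? (Path taken from the report above.)
import inspect

from transformers import AutoTokenizer

MODEL_DIR = r"E:\space_code\ChatGLM26Benv1\chatglm-6b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)

# Shows which tokenization_chatglm.py file the tokenizer class came from.
print(inspect.getfile(type(tokenizer)))

# False here means the local tokenizer code is outdated (or belongs to a
# model version whose tokenizer simply has no build_prompt).
print(hasattr(tokenizer, "build_prompt"))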

chenyihan0115 commented 7 months ago

Has this been resolved? I'm hitting the same problem: AttributeError: 'ChatGLMTokenizer' object has no attribute 'build_prompt'

ttj666 commented 2 months ago

When I run cli_demo.py I get the following error: AttributeError: 'ChatGLMTokenizer' object has no attribute 'sp_tokenizer'. Did you mean: '_tokenize'? I also re-downloaded the model and it still happens. What could be the cause?

ng-fukgin commented 2 months ago

When I run cli_demo.py I get the following error: AttributeError: 'ChatGLMTokenizer' object has no attribute 'sp_tokenizer'. Did you mean: '_tokenize'? I also re-downloaded the model and it still happens. What could be the cause?

It may be a transformers version problem; try pip install transformers==4.33.0
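
After reinstalling, it may also be worth confirming which version is actually active in the environment that runs cli_demo.py (trivial check, just to rule out a stale or mixed-up environment):

# Should print 4.33.0 after the downgrade; a different number means
# cli_demo.py is running in a different environment than the one you pinned.
import transformers

print(transformers.__version__)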

WangYangfan commented 2 months ago

Resolved! The problem occurs because self.sp_tokenizer is set after super().__init__() is called. Specifically, the error trace shows that super().__init__() calls the parent class's _add_tokens method, which in turn calls self.get_vocab. get_vocab is overridden in the subclass ChatGLMTokenizer and uses self.sp_tokenizer, but at that point self.sp_tokenizer has not yet been defined.

The fix is to set self.sp_tokenizer before calling super().__init__().

before

class ChatGLMTokenizer(PreTrainedTokenizer):
    ...
    def __init__(...) -> None:
        super().__init__(...)
        ...
        self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)

after

class ChatGLMTokenizer(PreTrainedTokenizer):
    ...
    def __init__(...) -> None:
        self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
        super().__init__(...)
        ...
        # self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)

This error does not occur with transformers==4.33.0, but it does with the latest release, 4.40.2; the difference comes from updates to the PreTrainedTokenizer base class.
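
The same pitfall can be reproduced outside transformers with a minimal, self-contained sketch (Base, BrokenChild and FixedChild are hypothetical classes, not the actual library code): the parent constructor calls an overridden method that relies on an attribute the subclass only sets afterwards.

# Minimal sketch of the initialization-order pitfall described above.
class Base:
    def __init__(self):
        # The parent constructor calls a method that subclasses may override,
        # mirroring super().__init__() -> _add_tokens -> get_vocab above.
        self.vocab = self.get_vocab()

    def get_vocab(self):
        return {}

class BrokenChild(Base):
    def __init__(self):
        super().__init__()              # get_vocab() runs here ...
        self.sp_tokenizer = {"a": 1}    # ... but the attribute is only set afterwards

    def get_vocab(self):
        return self.sp_tokenizer        # AttributeError during super().__init__()

class FixedChild(Base):
    def __init__(self):
        self.sp_tokenizer = {"a": 1}    # set the attribute first
        super().__init__()              # the overridden get_vocab() can now use it

    def get_vocab(self):
        return self.sp_tokenizer

FixedChild()                            # works
try:
    BrokenChild()
except AttributeError as e:
    print(e)                            # 'BrokenChild' object has no attribute 'sp_tokenizer'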


More details: https://zhuanlan.zhihu.com/p/697342575