PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
12.01k stars 2.93k forks

[Bug]: Problem running the tutorial; the utc-base weight file download URL is wrong #4816

Closed Han-YLun closed 1 year ago

Han-YLun commented 1 year ago

Software environment

- paddlepaddle: 2.4.1
- paddlenlp: 2.5.0

https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification#%E6%A8%A1%E5%9E%8B%E5%BE%AE%E8%B0%83

Running this command fails:

python run_train.py  \
    --device cpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 1000 \
    --model_name_or_path utc-base \
    --output_dir ./checkpoint/model_best \
    --dataset_path ./data/ \
    --max_seq_length 512  \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model macro_f1 \
    --load_best_model_at_end  True \
    --save_total_limit 1 \
    --save_plm

/Users/arvinyl/.asdf/installs/python/3.10.6/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[2023-02-15 19:50:15,085] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2023-02-15 19:50:15,085] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-02-15 19:50:15,085] [    INFO] - ============================================================
[2023-02-15 19:50:15,085] [    INFO] -      Model Configuration Arguments
[2023-02-15 19:50:15,085] [    INFO] - paddle commit id              :4743cc8b9a8d77ea47e08a42a16246b538bda56f
[2023-02-15 19:50:15,085] [    INFO] - export_model_dir              :./checkpoint/model_best
[2023-02-15 19:50:15,085] [    INFO] - export_type                   :paddle
[2023-02-15 19:50:15,085] [    INFO] - model_name_or_path            :utc-base
[2023-02-15 19:50:15,085] [    INFO] -
[2023-02-15 19:50:15,085] [    INFO] - ============================================================
[2023-02-15 19:50:15,085] [    INFO] -       Data Configuration Arguments
[2023-02-15 19:50:15,085] [    INFO] - paddle commit id              :4743cc8b9a8d77ea47e08a42a16246b538bda56f
[2023-02-15 19:50:15,085] [    INFO] - dataset_path                  :./data/
[2023-02-15 19:50:15,085] [    INFO] - dev_file                      :dev.txt
[2023-02-15 19:50:15,085] [    INFO] - threshold                     :0.5
[2023-02-15 19:50:15,085] [    INFO] - train_file                    :train.txt
[2023-02-15 19:50:15,085] [    INFO] -
[2023-02-15 19:50:15,085] [    INFO] - Downloading tokenizer_config.json from https://bj.bcebos.com/paddlenlp/models/community//utc-base/tokenizer_config.json
[2023-02-15 19:50:15,253] [   ERROR] - Downloading from https://bj.bcebos.com/paddlenlp/models/community//utc-base/tokenizer_config.json failed with code 404!
Traceback (most recent call last):
  File "/Users/arvinyl/.asdf/installs/python/3.10.6/lib/python3.10/site-packages/paddlenlp/transformers/auto/tokenizer.py", line 323, in from_pretrained
    resolved_vocab_file = get_path_from_url(community_config_path, default_root)
  File "/Users/arvinyl/.asdf/installs/python/3.10.6/lib/python3.10/site-packages/paddlenlp/utils/downloader.py", line 157, in get_path_from_url
    fullpath = _download(url, root_dir, md5sum)
  File "/Users/arvinyl/.asdf/installs/python/3.10.6/lib/python3.10/site-packages/paddlenlp/utils/downloader.py", line 219, in _download
    raise RuntimeError("Downloading from {} failed with code " "{}!".format(url, req.status_code))
RuntimeError: Downloading from https://bj.bcebos.com/paddlenlp/models/community//utc-base/tokenizer_config.json failed with code 404!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/arvinyl/Projects/PaddleNLP/applications/zero_shot_text_classification/run_train.py", line 142, in <module>
    main()
  File "/Users/arvinyl/Projects/PaddleNLP/applications/zero_shot_text_classification/run_train.py", line 66, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
  File "/Users/arvinyl/.asdf/installs/python/3.10.6/lib/python3.10/site-packages/paddlenlp/transformers/auto/tokenizer.py", line 326, in from_pretrained
    raise RuntimeError(
RuntimeError: Can't load tokenizer for 'utc-base'.
Please make sure that 'utc-base' is:
- a correct model-identifier of built-in pretrained models,
- or a correct model-identifier of community-contributed pretrained models,
- or the correct path to a directory containing relevant tokenizer files.
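
The traceback shows the run failing at `AutoTokenizer.from_pretrained` with a `RuntimeError`, before any training starts. A minimal sketch of a pre-flight check that surfaces this failure on its own is given here. The helper name `can_load_tokenizer` and its injected `loader` parameter are hypothetical and only for illustration; the sole paddlenlp call assumed is the `AutoTokenizer.from_pretrained` seen in the traceback.

```python
from typing import Callable, Tuple


def can_load_tokenizer(name: str, loader: Callable[[str], object]) -> Tuple[bool, str]:
    """Try to load a tokenizer by name and report failure instead of crashing.

    `loader` is injected so the check can be exercised without paddlenlp;
    in real use it would be AutoTokenizer.from_pretrained.
    """
    try:
        loader(name)
    except RuntimeError as exc:  # the exception type shown in the traceback
        return False, str(exc)
    return True, ""


# Intended use (requires paddlenlp, so it is left as a comment):
#
#   from paddlenlp.transformers import AutoTokenizer
#   ok, reason = can_load_tokenizer("utc-base", AutoTokenizer.from_pretrained)
#   if not ok:
#       raise SystemExit(f"Fix the model name or the install first:\n{reason}")
```

Running such a check before `run_train.py` separates "the model name cannot be resolved by this paddlenlp install" from problems in the training arguments themselves.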

### Duplicate issues

- [X] I have searched the existing issues

LemonNoel commented 1 year ago

utc-base has not been released yet; you need to install the develop version to use it. Command:

pip install --pre paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
Han-YLun commented 1 year ago

Installing the develop version gives me this error:

ModuleNotFoundError: No module named 'paddlenlp.transformers.mt5'
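
The thread does not establish why `paddlenlp.transformers.mt5` is missing. One possibility, stated here purely as an assumption, is an incomplete or mixed install. A minimal sketch for collecting the two facts that would help narrow it down, using only the Python standard library, is shown here; the helper names `installed_version` and `module_available` are hypothetical.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_available(dotted_name: str) -> bool:
    """Return True if the dotted module name can be located.

    find_spec imports the parent packages first, so a parent that is missing
    (or that itself fails with ModuleNotFoundError) is reported as False.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        return False


# Intended use in the environment from this issue (left as a comment):
#
#   print("paddlenlp version:", installed_version("paddlenlp"))
#   print("mt5 present:", module_available("paddlenlp.transformers.mt5"))
```

Posting both values alongside the error makes it easier for a maintainer to tell whether the develop build was actually the one being imported.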
chinesejunzai12 commented 1 year ago

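I'm hitting the same problem. I noticed this while the model was downloading: [image]. The path looks wrong, which is probably what causes the download error.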

Han-YLun commented 1 year ago

@chinesejunzai12 My path was wrong at first too. I ran this:

pip install --pre paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html

After it finished, I got:

ModuleNotFoundError: No module named 'paddlenlp.transformers.mt5'
chinesejunzai12 commented 1 year ago

Has the problem been solved yet?

Han-YLun commented 1 year ago

No. I'm now getting that second error (the one shown below the pip command), and I can't resolve it.

Han-YLun commented 1 year ago

@LemonNoel Could you take a look at this for me?

chinesejunzai12 commented 1 year ago

I've only just started using this too and am also looking for a solution; it isn't solved yet.

joebnb commented 1 year ago

The latest Docker image they built themselves has the same problem.

image

Han-YLun commented 1 year ago

That doesn't look like the same problem. In your case, isn't paddlenlp just not installed?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.