PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: Beginner asking for help with a text_summarization error #4645

Closed KenPanda closed 1 year ago

KenPanda commented 1 year ago

Please describe your question

I'm a Go programmer currently learning Python. I want to train a supervised summarization model myself and chose text_summarization, but after setting it up and running it as described in the docs, I get the following error:

unset CUDA_VISIBLE_DEVICES

py -m paddle.distributed.launch --gpus "0" train.py \
    --model_name_or_path=IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \
    --train_file data/train.json \
    --eval_file data/test.json \
    --output_dir pegasus_out \
    --max_source_length 128 \
    --max_target_length 64 \
    --epoch 20 \
    --logging_steps 1 \
    --save_steps 10000 \
    --train_batch_size 128 \
    --eval_batch_size 128 \
    --learning_rate 5e-5 \
    --warmup_proportion 0.02 \
    --weight_decay=0.01 \
    --device=gpu

Traceback (most recent call last):
  File "C:\Users\G\Desktop\PaddleNLP-develop\applications\text_summarization\pegasus\train.py", line 296, in <module>
    args = parse_args()
  File "C:\Users\G\Desktop\PaddleNLP-develop\applications\text_summarization\pegasus\train.py", line 130, in parse_args
    parser.add_argument("--use_SSTIA", action="store_true", type=bool, help="Whether to use SSTIA.")
  File "C:\Users\G\AppData\Local\Programs\Python\Python310\lib\argparse.py", line 1423, in add_argument
    action = action_class(**kwargs)
TypeError: _StoreTrueAction.__init__() got an unexpected keyword argument 'type'

gongel commented 1 year ago

Hi, change this line https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_summarization/pegasus/train.py#L130 to:

parser.add_argument("--use_SSTIA", action="store_true", help="Whether to use SSTIA.")
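As a minimal standard-library illustration (not code from this repo) of why the original line failed: a store_true action always stores the constant True, so argparse rejects the extra type= keyword; dropping it gives the intended flag behavior.

import argparse

parser = argparse.ArgumentParser()
# Adding type=bool alongside action="store_true" raises
# TypeError: _StoreTrueAction.__init__() got an unexpected keyword argument 'type'.
# A store_true flag is already boolean, so no type= is needed.
parser.add_argument("--use_SSTIA", action="store_true", help="Whether to use SSTIA.")

print(parser.parse_args([]).use_SSTIA)                # False
print(parser.parse_args(["--use_SSTIA"]).use_SSTIA)   # True
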
gongel commented 1 year ago

pin @LazyFyh

KenPanda commented 1 year ago

@gongel After making the change you suggested, it still errors out, as follows:

$ unset CUDA_VISIBLE_DEVICES
python -m paddle.distributed.launch --gpus "0" train.py \
    --model_name_or_path=Randeng-Pegasus-238M-Summary-Chinese \
    --train_file train.json \
    --eval_file test.json \
    --output_dir pegasus_out \
    --max_source_length 128 \
    --max_target_length 64 \
    --epoch 20 \
    --logging_steps 1 \
    --save_steps 10000 \
    --train_batch_size 128 \
    --eval_batch_size 128 \
    --learning_rate 5e-5 \
    --warmup_proportion 0.02 \
    --weight_decay=0.01 \
    --device=gpu

Traceback (most recent call last):
  File "C:\Users\G\Desktop\PaddleNLP-develop\applications\text_summarization\pegasus\train.py", line 298, in <module>
    do_train(args)
  File "C:\Users\G\Desktop\PaddleNLP-develop\applications\text_summarization\pegasus\train.py", line 199, in do_train
    model = PegasusForConditionalGeneration.from_pretrained(args.model_name_or_path)
  File "C:\Users\G\.conda\envs\beta1\lib\site-packages\paddlenlp\transformers\model_utils.py", line 537, in from_pretrained
    resolved_resource_files[file_id] = get_path_from_url_with_filelock(file_path, default_root)
  File "C:\Users\G\.conda\envs\beta1\lib\site-packages\paddlenlp\utils\downloader.py", line 192, in get_path_from_url_with_filelock
    result = get_path_from_url(url=url, root_dir=root_dir, md5sum=md5sum, check_exist=check_exist)
  File "C:\Users\G\.conda\envs\beta1\lib\site-packages\paddlenlp\utils\downloader.py", line 150, in get_path_from_url
    assert is_url(url), "downloading from {} not a url".format(url)
AssertionError: downloading from Randeng-Pegasus-238M-Summary-Chinese\model_state.pdparams not a url

gongel commented 1 year ago

model_name_or_path is wrong

KenPanda commented 1 year ago

@gongel Can the other errors be ignored? I've changed model_name_or_path many times and keep getting different errors. Could you give me a hint on how it should be set?

gongel commented 1 year ago

Change to: --model_name_or_path=IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese
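For reference, a rough sketch of the difference, based on the resolution behavior visible in the tracebacks and logs in this thread rather than on the library docs:

from paddlenlp.transformers import PegasusForConditionalGeneration

# The full community id is resolved and downloaded to the local cache
# (e.g. C:\Users\<user>\.paddlenlp\models) on first use, as the later log output shows.
model = PegasusForConditionalGeneration.from_pretrained(
    "IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"
)

# A bare name such as "Randeng-Pegasus-238M-Summary-Chinese" is not a known id,
# so it is treated as a local path and fails with
# "downloading from Randeng-Pegasus-238M-Summary-Chinese\model_state.pdparams not a url"
# unless that directory really contains model_state.pdparams and model_config.json.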

KenPanda commented 1 year ago

@gongel The setting you suggested, which is also the one in the official docs, doesn't work for me. I'm on Windows and am still experimenting.

KenPanda commented 1 year ago

@gongel I've tried about ten different settings, but output_dir always fails with this kind of error:

$ unset CUDA_VISIBLE_DEVICES==0
python -m paddle.distributed.launch --gpus "0" train.py \
    --model_name_or_path=IDEA-CCNL\Randeng-Pegasus-238M-Summary-Chinese\
    --train_file train.json \
    --eval_file test.json \
    --output_dir pegasus_out \
    --max_source_length 128 \
    --max_target_length 64 \
    --epoch 20 \
    --logging_steps 1 \
    --save_steps 10000 \
    --train_batch_size 128 \
    --eval_batch_size 128 \
    --learning_rate 5e-5 \
    --warmup_proportion 0.02 \
    --weight_decay=0.01 \
    --device=gpu

usage: train.py [-h] [--model_name_or_path MODEL_NAME_OR_PATH] [--train_file TRAIN_FILE]
                [--eval_file EVAL_FILE] --output_dir OUTPUT_DIR
                [--max_source_length MAX_SOURCE_LENGTH] [--min_target_length MIN_TARGET_LENGTH]
                [--max_target_length MAX_TARGET_LENGTH] [--learning_rate LEARNING_RATE]
                [--epoch EPOCH] [--logging_steps LOGGING_STEPS] [--save_steps SAVE_STEPS]
                [--train_batch_size TRAIN_BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE]
                [--weight_decay WEIGHT_DECAY] [--warmup_steps WARMUP_STEPS]
                [--warmup_proportion WARMUP_PROPORTION] [--adam_epsilon ADAM_EPSILON]
                [--max_steps MAX_STEPS] [--seed SEED] [--device {cpu,gpu,xpu}]
                [--use_amp USE_AMP] [--scale_loss SCALE_LOSS] [--use_SSTIA]
                [--mix_ratio MIX_RATIO]
train.py: error: the following arguments are required: --output_dir

LazyFyh commented 1 year ago

Hi, change this line https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_summarization/pegasus/train.py#L130 to:

parser.add_argument("--use_SSTIA", action="store_true", help="Whether to use SSTIA.")

This bug has already been fixed: https://github.com/PaddlePaddle/PaddleNLP/pull/4646

gongel commented 1 year ago

Hi, on the Windows command line you need to run it like this (everything on one line, without the backslash continuations): python -m paddle.distributed.launch --gpus "0" train.py --train_file train.json --eval_file test.json ...

KenPanda commented 1 year ago

@gongel Thank you. I looked up and learned a lot of Python syntax and had already worked out one way to write it; your way is more concise, and both work. Training now runs fine on CPU, but on GPU the video memory is almost completely filled right away, and after a few seconds the job exits on its own with the following output:

$ unset CUDA_VISIBLE_DEVICES

python -m paddle.distributed.launch --gpus "0" train.py --model_name_or_path=IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese --train_file data/train.json --eval_file data/test.json --output_dir pegasus_out --max_source_length 128 --max_target_length 64 --epoch 20 --logging_steps 1 --save_steps 10000 --train_batch_size 128 --eval_batch_size 128 --learning_rate 5e-5 --warmup_proportion 0.02 --weight_decay=0.01 --device=gpu

[2023-02-06 01:48:42,008] [ WARNING] arrow_dataset.py:3036 - Loading cached processed dataset at C:\Users\G\.cache\huggingface\datasets\json\default-b7260c8ec883c6c8\0.0.0\0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51\cache-a7cae6a9d7e8c899.arrow
[2023-02-06 01:48:42,432] [ WARNING] arrow_dataset.py:3036 - Loading cached processed dataset at C:\Users\G\.cache\huggingface\datasets\json\default-fbdafb0f6261405c\0.0.0\0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51\cache-328d667dfd908829.arrow
[2023-02-06 01:48:42,434] [ INFO] - Already cached C:\Users\G\.paddlenlp\models\IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese\model_state.pdparams
[2023-02-06 01:48:42,434] [ INFO] - Already cached C:\Users\G\.paddlenlp\models\IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese\model_config.json
W0206 01:48:42.436866 10792 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.0, Runtime API Version: 10.2
W0206 01:48:42.503891 10792 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.

gongel commented 1 year ago

Try lowering train_batch_size and eval_batch_size.

KenPanda commented 1 year ago

@gongel Thanks, I tried that early on. Even with both lowered to 2 or 1 the problem remains. Please advise!

KenPanda commented 1 year ago

@gongel Sometimes it manages to finish 1 step, sometimes it can't even complete 1 step.

KenPanda commented 1 year ago

@gongel Training on CPU currently works fine.

gongel commented 1 year ago

How much GPU memory do you have? The GPU memory may be too small.

KenPanda commented 1 year ago

@gongel The GPU has 6 GB of memory. About 10 seconds after starting, usage climbs to 5.7 or 5.8 GB, holds there for a few seconds, and then the job exits with the output above. I tried lowering every parameter I could to cut memory use, with no effect, and couldn't delay the crash either. I also tried more than ten releases; on some versions before 2.4.x, memory would sit fixed at around 4.x GB with no useful output and no training steps at all.

gongel commented 1 year ago

It won't run on 6 GB; ideally you need 16 GB or more.

KenPanda commented 1 year ago

@gongel So that means a 1660 Ti can't run it, even with shared memory and the like? Also, does PaddleNLP support multi-core CPU computation? How should that be set up? Could you give some pointers?

gongel commented 1 year ago

Hi, a 1660 Ti with 6 GB can't run it. You can apply for a free large GPU on AI Studio (https://aistudio.baidu.com/aistudio/index), or try a GPU with more memory, or train on CPU.
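On the multi-core CPU question, a rough sketch rather than an official PaddleNLP switch: Paddle's CPU kernels are threaded through OpenMP/oneDNN, so the standard OMP_NUM_THREADS environment variable is the usual knob; the thread count below is only an example value, not something verified in this thread.

import os

# Standard OpenMP variable; set it before importing paddle so the math
# libraries pick it up. "8" is just an example thread count.
os.environ.setdefault("OMP_NUM_THREADS", "8")

import paddle

paddle.set_device("cpu")  # same effect as passing --device=cpu to train.py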

Jason916 commented 1 year ago

https://aistudio.baidu.com/aistudio/index

Would it run with 12 GB of GPU memory?