PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

UIE fine-tuning aborts with no obvious error log, Exit code -6 #3869

Closed shuiiiiiimu closed 2 years ago

shuiiiiiimu commented 2 years ago

Environment: Windows WSL2, paddlenlp 2.4.3, paddlepaddle-gpu 2.4.0rc0

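This is an extraction task. I prepared 70 training examples annotated with 6 labels. The steps and parameters all follow model_zoo/uie#4-训练定制 (the custom training section).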

Training on the first 6 examples (head 6) of the sample data works fine.

Training on all 70 examples aborts immediately with no obvious error in the log. The log is as follows:

   .......  other INFO ......
LAUNCH INFO 2022-11-23 14:02:02,328 Pod failed
INFO 2022-11-23 14:02:02,328 controller.py:109] Pod failed
LAUNCH ERROR 2022-11-23 14:02:02,328 Container failed !!!
   .......  other INFO ......
ERROR 2022-11-23 14:02:02,328 controller.py:110] Container failed !!!
   .......  other INFO ......
LAUNCH INFO 2022-11-23 14:02:02,328 ------------------------- ERROR LOG DETAIL -------------------------
INFO 2022-11-23 14:02:02,328 controller.py:111] ------------------------- ERROR LOG DETAIL -------------------------
01:49,570] [    INFO] - remove_unused_columns         :True
[2022-11-23 14:01:49,571] [    INFO] - report_to                     :['visualdl']
[2022-11-23 14:01:49,571] [    INFO] - resume_from_checkpoint        :None
[2022-11-23 14:01:49,571] [    INFO] - round_type                    :round
[2022-11-23 14:01:49,571] [    INFO] - run_name                      :./checkpoint/model_best
[2022-11-23 14:01:49,571] [    INFO] - save_on_each_node             :False
[2022-11-23 14:01:49,571] [    INFO] - save_steps                    :100
[2022-11-23 14:01:49,571] [    INFO] - save_strategy                 :IntervalStrategy.STEPS
[2022-11-23 14:01:49,571] [    INFO] - save_total_limit              :1
[2022-11-23 14:01:49,571] [    INFO] - scale_loss                    :32768
[2022-11-23 14:01:49,571] [    INFO] - seed                          :42
[2022-11-23 14:01:49,571] [    INFO] - sharding                      :[]
[2022-11-23 14:01:49,571] [    INFO] - sharding_degree               :-1
[2022-11-23 14:01:49,571] [    INFO] - should_log                    :True
[2022-11-23 14:01:49,571] [    INFO] - should_save                   :True
[2022-11-23 14:01:49,571] [    INFO] - strategy                      :dynabert+ptq
[2022-11-23 14:01:49,571] [    INFO] - train_batch_size              :16
[2022-11-23 14:01:49,572] [    INFO] - use_pact                      :True
[2022-11-23 14:01:49,572] [    INFO] - warmup_ratio                  :0.1
[2022-11-23 14:01:49,572] [    INFO] - warmup_steps                  :0
[2022-11-23 14:01:49,572] [    INFO] - weight_decay                  :0.0
[2022-11-23 14:01:49,572] [    INFO] - weight_quantize_type          :channel_wise_abs_max
[2022-11-23 14:01:49,572] [    INFO] - width_mult_list               :None
[2022-11-23 14:01:49,572] [    INFO] - world_size                    :2
[2022-11-23 14:01:49,572] [    INFO] -
[2022-11-23 14:01:49,598] [    INFO] - ***** Running training *****
[2022-11-23 14:01:49,598] [    INFO] -   Num examples = 330
[2022-11-23 14:01:49,598] [    INFO] -   Num Epochs = 100
[2022-11-23 14:01:49,598] [    INFO] -   Instantaneous batch size per device = 16
[2022-11-23 14:01:49,598] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 32
[2022-11-23 14:01:49,598] [    INFO] -   Gradient Accumulation steps = 1
[2022-11-23 14:01:49,598] [    INFO] -   Total optimization steps = 1100.0
[2022-11-23 14:01:49,598] [    INFO] -   Total num train samples = 33000.0
[2022-11-23 14:01:49,663] [    INFO] -   Number of trainable parameters = 117946370
LAUNCH INFO 2022-11-23 14:02:02,329 Exit code -6
INFO 2022-11-23 14:02:02,329 controller.py:141] Exit code -6
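
A note on reading the last two lines: if the launcher reports child exit codes using the convention of Python's `subprocess` module on POSIX (an assumption, not something the log itself states), a negative code `-N` means the worker was terminated by signal `N`. A minimal sketch for decoding it, where `describe_exit_code` is a hypothetical helper name and the signal numbering is the Linux one:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Turn a subprocess-style exit code into a readable description."""
    if code < 0:
        # Negative codes follow the POSIX subprocess convention: -N means signal N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-6))
```

On Linux, signal 6 is SIGABRT, which a process typically receives when native code calls `abort()`. That would be consistent with a crash that leaves no Python-level ERROR line in the worker logs.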

workerlog.0 and workerlog.1 contain no ERROR entries.

Is there a way to pin down the problem in a case like this? Or could you share some debugging ideas?
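
One generic option for a silent abort, offered as a sketch rather than a confirmed fix for this issue, is Python's standard `faulthandler` module, which dumps the Python traceback when the process receives a fatal signal such as SIGABRT. The example below reproduces an abort in a child process so the effect is visible; `run_with_faulthandler` is a hypothetical helper name, and the snippet assumes a POSIX system:

```python
import signal
import subprocess
import sys


def run_with_faulthandler(snippet: str):
    """Run a Python snippet in a child interpreter with faulthandler enabled.

    Returns the child's return code and everything it wrote to stderr.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    # os.abort() raises SIGABRT in the child, standing in for a native crash.
    code, err = run_with_faulthandler("import os; os.abort()")
    print("return code:", code)
    print(err)
```

For a real training run, the same switch can be applied by starting the interpreter with `-X faulthandler` or by setting the environment variable `PYTHONFAULTHANDLER=1`. Whether the launcher passes that variable on to its workers is an assumption worth checking. The traceback shows only the Python frames that were active at the time of the crash, so it narrows down the call site but not necessarily the native cause.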

linjieccc commented 2 years ago

Hi,

Does single-GPU fine-tuning fail as well? If convenient, please provide a minimal dataset that reproduces the problem.

shuiiiiiimu commented 2 years ago

On a single GPU it simply prints Aborted, with no other ERROR. The dataset is probably not suitable for sharing.

Could you share your own troubleshooting experience?

shuiiiiiimu commented 2 years ago

To locate the problem, I split the 70-example dataset into several batches, e.g. head 6 / head 12 / head 24 ...
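
The splitting described above can be scripted. This is a minimal sketch that assumes the training file holds one example per line (the thread does not state the file format); `write_head_subsets` and the `head_N` file names are hypothetical:

```python
import tempfile
from pathlib import Path


def write_head_subsets(src, sizes, out_dir):
    """Write the first N lines of `src` to `out_dir/head_N<suffix>` for each N."""
    src = Path(src)
    lines = src.read_text(encoding="utf-8").splitlines(keepends=True)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n in sizes:
        target = out / f"head_{n}{src.suffix}"
        # Slicing past the end is safe: it simply keeps every available line.
        target.write_text("".join(lines[:n]), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in data: 70 dummy lines instead of real annotated examples.
        sample = Path(tmp) / "train.txt"
        sample.write_text("".join(f"example {i}\n" for i in range(70)), encoding="utf-8")
        for path in write_head_subsets(sample, [6, 12, 24, 70], Path(tmp) / "subsets"):
            count = len(path.read_text(encoding="utf-8").splitlines())
            print(path.name, count)
```

Training on each subset in turn shows whether the failure depends on particular examples or appears regardless of which data is used.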

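Every batch exited the same way. One of the runs reported a GPU memory problem. I then did the following:
1) !nvidia-smi: nothing was using the GPU.
2) !fuser -v /dev/nvidia*: returned nothing either.
3) ps -ef showed quite a few jupyter processes. I killed them all and restarted Jupyter.
4) Lowered batch_size from 16 to 8.
5) Lowered max_seq_length from 512 to 256, to suit my own data.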

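Training directly on all 70 examples now works without problems. I don't know which of these steps was the decisive one, but after going through all of them it works.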

I'm leaving this here as a reference for anyone who runs into the same thing later.

stitchshaw commented 1 year ago

Thanks to the original poster. It worked for me after changing batch_size 32->16 and max_seq_length 198->128.
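
For a rough sense of why both reported changes help, assuming GPU memory pressure was the cause (the thread never confirms this) and that every sequence is padded to max_seq_length, the per-step footprint can be compared with simple arithmetic. Token-level activations scale with batch_size × seq_len, while attention score matrices scale with batch_size × seq_len². The helper names below are hypothetical, and the numbers are ratios, not bytes:

```python
def relative_cost(batch_size: int, seq_len: int):
    """Return (token count, attention-score entries per head) for one batch."""
    tokens = batch_size * seq_len            # grows linearly with sequence length
    attn_entries = batch_size * seq_len ** 2  # grows quadratically with sequence length
    return tokens, attn_entries


def reduction(before, after):
    """How many times smaller `after` is than `before`, for both quantities."""
    tokens_before, attn_before = relative_cost(*before)
    tokens_after, attn_after = relative_cost(*after)
    return tokens_before / tokens_after, attn_before / attn_after


if __name__ == "__main__":
    # (batch_size, max_seq_length) pairs reported in the thread.
    print(reduction((16, 512), (8, 256)))   # (4.0, 8.0)
    print(reduction((32, 198), (16, 128)))
```

Under these assumptions, the first change cuts token-level activations by 4x and attention scores by 8x. The second cuts them by roughly 3x and 4.8x. Model weights and optimizer state are unaffected by either setting.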