“f1: 0.00000, precision: 1.00000, recall: 0.00000, best f1: 0.00000”

ditingdapeng commented 3 years ago

File: bert4keras/examples/task_relation_extraction.py

Basic information

Operating system: Ubuntu 20
Python version: 3.6
Tensorflow version: 2.2.0
Keras version: 2.3.1
bert4keras version: 0.9.3
Pure keras or tf.keras:
Pretrained model loaded: Chinese-BERT-wwm

Output

After training for 10 epochs:

f1: 0.00000, precision: 1.00000, recall: 0.00000, best f1: 0.00000
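
A note on reading this output: a precision of 1.0 next to a recall of 0.0 usually means the model predicted no triples at all, not that its predictions were perfect. The following is a minimal sketch, assuming the evaluation accumulates its counts starting from a tiny epsilon to avoid division by zero; the function name `prf` and its arguments are illustrative and not taken from the script.

```python
def prf(num_correct, num_predicted, num_gold, eps=1e-10):
    """Precision/recall/F1 with epsilon-smoothed counts (illustrative helper)."""
    x = eps + num_correct    # predicted triples that are also in the gold set
    y = eps + num_predicted  # all predicted triples
    z = eps + num_gold       # all gold triples
    f1 = 2 * x / (y + z)
    precision = x / y
    recall = x / z
    return f1, precision, recall


# A model that predicts nothing: precision is eps/eps, recall is eps/num_gold.
# num_gold=1000 is an arbitrary value chosen for the illustration.
f1, precision, recall = prf(num_correct=0, num_predicted=0, num_gold=1000)
print("f1: %.5f, precision: %.5f, recall: %.5f" % (f1, precision, recall))
# prints: f1: 0.00000, precision: 1.00000, recall: 0.00000
```

Under that assumption, the line above says nothing was extracted on the dev set, which points to training not converging rather than to a crash.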

WatsonWangZh commented 3 years ago

tf 1.14 may help you out.

ditingdapeng commented 3 years ago

> tf 1.14 may help you out.

Thanks for the reply! But when I use tf 1.14 with keras 2.3.1, I get the warning "Use tf.where in 2.0, which has the same broadcast rule as np.where", and right after that the program stops. How should I fix this?

ditingdapeng commented 3 years ago

Warning 1: Use tf.where in 2.0, which has the same broadcast rule as np.where; Warning 2: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

But the locations they point to are inside the tensorflow package. How should I change them?

ditingdapeng commented 3 years ago

Following the path in the message, I switched to TensorFlow 1.5, but that still did not solve it.

ditingdapeng commented 3 years ago

After switching to TensorFlow 2.2.0 it runs, but I still get the "f1: 0.00000, precision: 1.00000, recall: 0.00000, best f1: 0.00000" error.

bojone commented 3 years ago

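I can't solve this problem for you, but here are some observations for your reference:

1. A warning is not an error; warnings are perfectly normal in tf.
2. So I don't know what is behind "right after that the program stops", but it certainly has nothing to do with the warnings. I suggest you investigate it yourself.
3. tf 1.14 + keras 2.3.1 is the environment I developed with, so in principle it should be the least error-prone.
4. "f1: 0.00000, precision: 1.00000, recall: 0.00000, best f1: 0.00000" is not an error; it just means the model failed to optimize.
5. Failure to optimize is a matter of "luck", which I currently cannot solve.
6. Yes, it really is a matter of "luck": with the same script and the same environment, some people succeeded (the loss dropped below 1 in the first epoch), some could never get it to work (the loss simply would not drop), and some only succeeded after persisting for seventy or eighty epochs (the loss came down only after many epochs).
7. To stress it again, this is a matter of "luck". Here are some ideas for "trying your luck":

If you switched to your own data, make sure to run more epochs (the original data has about 170,000 samples; if your own data has only 10,000, you need 17 epochs to match 1 epoch on the original data);
If you are using the original data directly, you can also try persisting for more epochs;
Try removing ema;
Try removing dropout or increasing dropout;
Switch to a different optimizer;
Any other way you can think of to change the random factors;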
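
The epoch arithmetic in the first suggestion above can be sketched as follows. This is only an illustration of the proportion described there; the function name `equivalent_epochs` and its parameters are hypothetical, and the sample counts are the rough figures quoted in the reply.

```python
import math


def equivalent_epochs(original_size, own_size, original_epochs=1):
    """Epochs on a smaller dataset that cover as many samples as
    `original_epochs` passes over the original dataset."""
    if own_size <= 0:
        raise ValueError("own_size must be positive")
    return int(math.ceil(original_epochs * original_size / float(own_size)))


# About 170,000 original samples versus 10,000 of your own.
print(equivalent_epochs(170000, 10000))  # 17
```

The same proportion applies to any epoch budget: matching 7 epochs on the original data with 10,000 samples would take roughly 119 epochs.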

ditingdapeng commented 3 years ago

谢谢苏神回复，我排查下原因！

ditingdapeng commented 3 years ago

/home/think/anaconda3/envs/qa_py2.7/bin/python2.7 /home/think/Code/NER/ner_extract.py
Using TensorFlow backend.
2020-11-28 13:20:12.850430: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-28 13:20:12.885629: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-11-28 13:20:12.886703: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560a632e7ab0 executing computations on platform Host. Devices:
2020-11-28 13:20:12.886759: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2020-11-28 13:20:13.101683: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
/home/think/anaconda3/envs/qa_py2.7/lib/python2.7/site-packages/keras/engine/training_utils.py:819: UserWarning: Output total_loss_1 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to total_loss_1.
'be expecting any data to be passed to {0}.'.format(name))
WARNING:tensorflow:From /home/think/anaconda3/envs/qa_py2.7/lib/python2.7/site-packages/tensorflow/python/ops/math_grad.py:1250: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/think/anaconda3/envs/qa_py2.7/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Epoch 1/20

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I also tried Python 2.7, TensorFlow 1.14, and keras==2.3.1; the above is the result of running the script directly. It stopped during the first epoch. What could be the cause?
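
For readers decoding the log above: exit code 137 follows the shell convention of 128 plus the signal number, so it means the process was killed from outside with signal 9 (SIGKILL) rather than failing inside Python. On Linux a common source is the kernel's out-of-memory killer, although the log alone does not prove that. A minimal sketch of the decoding, written for Python 3.5+ (`signal.Signals` is not available in Python 2.7); the helper name `describe_exit_code` is hypothetical.

```python
import signal


def describe_exit_code(code):
    """Explain a shell-style exit status; values above 128 conventionally
    mean 128 + the number of the signal that ended the process."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = "unknown signal"
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % code


print(describe_exit_code(137))
```

If memory is the suspect, watching memory usage while the first batches run, or lowering the batch size, is a quick way to check.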

ditingdapeng commented 3 years ago

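Hello Su, the earlier problem was running out of memory; it was solved once I reduced batch_size. Training has now reached 53865/86554 with loss: 0.0338, and the loss has stayed at 0.0338 without dropping further. If I stop now, will the model be saved, or how can I save it?

Thanks for the reply!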

bojone commented 3 years ago

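> Hello Su, the earlier problem was running out of memory; it was solved once I reduced batch_size. Training has now reached 53865/86554 with loss: 0.0338, and the loss has stayed at 0.0338 without dropping further. If I stop now, will the model be saved, or how can I save it?
>
> Thanks for the reply!

It is saved once per epoch; nothing is saved before an epoch has finished.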
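
The per-epoch saving described here can be pictured with the sketch below. It shows one common pattern (save at the end of an epoch, and only when the validation score improves), not necessarily the exact logic of the script; the class name `BestCheckpoint`, the `save_fn` parameter, and the scores in the loop are all made up for illustration.

```python
class BestCheckpoint(object):
    """Calls `save_fn` only when the monitored score improves (illustrative helper)."""

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.best = float("-inf")

    def on_epoch_end(self, score):
        """Call once per finished epoch; returns True if a checkpoint was written."""
        if score > self.best:
            self.best = score
            self.save_fn()
            return True
        return False


saved = []
ckpt = BestCheckpoint(save_fn=lambda: saved.append("weights saved"))
for f1 in [0.0, 0.41, 0.39, 0.57]:  # made-up validation scores, one per epoch
    ckpt.on_epoch_end(f1)
print(len(saved))  # 3
```

In a Keras training loop, a hook like this would be called from a callback's `on_epoch_end`, with `save_fn` wrapping `model.save_weights(path)` for a path of your choosing. Because the hook only fires when an epoch completes, interrupting training mid-epoch leaves no new file behind.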

ditingdapeng commented 3 years ago

Thanks for the reply, Su!

ditingdapeng commented 3 years ago

Su, where is the model file saved?

ditingdapeng commented 3 years ago

What I mean is: the model parameters are saved with save_weights, so how do I use them as a model?

moriwang commented 3 years ago

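> it just means the model failed to optimize

Regarding point 6, I can contribute some data. I have run this example many times on different machines, and I think the most important thing is not to use tf 2.0 or above, which is full of bugs. With tf1.15 + keras 2.3.1 on the original data, the loss gets below 1 in the first epoch, and the best result is reached after about 7 epochs. With less than 12 GB of GPU memory, training may run out of memory partway through.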

ditingdapeng commented 3 years ago

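> it just means the model failed to optimize
>
> Regarding point 6, I can contribute some data. I have run this example many times on different machines, and I think the most important thing is not to use tf 2.0 or above, which is full of bugs. With tf1.15 + keras 2.3.1 on the original data, the loss gets below 1 in the first epoch, and the best result is reached after about 7 epochs. With less than 12 GB of GPU memory, training may run out of memory partway through.

Many thanks for the reply! After switching to Su's original environment, it now works (py2.7). But I still have some unresolved questions:

How do I load this model and use it for predict?
Also, what should the arguments to predict be? (It looks like it takes sequence data; can't I pass text directly?)

Thanks to Su and everyone for the answers!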

bojone commented 3 years ago

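> How do I load this model and use it for predict?
>
> Also, what should the arguments to predict be? (It looks like it takes sequence data; can't I pass text directly?)

Are you new to programming? The script already contains the whole pipeline: training, prediction, and scoring. And you come asking this kind of question? Have you actually understood the whole script?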

ditingdapeng commented 3 years ago

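Thanks for the guidance, Su. I have now obtained predictions with extract_spoes, but when printing them I ran into a strange encoding problem:

According to chardet, the first entity is TIS-620 encoded, the second is Unicode, and the third is ISO.

Why would three different encodings show up?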
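
One possible explanation, not confirmed anywhere in this thread: chardet guesses an encoding statistically from the bytes it sees, and on strings only a few characters long those guesses are unreliable, so three short entities can receive three different labels even if all of them are UTF-8. The sketch below shows why byte content alone is ambiguous: the same bytes decode without error under more than one encoding. The sample string is an assumption chosen for illustration.

```python
# -*- coding: utf-8 -*-
text = u"北京"
raw = text.encode("utf-8")         # the bytes as they would be written out

as_utf8 = raw.decode("utf-8")      # the intended reading
as_latin1 = raw.decode("latin-1")  # also succeeds, but yields mojibake

print(as_utf8 == text)             # True
print(as_latin1 == text)           # False
print(len(raw), len(as_latin1))    # 6 6
```

If the output of the script is known to be UTF-8, decoding it explicitly with that encoding avoids relying on detection at all.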

dtMndas commented 2 years ago

I ran into the same problem. How did you solve it?

bojone / bert4keras

“f1: 0.00000, precision: 1.00000, recall: 0.00000, best f1: 0.00000” #257
