JasonForJoy / SA-BERT

CIKM 2020: Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots

InvalidArgumentError: Key: segment_ids. Can't parse serialized Example. #4

Closed cccccs closed 4 years ago

cccccs commented 4 years ago

Hi, I get the following error when running. With max_seq_length = 512 I ran out of memory, so I changed max_seq_length to 300, and then got the error `segment_ids. Can't parse serialized Example.` I don't know where the problem is; my Python and TensorFlow versions also match the required ones.

```
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10130 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:05.0, compute capability: 7.5)
INFO:tensorflow:Epoch 0 training begin
2020-05-27 07:48:15.881624: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-05-27 07:48:16.499166: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: segment_ids. Can't parse serialized Example.
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Key: segment_ids. Can't parse serialized Example.
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[{{node IteratorGetNext}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../train.py", line 456, in <module>
    tf.app.run()
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "../train.py", line 446, in main
    run_epoch(epoch, "train", sess, training, logits, accuracy, mean_loss, train_opt)
  File "../train.py", line 284, in run_epoch
    batch_logits, batch_loss, _, accur = sess.run([logits, mean_loss, train_opt, accuracy], feed_dict={training:True})
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Key: segment_ids. Can't parse serialized Example.
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[node IteratorGetNext (defined at ../train.py:400) ]]
```

JasonForJoy commented 4 years ago

@cccccs When you change the max_seq_length parameter, it needs to be modified in two places: (1) the data preprocessing script, i.e. data/Ubuntu_V1_Xu/data_preprocess.py, and (2) the training script, i.e. scripts/ubuntu_v1_train.sh. The two values must be consistent.
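For context on why the two values must match: the TFRecords store each field as a fixed-length int64 list of max_seq_length entries, and the training input pipeline declares that same length when parsing. If the two lengths differ, parsing fails with exactly the error above. A minimal sketch of the mechanism, with assumed feature names (only `segment_ids` is confirmed by the error message; the parsing setup here is illustrative, not copied from the repo):

```python
import tensorflow as tf

max_seq_length = 300  # must equal the value used by data_preprocess.py

# FixedLenFeature declares an exact list length. If the TFRecords were
# written with 512-element lists but parsed with [300] (or vice versa),
# parsing raises: "Key: segment_ids. Can't parse serialized Example."
feature_spec = {
    "input_ids":   tf.io.FixedLenFeature([max_seq_length], tf.int64),
    "input_mask":  tf.io.FixedLenFeature([max_seq_length], tf.int64),
    "segment_ids": tf.io.FixedLenFeature([max_seq_length], tf.int64),
}

def parse_fn(serialized_example):
    return tf.io.parse_single_example(serialized_example, feature_spec)
```

So after changing max_seq_length in only one of the two scripts, the stored and declared lengths disagree, which is the InvalidArgumentError reported above.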

cccccs commented 4 years ago

Hi, after modifying max_seq_length the way you described, training now runs into the following issue: log_train.txt stops growing. The end of the file is shown below; no error is reported, and checking the GPU shows it is still busy at 99% utilization (`8470MiB / 11176MiB | 99% Default`).

```
2020-05-28 06:41:22.495544: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-05-28 06:41:22.496273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-28 06:41:22.496287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2020-05-28 06:41:22.496294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2020-05-28 06:41:22.496369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-28 06:41:22.496692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-28 06:41:22.496986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
I0528 06:41:35.293003 139756198102784 train.py:443] Epoch 0 training begin
2020-05-28 06:41:35.454630: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2020-05-28 06:41:40.097569: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
```

JasonForJoy commented 4 years ago

@cccccs This run is actually working normally. The computation is heavy, so output appears slowly. If you want training progress printed more frequently, you can modify train.py line 238: `if n_updates % 2000 == 0`
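In other words, only the guard on the progress printout needs loosening. A hypothetical edit at that spot (the variable names follow the traceback and the log format shown later in this thread; the print body itself is illustrative, not the repo's exact code):

```python
# train.py, around line 238 (per the author): report every 500 updates
# instead of every 2000, so log_train.txt grows more often.
if n_updates % 500 == 0:
    print("epoch: %d n_update %d, loss: %.4f" % (epoch, n_updates, batch_loss))
```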

cccccs commented 4 years ago

OK, thanks.

cccccs commented 4 years ago

Hi, one more question: if I add an extra network layer after the BERT model, will saving the model also save that layer's parameters? And for testing, is it enough to simply change the model in test.py to match the one in train.py?

JasonForJoy commented 4 years ago

@cccccs Yes, the parameters will be saved. The models in test and train must be identical.
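For anyone else wondering why both halves of that answer hold: in TF 1.x a default `tf.train.Saver()` checkpoints every variable in the graph, so a layer added on top of BERT is included automatically; restoring in test.py then requires the test graph to define the same layer under the same variable scope, or restore fails on the missing variables. A minimal sketch under those assumptions (the placeholder and dense layer stand in for the real BERT output and whatever network was added):

```python
import tensorflow as tf

# Stand-in for the BERT pooled output; in the real code this comes from
# the BERT graph itself.
bert_output = tf.placeholder(tf.float32, [None, 768], name="bert_pooled")

# Hypothetical extra layer added after BERT; its kernel/bias variables
# live under the "extra_layer" scope.
logits = tf.layers.dense(bert_output, 2, name="extra_layer")

# A default Saver covers ALL global variables: BERT's and extra_layer's.
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "./model.ckpt")
    # test.py must rebuild the same graph (including extra_layer) before
    # calling saver.restore(sess, "./model.ckpt").
```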

cccccs commented 4 years ago

Hi, one more question: how many steps does one epoch take? I've been training on my 1080 for a day and one epoch still hasn't finished; from the code it looks like an epoch just loops with `while True`. After almost a day, the training output is:

```
epoch: 0 n_update 116000 , train: Mins Used: 1175.91, Loss: 0.4219, Accuarcy: 51.02
```

This is a rented server and it's burning money; at this rate, 5 epochs would take about a week. My parameters:

```
--max_seq_length 300 \
--train_batch_size 7 \
--eval_batch_size 7 \
```

JasonForJoy commented 4 years ago

steps per epoch = data_size / batch_size = 1,000,000 / 7 ≈ 142,858

cccccs commented 4 years ago

Hi, does batch_size need to be a divisor of data_size, so that the last batch also contains exactly batch_size examples? It may be because the network I added takes batch_size as a parameter: training now appears to fail on the final batch, presumably because it holds fewer than batch_size examples (1,000,000 % 7 == 1). Could you tell me the sizes of your training, validation, and test sets? Thanks.

```
(0) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [1,768] vs. shape[1] = [7,768]
	 [[node rnn/while/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/concat (defined at /tmp/tmpNmdxXg.py:44) ]]
	 [[loss/add/_847]]
(1) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [1,768] vs. shape[1] = [7,768]
```

JasonForJoy commented 4 years ago

@cccccs batch_size does not need to be a divisor of data_size. For Ubuntu V1, train/dev/test = 1M/0.5M/0.5M.
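If the added layer cannot handle a smaller final batch, two standard TF 1.x workarounds exist (generic suggestions, not code from this repo): drop the remainder batch in the input pipeline, or read the batch size from the tensor at runtime instead of hard-coding it.

```python
import tensorflow as tf

# (a) Drop the final partial batch so every batch has exactly batch_size
# examples (since 1,000,000 % 7 == 1, one example per epoch is skipped):
dataset = tf.data.TFRecordDataset(["train.tfrecord"])  # hypothetical filename
dataset = dataset.batch(7, drop_remainder=True)

# (b) Or make the added layer batch-size-agnostic by reading the size
# dynamically, e.g. when building an LSTM initial state:
x = tf.placeholder(tf.float32, [None, 768])
dynamic_batch = tf.shape(x)[0]             # 7 for full batches, 1 for the last
init_state = tf.zeros([dynamic_batch, 768])
```

Option (b) avoids the ConcatOp shape mismatch above because nothing in the graph assumes a fixed batch dimension of 7.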