bojone / bert4keras

keras implement of transformers for humans
https://kexue.fm/archives/6915
Apache License 2.0

Pretraining RoBERTa fails after updating the bert4keras package to the latest version #84

Closed · Fan9 closed this 4 years ago

Fan9 commented 4 years ago

File "/root/anaconda3/envs/tf2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap self.run() File "/root/anaconda3/envs/tf2/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs)

python 3.6 keras 2.3.1 tensorflow 1.13.2 运行data_utiils.py报错 File "/root/anaconda3/envs/tf2/lib/python3.6/multiprocessing/pool.py", line 103, in worker initializer(*initargs) File "/data/sfang/Pretrain/bert4keras/snippets.py", line 166, in worker_step r = func(d) File "/data/sfang/Pretrain/pretraining/data_utils.py", line 117, in paragraph_process instances = self.paragraph_process(texts) File "/data/sfang/Pretrain/pretraining/data_utils.py", line 209, in paragraph_process return super(TrainingDatasetRoBERTa, self).paragraph_process(texts, starts, ends, paddings) File "/data/sfang/Pretrain/pretraining/data_utils.py", line 53, in paragraph_process sub_instance = self.sentence_process(text) File "/data/sfang/Pretrain/pretraining/data_utils.py", line 188, in sentence_process add_sep=False) TypeError: tokenize() got an unexpected keyword argument 'add_cls'

bojone commented 4 years ago

Just fixed it.
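(Editor's note, for anyone hitting the same TypeError: the fix tracks an API change in which tokenize() no longer accepts add_cls/add_sep keywords. A minimal sketch, assuming a recent bert4keras where the start/end tokens are configured on the Tokenizer constructor instead; dict_path is a placeholder for your vocab file:)

    from bert4keras.tokenizers import Tokenizer

    # token_start/token_end default to '[CLS]'/'[SEP]'; passing None drops them,
    # which plays the role of the old per-call add_cls=False/add_sep=False.
    tokenizer = Tokenizer(dict_path, do_lower_case=True,
                          token_start=None, token_end=None)
    tokens = tokenizer.tokenize(u'科学空间')  # no add_cls/add_sep keywords anymore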

Fan9 commented 4 years ago

Thanks, 苏神. data_utils.py now tests fine, but running pretraining.py raises an error:

Traceback (most recent call last):
  File "pretraining.py", line 212, in <module>
    train_model = build_bert_model_for_pretraining()
  File "pretraining.py", line 189, in build_bert_model_for_pretraining
    optimizer = optimizer(optimizer_params)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/bert4keras/optimizers.py", line 442, in __init__
    super(new_optimizer, self).__init__(*args, **kwargs)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/bert4keras/optimizers.py", line 353, in __init__
    super(new_optimizer, self).__init__(*args, **kwargs)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/bert4keras/optimizers.py", line 255, in __init__
    super(new_optimizer, self).__init__(*args, **kwargs)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/bert4keras/optimizers.py", line 149, in __init__
    super(new_optimizer, self).__init__(*args, **kwargs)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/bert4keras/optimizers.py", line 25, in __init__
    super(Adam, self).__init__(**kwargs)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 68, in __init__
    'passed to optimizer: ' + str(k))
TypeError: Unexpected keyword argument passed to optimizer: name

This looked like a TensorFlow version problem: after upgrading tensorflow from 1.13.2 to 1.14.0 the optimizer error no longer appears, but now the following error shows up:

mlm_loss (Lambda)    ()    0    token_ids[0][0], MLM-Proba[0][0], is_masked[0][0]

mlm_acc (Lambda)     ()    0    token_ids[0][0], MLM-Proba[0][0], is_masked[0][0]

Total params: 325,545,608
Trainable params: 325,545,608
Non-trainable params: 0


2020-03-12 15:04:21.711097: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

Traceback (most recent call last):
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: buffer_size must be greater than zero.
  [[{{node ShuffleDataset_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "pretraining.py", line 233, in callbacks=[checkpoint, csv_logger], File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 649, in fit validation_freq=validation_freq) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_distributed.py", line 143, in fit_distributed steps_name='steps_per_epoch') File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 142, in model_iteration input_iterator = _get_iterator(inputs, model._distribution_strategy) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 516, in _get_iterator inputs, distribution_strategy) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/distribute/distributed_training_utils.py", line 534, in get_iterator initialize_iterator(iterator, distribution_strategy) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/distribute/distributed_training_utils.py", line 542, in initialize_iterator K.get_session((init_op,)).run(init_op) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run run_metadata_ptr) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run feed_dict_tensor, options, run_metadata) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run run_metadata) File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: buffer_size must be greater than zero. [[node ShuffleDataset_1 (defined at pretraining.py:233) ]]

Errors may have originated from an input operation.
Input Source operations connected to node ShuffleDataset_1:
 seed (defined at /data/sfang/Pretrain/pretraining/data_utils.py:142)

Original stack trace for 'ShuffleDataset_1':
  File "pretraining.py", line 233, in <module>
    callbacks=[checkpoint, csv_logger],
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 649, in fit
    validation_freq=validation_freq)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_distributed.py", line 143, in fit_distributed
    steps_name='steps_per_epoch')
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 142, in model_iteration
    input_iterator = _get_iterator(inputs, model._distribution_strategy)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 516, in _get_iterator
    inputs, distribution_strategy)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/distribute/distributed_training_utils.py", line 533, in get_iterator
    iterator = distribution_strategy.make_dataset_iterator(dataset)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 732, in make_dataset_iterator
    return self._extended._make_dataset_iterator(dataset)  # pylint: disable=protected-access
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 591, in _make_dataset_iterator
    split_batch_by=self._num_replicas_in_sync)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/input_lib.py", line 600, in __init__
    input_context=input_context)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/input_lib.py", line 491, in __init__
    **kwargs)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/input_lib.py", line 400, in __init__
    cloned_dataset = input_ops._clone_dataset(dataset)  # pylint: disable=protected-access
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/input_ops.py", line 57, in _clone_dataset
    remap_dict = _clone_helper(dataset._variant_tensor.op, variant_tensor_ops)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/input_ops.py", line 82, in _clone_helper
    recursive_map = _clone_helper(input_tensor_op, variant_tensor_ops)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/input_ops.py", line 82, in _clone_helper
    recursive_map = _clone_helper(input_tensor_op, variant_tensor_ops)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/distribute/input_ops.py", line 98, in _clone_helper
    op_def=_get_op_def(op_to_clone))
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
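(Editor's note: this InvalidArgumentError is the generic tf.data failure raised when a ShuffleDataset is built with a non-positive buffer_size. A minimal reproduction of the same error class under TF 1.x, purely as an illustration and not the project's actual code:)

    import tensorflow as tf

    # Building a ShuffleDataset with buffer_size=0 reproduces the same error:
    # InvalidArgumentError: buffer_size must be greater than zero.
    dataset = tf.data.Dataset.range(10).shuffle(buffer_size=0)
    iterator = dataset.make_one_shot_iterator()  # TF 1.x graph-mode API
    with tf.Session() as sess:
        sess.run(iterator.get_next())  # the error surfaces when the dataset op runs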

Fan9 commented 4 years ago

Not sure whether this is a version problem. Current versions: tensorflow-gpu 1.14.0, python 3.6, keras 2.3.1.

bojone commented 4 years ago

Not sure; I use the same TF version.

Fan9 commented 4 years ago

What's your Keras version, 苏神? I'll set up exactly the same environment and debug again, to rule out version problems.

bojone commented 4 years ago

> What's your Keras version, 苏神? I'll set up exactly the same environment and debug again, to rule out version problems.

Pretraining doesn't use keras at all; it runs entirely on tf.keras.

Fan9 commented 4 years ago

I've got it working and it runs successfully now. I really need to read the code more carefully, otherwise I'll waste even more time. Thanks!

bojone commented 4 years ago

> I've got it working and it runs successfully now. I really need to read the code more carefully, otherwise I'll waste even more time. Thanks!

Congratulations, well done!

Fan9 commented 4 years ago

How do I use the model files saved after continued pretraining? At the moment I replace the official ckpt files with my trained model files and keep the json and vocab files, but loading then fails. Do I need to modify the part that saves the model weights?

2020-03-19 13:29:24.795751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10470 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
==> searching: bert/embeddings/token_type_embeddings, found name: layer_with_weights-1/embeddings/.ATTRIBUTES/VARIABLE_VALUE
==> searching: bert/embeddings/position_embeddings, found name: layer_with_weights-2/embeddings/.ATTRIBUTES/VARIABLE_VALUE
==> searching: bert/embeddings/LayerNorm/gamma, found name: layer_with_weights-0/embeddings/.OPTIMIZER_SLOT/optimizer/m/.ATTRIBUTES/VARIABLE_VALUE
==> searching: bert/embeddings/LayerNorm/beta, found name: layer_with_weights-0/embeddings/.OPTIMIZER_SLOT/optimizer/v/.ATTRIBUTES/VARIABLE_VALUE

Traceback (most recent call last):
  File "train_googloe_bert.py", line 75, in <module>
    return_keras_model=False,
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/bert4keras/bert.py", line 535, in build_bert_model
    bert.load_weights_from_checkpoint(checkpoint_path)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/bert4keras/bert.py", line 399, in load_weights_from_checkpoint
    K.batch_set_value(zip(weights, values))
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2960, in batch_set_value
    tf_keras_backend.batch_set_value(tuples)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 2875, in batch_set_value
    assign_op = x.assign(assign_placeholder)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 915, in assign
    self._shape.assert_is_compatible_with(value_tensor.shape)
  File "/root/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 1023, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (768,) and (21128, 768) are incompatible

The weight shapes don't match.

Fan9 commented 4 years ago

When loading it back, is keras.load_models() the only option, with build_transformer_model no longer usable?

bojone commented 4 years ago

You can rebuild the model, load the saved weights into it, and then use the save_weights_as_checkpoint method to export a ckpt in the same format as the official weights.

Fan9 commented 4 years ago

Yep, it worked. One more question: during pretraining the MLM accuracy climbed steadily from 46% to over 90%, but when I then used the further-pretrained model on downstream tasks, the results were very poor. What might be causing this?

bojone commented 4 years ago

> Yep, it worked. One more question: during pretraining the MLM accuracy climbed steadily from 46% to over 90%, but when I then used the further-pretrained model on downstream tasks, the results were very poor. What might be causing this?

Normally MLM accuracy only reaches around 50-60%; getting to 90% is a bit abnormal...

zjcanjux commented 4 years ago

> save_weights_as_checkpoint

class ModelCheckpoint(keras.callbacks.Callback):
    """Automatically save the latest model."""
    def on_epoch_end(self, epoch, logs=None):
        self.model.save_weights(model_saved_path, overwrite=True)
        # proposed change:
        self.model.save_weights_as_checkpoint(model_saved_path)

苏神, I'd like to ask again about loading the model after continued pretraining. I've finished the second round of pretraining but can't load the result. About the save_weights_as_checkpoint you mentioned: should the change be made in this method in pretraining.py, so that the generated ckpt can then be loaded with build_transformer_model? If not, could you explain the conversion and loading in a little more detail? Many thanks.

bojone commented 4 years ago

> build_transformer_model

1. Build the same model;
2. load your previously saved weights with bert.model.load_weights;
3. save them as a new checkpoint with bert.save_weights_as_checkpoint.

If you really can't figure it out, read the source of bert4keras/models.py a few more times~
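(Editor's note: a minimal sketch of those three steps, assuming the pretraining setup used with_mlm=True and with hypothetical file names for the config, the saved weights, and the target checkpoint:)

    from bert4keras.models import build_transformer_model

    config_path = 'bert_config.json'      # same config as during pretraining
    weights_path = 'bert_model.weights'   # saved via model.save_weights(...)
    ckpt_path = 'bert_model.ckpt'         # output in the official ckpt format

    # 1. Build the same model; return_keras_model=False returns the bert4keras
    #    wrapper object, which exposes save_weights_as_checkpoint.
    bert = build_transformer_model(
        config_path=config_path,
        with_mlm=True,                    # must match the pretraining model
        return_keras_model=False,
    )

    # 2. Load the weights saved during pretraining.
    bert.model.load_weights(weights_path)

    # 3. Export a checkpoint with the official variable names; it can then be
    #    loaded as usual via build_transformer_model(..., checkpoint_path=ckpt_path).
    bert.save_weights_as_checkpoint(ckpt_path)

The exported ckpt, together with the original json and vocab files, can then be used like the official weights.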

zjcanjux commented 4 years ago

> build_transformer_model
>
> 1. Build the same model;
> 2. load your previously saved weights with bert.model.load_weights;
> 3. save them as a new checkpoint with bert.save_weights_as_checkpoint.
>
> If you really can't figure it out, read the source of bert4keras/models.py a few more times~

Thanks, it loads now.

dolphin-Jia commented 2 years ago

> Yep, it worked. One more question: during pretraining the MLM accuracy climbed steadily from 46% to over 90%, but when I then used the further-pretrained model on downstream tasks, the results were very poor. What might be causing this?

How did you solve this in the end? I really can't track down the error.