bojone / bert4keras

Keras implementation of Transformers for humans
https://kexue.fm/archives/6915
Apache License 2.0

Error when running pretraining.py, not sure what's causing it #351

Open Copilot-X opened 3 years ago

Copilot-X commented 3 years ago

When asking a question, please provide as much of the following information as possible:

Basic information

Core code

# The code consists of only the two files data_utils.py and pretraining.py, with almost no modifications apart from the dataset
But running pretraining.py raises an error. I'm not sure whether a version mismatch is the cause; hoping someone can point me in the right direction.

Output

2021-05-29 17:39:22.321821: W tensorflow/core/grappler/utils/graph_view.cc:830] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}
        .  Registered:  device='CPU'

2021-05-29 17:39:22.322607: W tensorflow/core/grappler/utils/graph_view.cc:830] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}
        .  Registered:  device='CPU'

Traceback (most recent call last):
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: buffer_size must be greater than zero.
         [[{{node ShuffleDataset_1}}]]
  (1) Invalid argument: buffer_size must be greater than zero.
         [[{{node ShuffleDataset_1}}]]
         [[MultiDeviceIteratorInit/_2057]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pretraining.py", line 325, in <module>
    callbacks=[checkpoint, csv_logger],
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 685, in fit
    steps_name='steps_per_epoch')
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 144, in model_iteration
    input_iterator = _get_iterator(inputs, model._distribution_strategy)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 550, in _get_iterator
    inputs, distribution_strategy)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 588, in get_iterator
    initialize_iterator(iterator, distribution_strategy)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 596, in initialize_iterator
    K.get_session((init_op,)).run(init_op)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: buffer_size must be greater than zero.
         [[node ShuffleDataset_1 (defined at /mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Invalid argument: buffer_size must be greater than zero.
         [[node ShuffleDataset_1 (defined at /mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[MultiDeviceIteratorInit/_2057]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ShuffleDataset_1':
  File "pretraining.py", line 325, in <module>
    callbacks=[checkpoint, csv_logger],
  File "/mnt/data//Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 685, in fit
    steps_name='steps_per_epoch')
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 144, in model_iteration
    input_iterator = _get_iterator(inputs, model._distribution_strategy)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 550, in _get_iterator
    inputs, distribution_strategy)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 587, in get_iterator
    iterator = distribution_strategy.make_dataset_iterator(dataset)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1008, in make_dataset_iterator
    return self._extended._make_dataset_iterator(dataset)  # pylint: disable=protected-access
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 537, in _make_dataset_iterator
    split_batch_by=self._num_replicas_in_sync)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_lib.py", line 767, in __init__
    input_context=input_context)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_lib.py", line 563, in __init__
    input_context=input_context)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_lib.py", line 521, in __init__
    cloned_dataset = input_ops._clone_dataset(dataset)  # pylint: disable=protected-access
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_ops.py", line 57, in _clone_dataset
    remap_dict = _clone_helper(dataset._variant_tensor.op, variant_tensor_ops)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_ops.py", line 81, in _clone_helper
    recursive_map = _clone_helper(input_tensor_op, variant_tensor_ops)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_ops.py", line 81, in _clone_helper
    recursive_map = _clone_helper(input_tensor_op, variant_tensor_ops)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_ops.py", line 81, in _clone_helper
    recursive_map = _clone_helper(input_tensor_op, variant_tensor_ops)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/distribute/input_ops.py", line 97, in _clone_helper
    op_def=_get_op_def(op_to_clone))
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/mnt/data/faker/Relation_Extraction/venv_dir/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

My own attempts

Whatever the problem, please try to solve it yourself first, and only ask after every effort has failed. Paste your attempts here. I have switched between many versions, but none of them worked; I'm at a loss and hoping for some guidance.

bojone commented 3 years ago

It's a data problem: the error shows there is no data...

Copilot-X commented 3 years ago

It's a data problem: the error shows there is no data...

Mr. Su, after reading the code I found the problem in my parameter settings: batch_size is smaller than grad_accum_steps, so the resulting data is always 0. [screenshot attachment]
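The failure mode described above can be sketched in plain Python (the values and the exact division site inside the script are assumptions, not taken from the repo): if the per-step micro-batch is derived from the effective batch size by integer division, a batch_size smaller than grad_accum_steps collapses it to zero, and tf.data then rejects the zero-sized shuffle buffer.

```python
# Sketch of the suspected failure (hypothetical values): the per-step
# micro-batch comes from integer division, so batch_size < grad_accum_steps
# collapses it to zero.
batch_size = 8          # effective batch size (too small here)
grad_accum_steps = 16   # gradient accumulation steps

micro_batch = batch_size // grad_accum_steps  # 8 // 16 == 0
print(micro_batch)  # 0

# Any tf.data op sized by this value, e.g. dataset.shuffle(micro_batch)
# or dataset.batch(micro_batch), then fails with
# "buffer_size must be greater than zero".
```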

One more question: I see your batch_size is set to 4096. With a batch size that large, did you train on TPU? If using a 24 GB GPU, would setting both batch_size and grad_accum_steps to 8 be reasonable?
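As a rough sanity check on those numbers, here is a hedged bookkeeping sketch, under the assumption (suggested by the error above) that batch_size is the effective batch and the per-step micro-batch is batch_size // grad_accum_steps:

```python
# Hedged sketch of gradient-accumulation bookkeeping (not taken from the
# repo): batch_size is assumed to be the effective batch aggregated over
# grad_accum_steps accumulation steps.
def micro_batch(batch_size, grad_accum_steps):
    """Samples processed per forward/backward pass."""
    return batch_size // grad_accum_steps

# batch_size=8 with grad_accum_steps=8 is at least valid (micro-batch of 1),
# but each optimizer update then aggregates only 8 samples in total.
print(micro_batch(8, 8))       # 1

# To approximate the 4096 effective batch on a 24 GB GPU, one would instead
# keep batch_size at 4096 and raise grad_accum_steps until the micro-batch
# fits in memory, e.g. 4096 // 512 = 8 samples per pass.
print(micro_batch(4096, 512))  # 8
```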