asyml / texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
https://asyml.io
Apache License 2.0

BERTTokenizer is not properly truncating overly long inputs #264

Closed · Bastian closed this issue 4 years ago

Bastian commented 4 years ago

According to the documentation of encode_text(), the BERTTokenizer should automatically truncate sequences that are too long. However, the code below fails with an exception. Am I misunderstanding the documentation, or is this a bug?

(I'm using Texar version 0.2.4)

Code

import texar.tf as tx

example_text = 'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, ' \
               'sed diam nonumy eirmod tempor invidunt ut labore et dolore ' \
               'magna aliquyam erat, sed diam voluptua.'

# Repeatedly double the text so it tokenizes to far more than 512 tokens.
for i in range(5):
    example_text = example_text + ' ' + example_text

tokenizer = tx.data.BERTTokenizer(pretrained_model_name='bert-base-uncased')

# Per the docs, this should truncate the input to max_seq_length.
input_ids, segment_ids, input_mask = tokenizer.encode_text(
    text_a=example_text, max_seq_length=512)

print(input_ids)

Exception

2019-12-21 16:31:15.713648: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
Using cached pre-trained BERT checkpoint from C:\Users\Bastian\AppData\Roaming\texar_data\BERT\bert-base-uncased.
Traceback (most recent call last):
  File "E:/Workspaces/PycharmProjects/Texar-Test/example.py", line 11, in <module>
    input_ids, segment_ids, input_mask = tokenizer.encode_text(text_a=example_text, max_seq_length=512)
  File "E:\Workspaces\PycharmProjects\Texar-Test\venv\lib\site-packages\texar\tf\data\tokenizers\bert_tokenizer.py", line 197, in encode_text
    token_ids_a = self.map_text_to_id(text_a)
  File "E:\Workspaces\PycharmProjects\Texar-Test\venv\lib\site-packages\texar\tf\data\tokenizers\tokenizer_base.py", line 410, in map_text_to_id
    return self.map_token_to_id(self.map_text_to_token(text))
  File "E:\Workspaces\PycharmProjects\Texar-Test\venv\lib\site-packages\texar\tf\data\tokenizers\tokenizer_base.py", line 386, in map_token_to_id
    "errors".format(len(ids), self.max_len))
ValueError: Token indices sequence length is longer than the specified maximum sequence length for this model (1856 > 512). Running this sequence through the model will result in indexing errors
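Until this is fixed, one possible workaround is to truncate the token list yourself before converting to ids, using the tokenizer's map_text_to_token() and map_token_to_id() methods that appear in the traceback. This is only a rough sketch, reusing example_text from the reproduction code above, and it assumes the standard single-sequence BERT layout with the literal special tokens '[CLS]' and '[SEP]'; it is not the library's intended API for this.

import texar.tf as tx

tokenizer = tx.data.BERTTokenizer(pretrained_model_name='bert-base-uncased')
max_seq_length = 512

# Tokenize first, then truncate, leaving two positions for the special tokens.
tokens = tokenizer.map_text_to_token(example_text)
tokens = ['[CLS]'] + tokens[:max_seq_length - 2] + ['[SEP]']  # assumed BERT layout

input_ids = tokenizer.map_token_to_id(tokens)
segment_ids = [0] * len(input_ids)  # single sequence -> all segment 0
input_mask = [1] * len(input_ids)   # 1 marks real (non-padding) positions

# Zero-pad up to max_seq_length (a no-op here, since we truncated to the limit).
padding = [0] * (max_seq_length - len(input_ids))
input_ids += padding
segment_ids += padding
input_mask += padding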
gpengzhi commented 4 years ago

Thank you for your interest in Texar! Yes, I think it is a bug. We will fix it accordingly.
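Once encode_text() truncates as documented, a quick sanity check along these lines should pass for arbitrarily long inputs. This is only an illustrative expectation, assuming the fixed method caps (and possibly pads) all three outputs at max_seq_length:

input_ids, segment_ids, input_mask = tokenizer.encode_text(
    text_a=example_text, max_seq_length=512)
# After the fix, none of the outputs should exceed max_seq_length.
assert max(len(input_ids), len(segment_ids), len(input_mask)) <= 512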