Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
According to the documentation of `encode_text()`, the `BERTTokenizer` should automatically truncate sequences that are too long. However, the code below fails with an exception. Am I misunderstanding the documentation, or is this a bug?
(I'm using Texar version 0.2.4.)
Code
import texar.tf as tx
example_text = 'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, ' \
'sed diam nonumy eirmod tempor invidunt ut labore et dolore ' \
'magna aliquyam erat, sed diam voluptua.'
# Double the text repeatedly so it exceeds 512 tokens
for i in range(5):
    example_text = example_text + ' ' + example_text
tokenizer = tx.data.BERTTokenizer(pretrained_model_name='bert-base-uncased')
input_ids, segment_ids, input_mask = tokenizer.encode_text(text_a=example_text, max_seq_length=512)
print(input_ids)
Exception
2019-12-21 16:31:15.713648: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
Using cached pre-trained BERT checkpoint from C:\Users\Bastian\AppData\Roaming\texar_data\BERT\bert-base-uncased.
Traceback (most recent call last):
File "E:/Workspaces/PycharmProjects/Texar-Test/example.py", line 11, in <module>
input_ids, segment_ids, input_mask = tokenizer.encode_text(text_a=example_text, max_seq_length=512)
File "E:\Workspaces\PycharmProjects\Texar-Test\venv\lib\site-packages\texar\tf\data\tokenizers\bert_tokenizer.py", line 197, in encode_text
token_ids_a = self.map_text_to_id(text_a)
File "E:\Workspaces\PycharmProjects\Texar-Test\venv\lib\site-packages\texar\tf\data\tokenizers\tokenizer_base.py", line 410, in map_text_to_id
return self.map_token_to_id(self.map_text_to_token(text))
File "E:\Workspaces\PycharmProjects\Texar-Test\venv\lib\site-packages\texar\tf\data\tokenizers\tokenizer_base.py", line 386, in map_token_to_id
"errors".format(len(ids), self.max_len))
ValueError: Token indices sequence length is longer than the specified maximum sequence length for this model (1856 > 512). Running this sequence through the model will result in indexing errors
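Until this is resolved, truncating the token sequence manually before mapping it to ids sidesteps the length check. The sketch below only illustrates the idea: `map_text_to_token` and `map_token_to_id` are the real Texar methods visible in the traceback above, but here they are replaced by a stand-in whitespace tokenizer and a fake vocabulary so the snippet is self-contained; the two reserved positions for `[CLS]`/`[SEP]` are an assumption about what `encode_text()` does for a single segment.

```python
# Workaround sketch: truncate tokens *before* converting them to ids,
# so the tokenizer's max-length check is never violated.
# The two helpers below are placeholders for the corresponding
# BERTTokenizer methods (map_text_to_token / map_token_to_id).

MAX_SEQ_LENGTH = 512

def map_text_to_token(text):
    # Placeholder for tokenizer.map_text_to_token(text)
    return text.split()

def map_token_to_id(tokens):
    # Placeholder for tokenizer.map_token_to_id(tokens);
    # 30522 is the bert-base-uncased vocabulary size.
    return [hash(t) % 30522 for t in tokens]

def encode_truncated(text, max_seq_length):
    tokens = map_text_to_token(text)
    # Reserve two positions for the [CLS] and [SEP] special tokens
    # (assumed to mirror what encode_text() adds for a single segment).
    tokens = tokens[:max_seq_length - 2]
    return map_token_to_id(tokens)

long_text = 'word ' * 2000          # far longer than 512 tokens
ids = encode_truncated(long_text, MAX_SEQ_LENGTH)
print(len(ids))                      # at most 510
```

With the real tokenizer, the same pattern would be `tokenizer.map_token_to_id(tokenizer.map_text_to_token(text)[:510])`, then padding/segment ids built by hand.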