bjascob / amrlib

A python library that makes AMR parsing, generation and visualization simple.
MIT License

This tokenizer was incorrectly instantiated with a model max length of 512 #49

Closed bjascob closed 5 months ago

bjascob commented 2 years ago

After updating my system to transformers 4.19.4 (previously 4.16.2) with SentencePiece 0.1.96, I'm getting the following message:

FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.

This only happens for the T5Tokenizer, not BART, so it only impacts the generate_t5 and generate_t5wtense models.

To reproduce...

>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')

Note that this also happens with AutoTokenizer.from_pretrained('t5-base'), but not with bart-base.

It looks like this is an issue in the transformers code itself, since T5 is expected to have a max length of 512. Since I'm explicitly setting max_length=512 during tokenization, it shouldn't matter in practice. Hopefully the message will go away with later updates to the transformers library.

bjascob commented 5 months ago

Because T5 uses relative positional embeddings, it can support sequences longer than 512, although it was trained with a max length of 512. To allow this you need to pass max_train_graph_len=512 to the tokenizer when loading.

Since both the bart and t5 parsers are trained with max_train_sent_len=100 and max_train_graph_len=512, this only impacts inference and generation. With T5 being a rarely used model for parsing, this is a minor issue with limited impact.

If T5 becomes more widely used in the future, a change could be considered, but for now I'm closing this issue as "won't fix".