Closed by bjascob 5 months ago
Since T5 uses relative positional embeddings, it can support sequences longer than 512 tokens, although it was trained with a max length of 512. To allow this, you need to pass max_train_graph_len=512 to the tokenizer when loading.
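For illustration, here is roughly what overriding the maximum length looks like at the plain Hugging Face level. This is a minimal sketch, not amrlib's exact API: max_train_graph_len is amrlib's own parameter, while model_max_length is the underlying transformers tokenizer setting, and 1024 is just an arbitrary example value beyond the 512 training limit.

```python
from transformers import T5Tokenizer

# Override the tokenizer's default maximum length at load time.
# 1024 is an arbitrary example value beyond the 512 training limit.
tokenizer = T5Tokenizer.from_pretrained("t5-base", model_max_length=1024)
print(tokenizer.model_max_length)  # 1024
```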
Since both the bart and t5 parsers are trained with max_train_sent_len=100 and max_train_graph_len=512, this only impacts inference and generation. With T5 being a rarely used model for parsing, this is a minor issue and a fix would have limited utility.
If T5 becomes useful in the future, a change could be considered, but for now I'm closing the issue as "won't fix".
After updating my system to transformers 4.19.4 (previously 4.16.2) with sentencepiece 0.1.96, I'm getting the following message:

FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.

This only happens for the T5Tokenizer, not bart, so it only impacts the generate_t5 and generate_t5wtense models. To reproduce...
Note that this also happens with AutoTokenizer.from_pretrained('t5-base'), but not with bart-base.
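A minimal reproduction sketch along those lines (assuming transformers ~4.19; facebook/bart-base is the hub id for bart-base):

```python
from transformers import AutoTokenizer

# Emits: FutureWarning: This tokenizer was incorrectly instantiated with a
# model max length of 512 which will be corrected in Transformers v5.
AutoTokenizer.from_pretrained("t5-base")

# No such warning for the BART tokenizer.
AutoTokenizer.from_pretrained("facebook/bart-base")
```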
This looks like an issue in the transformers code itself, since T5 should have a max length of 512. Because I'm setting max_length=512 during tokenization, it shouldn't matter in practice. Hopefully the message will go away in a later update to the transformers lib.
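To show why the warning is harmless in this setup, here is a sketch of tokenizing with an explicit max_length, which makes the call independent of the tokenizer's reported model_max_length (the sample sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Explicit max_length + truncation: the 512 limit is enforced here,
# regardless of what model_max_length was set to at load time.
ids = tokenizer("The boy wants to go.", max_length=512, truncation=True)
```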