allenai/scibert

A BERT model for scientific text.
https://arxiv.org/abs/1903.10676
Apache License 2.0

max_len returns unexpected value #93

Open · JohnGiorgi opened this issue 4 years ago

JohnGiorgi commented 4 years ago

Hi,

I noticed something odd about the max_len attribute of the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
print(tokenizer.max_len)  # => 1000000000000000019884624838656

whereas I expected it to be 512, as with bert-base-uncased:

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.max_len)  # => 512

Is this a bug? Or is max_len not the right attribute to use if I want to know the maximum input length for the model?
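
For context, here is a minimal sketch of a possible workaround, under the assumption that the huge value (which equals int(1e30)) is the fallback default the library uses when a checkpoint's config does not declare a maximum length. The max_len=512 override below relies on the transformers 2.x tokenizer API that this snippet uses; newer releases renamed the attribute to model_max_length:

from transformers import AutoTokenizer

# Assumption: passing max_len explicitly overrides the library's
# very large fallback default (transformers 2.x API; newer releases
# use model_max_length instead).
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_uncased", max_len=512
)
print(tokenizer.max_len)  # => 512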