EmilyAlsentzer / clinicalBERT

repository for Publicly Available Clinical BERT Embeddings
MIT License

model_max_len parameter in tokenizer #38

Closed ishpiki closed 3 years ago

ishpiki commented 3 years ago

Hello,

The tokenizer has model_max_len=1000000000000000019884624838656: tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

PreTrainedTokenizerFast(name_or_path='emilyalsentzer/Bio_ClinicalBERT', vocab_size=28996, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

However, the model card at https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT mentions that the maximum sequence length is 128. Could you please clarify this?

Thanks!

EmilyAlsentzer commented 3 years ago

copying from our email exchange in case anyone else has this question:

We initialized our models with BERT, which has a max sequence length of 512, but used a max sequence length of 128 to train our models. My guess is that what you're seeing is a function of Huggingface's API changing since I uploaded the clinicalBERT model. I think the "model_max_len" parameter wasn't required when I first uploaded the model, and since it's missing, it's set to a very large integer. To use clinicalBERT, just make sure to specify the max length when you load the tokenizer. Hope this helps.
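For example, something like this should work (a minimal sketch; in recent versions of transformers the keyword is model_max_length, and you can use either 128, the length we trained with, or 512, BERT's architectural limit):

```python
from transformers import AutoTokenizer

# Load the tokenizer and cap the sequence length explicitly instead of
# relying on the very large default.
tokenizer = AutoTokenizer.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    model_max_length=128,
)

# With model_max_length set, truncation and padding use 128 tokens.
encoded = tokenizer(
    "Patient presents with shortness of breath.",
    truncation=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # 128
```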

MrSworder commented 2 years ago

> copying from our email exchange in case anyone else has this question:
>
> We initialized our models with BERT, which has a max sequence length of 512, but used a max sequence length of 128 to train our models. My guess is that what you're seeing is a function of Huggingface's API changing since I uploaded the clinicalBERT model. I think the "model_max_len" parameter wasn't required when I first uploaded the model, and since it's missing, it's set to a very large integer. To use clinicalBERT, just make sure to specify the max length when you load the tokenizer. Hope this helps.

I want to load my local tokenizer config (vocab_file, tokenizer_file, padding_side, and so on). How can I set the value of the model_max_len parameter? I define my tokenizer with ElectraTokenizerFast, which inherits from PreTrainedTokenizerFast in transformers.

EmilyAlsentzer commented 2 years ago

I'm not sure I fully understand your question, but model_max_len is a parameter you set when you instantiate a tokenizer. See https://huggingface.co/transformers/internal/tokenization_utils.html?highlight=model_max_len
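For instance, a minimal sketch of passing it directly when instantiating a fast tokenizer from local files (the file paths are placeholders for your own config; in recent transformers versions the keyword is model_max_length):

```python
from transformers import ElectraTokenizerFast

# Instantiate from local files; the paths below are placeholders.
tokenizer = ElectraTokenizerFast(
    vocab_file="path/to/vocab.txt",
    tokenizer_file="path/to/tokenizer.json",
    padding_side="right",
    model_max_length=512,  # set the maximum sequence length explicitly
)

print(tokenizer.model_max_length)  # 512
```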