Copying from our email exchange in case anyone else has this question:
We initialized our models with BERT, which has a max sequence length of 512, but used a max sequence length of 128 to train our models. My guess is that what you're seeing is a function of Hugging Face's API changing since I uploaded the clinicalBERT model. I think the "model_max_len" parameter wasn't required when I first uploaded the model, and since it's missing, it's set to a very large integer. To use clinicalBERT, just make sure to specify the max length when you load the tokenizer. Hope this helps.
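A minimal sketch of that fix, assuming a recent transformers release where the load-time keyword is `model_max_length` (the repr abbreviates it to `model_max_len`):

```python
from transformers import AutoTokenizer

# Pass the limit explicitly, since the uploaded model config does not include it.
tokenizer = AutoTokenizer.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    model_max_length=128,
)

print(tokenizer.model_max_length)  # 128
```

With the limit set, calling the tokenizer with `truncation=True` clips inputs to 128 tokens.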
I want to load my local tokenizer config (vocab_file, tokenizer_file, padding_side, and so on). How can I set the value of the model_max_len parameter? I define my tokenizer with ElectraTokenizerFast, which derives from PreTrainedTokenizerFast in transformers.
I'm not sure I fully understand your question, but model_max_len is a parameter you set when you instantiate a tokenizer. See https://huggingface.co/transformers/internal/tokenization_utils.html?highlight=model_max_len
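For the local-files case, a sketch along these lines should work; the file paths below are placeholders, and the constructor keyword is `model_max_length`:

```python
from transformers import ElectraTokenizerFast

# Placeholder paths -- point these at your local tokenizer files.
tokenizer = ElectraTokenizerFast(
    vocab_file="my_tokenizer/vocab.txt",
    tokenizer_file="my_tokenizer/tokenizer.json",
    model_max_length=512,   # set the limit here, next to the rest of the config
    padding_side="right",
)
```

The same keyword is also accepted by `from_pretrained`, e.g. `ElectraTokenizerFast.from_pretrained("my_tokenizer", model_max_length=512)`.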
Hello,
the tokenizer returned by `tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")` has `model_max_len=1000000000000000019884624838656`.
However, the model card at https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT mentions that the maximum sequence length is 128. Could you please clarify this?
Thanks!
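For reference, the large number above is reproducible with a plain load; it is the library's fallback for tokenizers whose config carries no explicit limit. A small check, assuming transformers 4.x where the constant lives in `tokenization_utils_base`:

```python
from transformers import AutoTokenizer
from transformers.tokenization_utils_base import VERY_LARGE_INTEGER

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

print(tokenizer.model_max_length)                        # 1000000000000000019884624838656
print(tokenizer.model_max_length == VERY_LARGE_INTEGER)  # True (the fallback is int(1e30))
```

As noted in the reply at the top of the thread, passing `model_max_length=128` at load time restores the intended limit.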