monologg opened this issue 4 years ago
Hey @monologg, can you try using allenai/scibert_scivocab_uncased? These two models actually have different vocabularies/weights, so it's not just a matter of different tokenizer settings.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
>>> tokenizer.basic_tokenizer.do_lower_case
True
>>> tokenizer.tokenize("Hello World")
['hell', '##o', 'world']
# Forcing uncased model tokenizer not to lowercase the sentence
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', do_lower_case=False)
>>> tokenizer.tokenize("Hello World")
['[UNK]', '[UNK]']
It seems that when no additional argument is given, or when the model is not listed in PRETRAINED_INIT_CONFIGURATION in tokenization_bert.py, BertTokenizer sets do_lower_case to True, which is the default value.
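The same default applies to the cased model, so the flag has to be passed manually. Continuing the session above (a sketch; the True value matches the behavior reported in this thread, since no tokenizer_config.json had been uploaded at the time):

>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
>>> tokenizer.basic_tokenizer.do_lower_case
True
# Passing the flag explicitly works around the default
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased', do_lower_case=False)
>>> tokenizer.basic_tokenizer.do_lower_case
False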
I ran into the same problem and it ruined six days of pre-training, because I wrongly assumed that do_lower_case=False would be set automatically, given that I am using the cased version of SciBERT...
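For anyone in the same spot, a quick sanity check before launching a long run (a sketch, assuming a cased vocabulary, where cased and lowercased input should tokenize differently):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased', do_lower_case=False)
# Lowercasing should now be disabled on the basic tokenizer
assert not tokenizer.basic_tokenizer.do_lower_case
# With a cased vocab, cased input should not collapse to its lowercased form
assert tokenizer.tokenize("Hello World") != tokenizer.tokenize("hello world")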
I've opened a PR on huggingface to solve this issue, please have a look: https://huggingface.co/allenai/scibert_scivocab_cased/discussions/3
Hi :)
I was using the scibert_scivocab_cased model from the Hugging Face library, and I've found out that AutoTokenizer can't set the do_lower_case option to False automatically. For AutoTokenizer or BertTokenizer to set do_lower_case=False automatically, it seems that a tokenizer_config.json file should also be uploaded to the model directory (reference: Transformers library issue). The file should be written as below.
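A minimal sketch of the file, assuming only the lowercasing flag needs to be recorded:

{
  "do_lower_case": false
}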
Alternatively, the method below (tokenizer.save_pretrained) will generate tokenizer_config.json (it also creates special_tokens_map.json); see the sketch below. Can you please check this issue? Thank you for sharing the model :)
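A minimal sketch of that approach, assuming a slow BertTokenizer and an illustrative output path:

from transformers import BertTokenizer

# Load the cased vocab with lowercasing explicitly disabled
tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_cased', do_lower_case=False)
# save_pretrained writes vocab.txt, tokenizer_config.json and special_tokens_map.json
# into the target directory ('./scibert_scivocab_cased' is illustrative)
tokenizer.save_pretrained('./scibert_scivocab_cased')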