dmis-lab / biobert

Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining
http://doi.org/10.1093/bioinformatics/btz682
Other
1.93k stars 451 forks source link

Issues with tokenizers #163

Open galsk87 opened 3 years ago

galsk87 commented 3 years ago

Hi Thanks for the code and the models they are very useful!

I have an issue with the tokenizers in the cased version, it actually uses an uncased tokenizer, is that a mistake or intentional?

Code snippet: tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-base-cased-v1.1') tokenizer.pretrained_init_configuration

{'bert-base-uncased': {'do_lower_case': True}, 'bert-large-uncased': {'do_lower_case': True}, 'bert-base-cased': {'do_lower_case': False}, 'bert-large-cased': {'do_lower_case': False}, 'bert-base-multilingual-uncased': {'do_lower_case': True}, 'bert-base-multilingual-cased': {'do_lower_case': False}, 'bert-base-chinese': {'do_lower_case': False}, 'bert-base-german-cased': {'do_lower_case': False}, 'bert-large-uncased-whole-word-masking': {'do_lower_case': True}, 'bert-large-cased-whole-word-masking': {'do_lower_case': False}, 'bert-large-uncased-whole-word-masking-finetuned-squad': {'do_lower_case': True}, 'bert-large-cased-whole-word-masking-finetuned-squad': {'do_lower_case': False}, 'bert-base-cased-finetuned-mrpc': {'do_lower_case': False}, 'bert-base-german-dbmdz-cased': {'do_lower_case': False}, 'bert-base-german-dbmdz-uncased': {'do_lower_case': True}, 'TurkuNLP/bert-base-finnish-cased-v1': {'do_lower_case': False}, 'TurkuNLP/bert-base-finnish-uncased-v1': {'do_lower_case': True}, 'wietsedv/bert-base-dutch-cased': {'do_lower_case': False}}