EmilyAlsentzer / clinicalBERT

Repository for Publicly Available Clinical BERT Embeddings
MIT License

Clarification on Tokenizer for MedNLI #8

Closed: pruksmhc closed this issue 5 years ago

pruksmhc commented 5 years ago

For MedNLI, it seems you used `tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)`. Is it correct to say that the BERT tokenizer you used for MedNLI is `bert-base-cased`, as opposed to scispacy? If so, what is the thinking behind this?
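
For context, a minimal sketch of what that call amounts to with a cased checkpoint (the model name and example sentence here are illustrative, not taken from the repo's scripts, which pass these values via `args`):

```python
from transformers import BertTokenizer

# Load a cased BERT tokenizer; do_lower_case=False preserves casing so the
# input matches the cased vocab.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)

# Standard BERT preprocessing: text normalization, punctuation splitting,
# then WordPiece sub-tokenization against the pretrained vocab.
tokens = tokenizer.tokenize("Patient denies chest pain or shortness of breath.")
print(tokens)
```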

pruksmhc commented 5 years ago

Bump on this! @EmilyAlsentzer

EmilyAlsentzer commented 5 years ago

We used scispacy for sentence splitting during preprocessing (to format the text for pretraining on MIMIC data), but did not use scispacy's tokenizer for downstream tasks. We wanted to use the same tokenizer that BERT-base used (one that does text normalization, punctuation splitting, and word piece tokenization) to be consistent with the BERT vocab and pretrained models.
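
A minimal sketch of that split of responsibilities (the scispacy model name `en_core_sci_sm` and the example note are assumptions for illustration; the actual preprocessing scripts may use a different pipeline):

```python
import spacy                      # scispacy models load through spaCy
from transformers import BertTokenizer

# 1) scispacy: sentence splitting only, used when formatting MIMIC notes for pretraining.
nlp = spacy.load("en_core_sci_sm")
note = "Pt admitted with CHF exacerbation. Denies chest pain. Started on furosemide."
sentences = [sent.text for sent in nlp(note).sents]

# 2) BERT's own tokenizer: normalization, punctuation splitting, and WordPiece,
#    so downstream inputs stay consistent with the BERT vocab and pretrained weights.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
for sent in sentences:
    print(tokenizer.tokenize(sent))
```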