Bump on this! @EmilyAlsentzer
We used scispacy for sentence splitting during preprocessing (to format the text for pretraining on MIMIC data), but did not use scispacy's tokenizer for downstream tasks. We wanted to use the same tokenizer that BERT-base used (one that does text normalization, punctuation splitting, and word piece tokenization) to be consistent with the BERT vocab and pretrained models.
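To make that split of responsibilities concrete, here is a minimal sketch of the flow described above: scispacy only splits sentences, and BERT's own tokenizer handles normalization, punctuation splitting, and WordPiece. The specific model name (`en_core_sci_sm`), the `transformers` package, and the example checkpoint are assumptions for illustration, not the exact setup used for pretraining.

```python
# Illustrative sketch only -- the scispacy model (en_core_sci_sm) and the
# Hugging Face transformers package are assumptions, not the exact setup used.
import spacy                      # requires: pip install scispacy en_core_sci_sm
from transformers import BertTokenizer

note = "Pt presents w/ SOB. CXR shows bilateral infiltrates. Started on abx."

# 1) scispacy: sentence splitting only, used when formatting MIMIC notes
#    into the one-sentence-per-line input expected for BERT pretraining.
nlp = spacy.load("en_core_sci_sm")
sentences = [sent.text for sent in nlp(note).sents]

# 2) BERT's own tokenizer: normalization, punctuation splitting, and
#    WordPiece tokenization, so tokens stay consistent with the BERT vocab.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
for sent in sentences:
    print(tokenizer.tokenize(sent))
```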
For MedNLI, it seems you used `tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)`. Is it correct that the BERT tokenizer you used for MedNLI was bert-base-cased rather than scispacy? If so, what was the reasoning behind that choice?
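For reference, a quick way to see what those arguments change, with hypothetical values for `args.bert_model` and `args.do_lower_case` (not a claim about the checkpoint actually used for MedNLI):

```python
from transformers import BertTokenizer

# Hypothetical values for args.bert_model / args.do_lower_case, just to show
# what the flag changes; not a claim about the checkpoint actually used.
cased = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
uncased = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

text = "Patient denies CP and SOB."
print(cased.tokenize(text))    # case preserved, matched against the cased vocab
print(uncased.tokenize(text))  # lowercased before WordPiece splitting
```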