Closed sharon-gao closed 3 years ago
I am not the originator of this code, but I figured this out yesterday by looking into finbert/finbert.py.
The tokenizer is instantiated in FinBert.prepare_model as
self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=self.config.do_lower_case)
and the init function of the Config class uses do_lower_case=True as its default, which lowercases the input text during pre-processing, as required for an uncased model.
So FinBERT uses bert-base-uncased as the tokenizer/vocabulary and lowercases the input, which means it cannot tell the difference between lower and upper case.
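To illustrate the consequence of do_lower_case=True, here is a toy sketch (not the real WordPiece tokenizer from transformers, which also does subword splitting): differently cased inputs collapse to the same token IDs before the model ever sees them, so casing information is lost.

```python
# Toy stand-in for an uncased BERT tokenizer, illustrating only the
# casing behaviour controlled by do_lower_case (the real BertTokenizer
# additionally performs WordPiece subword splitting).
class ToyUncasedTokenizer:
    def __init__(self, vocab, do_lower_case=True):
        self.vocab = vocab              # token -> id; lowercase entries only
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        if self.do_lower_case:
            text = text.lower()         # uncased models lowercase first
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # unknown tokens map to the [UNK] id, as in BERT vocabularies
        return [self.vocab.get(t, self.vocab["[UNK]"]) for t in tokens]


vocab = {"[UNK]": 0, "apple": 1, "rose": 2, "sharply": 3}
tok = ToyUncasedTokenizer(vocab, do_lower_case=True)

ids_upper = tok.convert_tokens_to_ids(tok.tokenize("Apple rose SHARPLY"))
ids_lower = tok.convert_tokens_to_ids(tok.tokenize("apple rose sharply"))
assert ids_upper == ids_lower == [1, 2, 3]  # casing distinction is gone
```

With do_lower_case=False and a cased vocabulary, "Apple" and "apple" would receive different IDs, which is exactly the distinction the uncased setup discards.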
Hi! Thanks for developing and sharing the code.
I wonder which vanilla BERT model you used for post-training on financial-domain text.
To be specific, I wonder whether this FinBERT model can tell the difference between uppercase and lowercase.