ProsusAI / finBERT

Financial Sentiment Analysis with BERT
Apache License 2.0

Is finBert cased or uncased? #23

Closed sharon-gao closed 3 years ago

sharon-gao commented 4 years ago

Hi! Thanks for developing and sharing the code.

I wonder which vanilla BERT model you used for post-training on financial-domain text.

To be specific, I wonder whether this FinBERT model can tell the difference between uppercase and lowercase.

GillesJ commented 4 years ago

I am not the originator of this code, but I figured this out yesterday by looking into finbert/finbert.py.

The tokenizer is instantiated in FinBert.prepare_model as self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=self.config.do_lower_case). The init function of the Config class sets do_lower_case=True by default, which lowercases the input text during pre-processing, as an uncased model requires.

So it uses bert-base-uncased as the tokenizer/vocabulary and lowercases the input, which means it cannot tell the difference between lowercase and uppercase.
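
To make the consequence concrete, here is a minimal sketch of the kind of normalization an uncased BERT tokenizer applies before WordPiece splitting: lowercasing, then stripping accent marks. It is an illustration only. It does not load FinBERT or call the transformers library, and the function name uncased_preprocess is hypothetical, not something from this repository.

```python
import unicodedata


def uncased_preprocess(text: str) -> str:
    """Hypothetical stand-in (not the library's code) for the text
    normalization done by an uncased BERT tokenizer before WordPiece."""
    # Step 1: lowercase everything, so "Apple" and "apple" become identical.
    lowered = text.lower()
    # Step 2: decompose characters so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Step 3: drop combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    upper = uncased_preprocess("Apple shares ROSE sharply")
    lower = uncased_preprocess("apple shares rose sharply")
    # Both inputs collapse to the same string, so any case signal is lost
    # before the model sees the tokens.
    print(upper == lower)
```

Because both strings are identical after this step, a model fed by such a tokenizer receives the same token sequence for "Apple" and "apple" and has no way to distinguish them.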