abhijeet3922 / finbert_embedding

Token and sentence level embeddings from FinBERT model (Finance Domain)
MIT License

Sentence embeddings and [CLS] token? #3

Open HartBlanc opened 4 years ago

HartBlanc commented 4 years ago

The current implementation of `sentence_vector` returns the mean of the token vectors as the sentence vector; however, users will likely need sentence embeddings for downstream classification tasks.

In both the FinBERT paper and the BERT paper, the authors suggest that downstream classification tasks should use the embedding of the [CLS] token as the input to the classifier.

Is there a specific use case in which the mean of the token embeddings would be preferred over the [CLS] token? In any case, a method that returns the embedding of the [CLS] token would likely be a useful addition to the library.
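For reference, the two candidate representations differ only in how the final-layer hidden states are pooled. A minimal sketch with plain numpy (the helper names are illustrative, not this library's API; BERT places the [CLS] token at position 0):

```python
import numpy as np

def mean_pooled_sentence_vector(hidden_states: np.ndarray) -> np.ndarray:
    """Mean of all token vectors -- what sentence_vector currently returns."""
    return hidden_states.mean(axis=0)

def cls_sentence_vector(hidden_states: np.ndarray) -> np.ndarray:
    """Embedding of the [CLS] token, which sits at position 0 in BERT."""
    return hidden_states[0]

# toy stand-in for last-layer hidden states:
# 6 tokens ([CLS] tok1 ... tok4 [SEP]), hidden size 768
hidden = np.random.default_rng(0).normal(size=(6, 768))
assert mean_pooled_sentence_vector(hidden).shape == (768,)
assert cls_sentence_vector(hidden).shape == (768,)
```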

abhijeet3922 commented 4 years ago

Yes, I agree. The consensus has been to use the [CLS] token embedding for downstream classification tasks. But I would say that if you want to train a classifier, you should fine-tune the language model (BERT, or FinBERT here) rather than extracting embeddings and training a separate classifier on top. It really depends on the use case. Check the sample notebooks here for fine-tuning language models for classification.
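For contrast, the "embedding + classifier" route being advised against can be sketched as a logistic-regression head trained on frozen sentence embeddings. Everything below is synthetic and purely illustrative (random vectors stand in for real FinBERT embeddings):

```python
import numpy as np

# Purely illustrative: rows stand in for frozen 768-dim FinBERT
# sentence embeddings; the labels are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 768))
y = (X[:, 0] > 0).astype(float)

# A logistic-regression head on top of the frozen embeddings
# (the model itself is never updated, unlike in fine-tuning).
w, b, lr = np.zeros(768), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient of the log loss
    b -= lr * (p - y).mean()

train_acc = float(((p > 0.5) == (y > 0.5)).mean())
```

Fine-tuning instead updates all of the model's weights end to end, which is why it usually wins on task performance.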

Also, embeddings have other use cases: clustering, similarity matching, finding synonyms, word sense disambiguation, etc.
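Similarity matching with these embeddings typically comes down to cosine similarity between sentence vectors. A short sketch (the vectors here are random stand-ins for real FinBERT embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# stand-ins for two 768-dim sentence embeddings
rng = np.random.default_rng(42)
v1 = rng.normal(size=768)
v2 = v1 + 0.1 * rng.normal(size=768)  # a slightly perturbed copy

print(cosine_similarity(v1, v2))   # close to 1.0: very similar sentences
print(cosine_similarity(v1, -v1))  # close to -1.0: opposite direction
```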

HartBlanc commented 4 years ago

Thanks for the explanation and for recommending the fine-tuning resource, I'll check it out! However, I thought it was interesting that the FinBERT paper found that fine-tuning on a domain/task-specific corpus did not add much improvement in performance. For my tasks I think I will start with just the [CLS] token embedding and see whether it is sufficiently powerful.

The different use cases definitely could be compelling reasons to keep the mean word vector; I hadn't considered them. I'm not really clear on how to interpret the [CLS] token embedding, and therefore on how to compare the mean of the word embeddings against the [CLS] token embedding for different tasks. My intuition is that the [CLS] embedding would be a better syntactic representation of the sentence than a simple mean, because it is produced via training. But that may well not hold for some of the use cases you mentioned, particularly those where lexical-level semantics matter more than sentence-level semantics.