abhijeet3922 / finbert_embedding

Token and sentence level embeddings from FinBERT model (Finance Domain)
MIT License

FinBERT 1.1.4 and semantical closeness #5

Open redskate opened 3 years ago

redskate commented 3 years ago

Dear all

I am quite experienced in language processing and computer science, but quite a newbie when it comes to using embeddings. I discovered with great astonishment FinBERT embedding as a way to compute the semantic closeness of financial words. So I downloaded and installed finbert-embedding 1.1.4 (the version I could install in Python 3); it correctly instantiates, tokenizes and computes embeddings for English natural sentences (thanks).
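For reference, here is roughly what I run, following the example on the PyPI page (the method names `word_vector` / `sentence_vector` are taken from there; the exact return types are my assumption):

```python
from finbert_embedding.embedding import FinbertEmbedding

finbert = FinbertEmbedding()

text = ("ConocoPhillips in its Annual report disclosed the selling price "
        "for its 50% stake in Polar Lights of US$98m.")

# One 768-dimensional vector per token of the sentence.
word_embeddings = finbert.word_vector(text)

# A single 768-dimensional vector for the whole sentence.
sentence_embedding = finbert.sentence_vector(text)

print(len(word_embeddings))      # number of tokens
print(len(word_embeddings[0]))   # 768 entries per token embedding
```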

One word embedding is a vector of 768 floats which, as I understand it, correspond to the closeness (0 to 1) of that word in FinBERT to "some" features which were pretrained/precalculated.

My first question is: how exactly can I get at those 768 features, in order to read/process them?

The second and last question is whether the semantic closeness reflected by a FinBERT embedding is useful in some way; please see below.

Of course I will give some more data here to make clear what I mean. What I do in Python is the following:

First I consider two fixed sentences, a "sell sentence" and a "buy sentence", like this:

sell-sentence: "sell a product"
buy-sentence: "buy a product"

I tokenize and store the embeddings for the token "sell" in the sell-sentence and for the token "buy" in the buy-sentence. In this experiment these serve as my reference embeddings: a "sell-embedding" and a "buy-embedding".
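In code this looks roughly as follows (a minimal sketch; I assume that `word_vector()` returns one vector per token in sentence order without prepended special tokens, so "sell" and "buy" sit at index 0; if a [CLS] token is included, the index shifts by one):

```python
from finbert_embedding.embedding import FinbertEmbedding

finbert = FinbertEmbedding()

sell_sentence = "sell a product"
buy_sentence = "buy a product"

# One 768-dimensional vector per token of each reference sentence.
sell_vectors = finbert.word_vector(sell_sentence)
buy_vectors = finbert.word_vector(buy_sentence)

# "sell" / "buy" are the first tokens of their sentences -> index 0
# (assumption: no special tokens are prepended, see note above).
sell_emb = sell_vectors[0]
buy_emb = buy_vectors[0]
```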

Finally I take an English financial sentence like "ConocoPhillips in its Annual report disclosed the selling price for its 50% stake in Polar Lights of US$98m.", tokenize it as explained e.g. at https://pypi.org/project/finbert-embedding/, then for each tokenized word I calculate its embedding (word_emb), the cosine closeness between the sell-embedding and that word_emb, and the same again using the buy-embedding.

More precisely, I calculate the "distance" between a given word and both reference words with the simple formulas distance_sell = 1 - cosine(word_emb, sell_emb) and distance_buy = 1 - cosine(word_emb, buy_emb), where cosine() is the cosine distance, so a higher value means the word is closer to the reference.
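As a sketch (I use scipy's `cosine()` here, which is a cosine distance, so 1 - cosine(...) is the usual cosine similarity; pairing my whitespace tokens with the returned vectors one-to-one is a simplification on my side, since the model really works on WordPiece tokens):

```python
from scipy.spatial.distance import cosine
from finbert_embedding.embedding import FinbertEmbedding

finbert = FinbertEmbedding()

# Reference embeddings as above ("sell" / "buy" are the first tokens).
sell_emb = finbert.word_vector("sell a product")[0]
buy_emb = finbert.word_vector("buy a product")[0]

sentence = ("ConocoPhillips in its Annual report disclosed the selling price "
            "for its 50% stake in Polar Lights of US$98m.")
word_embeddings = finbert.word_vector(sentence)

# Naive token list used only to label the output (simplification:
# the model's own WordPiece tokenisation may differ).
tokens = sentence.lower().replace(".", "").split()

ranking_sell, ranking_buy = [], []
for token, word_emb in zip(tokens, word_embeddings):
    # 1 - cosine distance = cosine similarity: higher means closer.
    ranking_sell.append((1 - cosine(word_emb, sell_emb), token))
    ranking_buy.append((1 - cosine(word_emb, buy_emb), token))

print("Words ranking relative to 'sell':", sorted(ranking_sell))
print("Words ranking relative to 'buy':", sorted(ranking_buy))
```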

This should reflect my intention to calculate which words are closer to sell and which closer to buy.

What comes out are the following ordered distance values divided into two blocks, one for the sell-case and the other for the buy-case:

Words ranking relative to 'sell':

0.2604564428329468: '98m'
0.26267170906066895: 'stake'
0.2663293778896332: 'annual'
0.2762104272842407: 'its'
0.27889302372932434: 'lights'
0.2883516252040863: 'its'
0.2906721830368042: 'polar'
0.31022322177886963: 'the'
0.3269979953765869: 'report'
0.3717109262943268: 'for'
0.388060599565506: 'price'
0.4035351574420929: 'disclosed'
0.4900970458984375: 'selling'

Words ranking relative to 'buy':

0.25901156663894653: 'annual'
0.2632734477519989: 'its'
0.26385512948036194: 'polar'
0.2715086340904236: 'report'
0.27164217829704285: 'stake'
0.2741144001483917: 'its'
0.275168776512146: 'the'
0.277095228433609: '98m'
0.2864808440208435: 'lights'
0.332526296377182: 'price'
0.35841628909111023: 'disclosed'
0.3584253787994385: 'for'
0.4045600891113281: 'selling'

What surprises me are the following observations:

a) the distance of a pronoun (e.g. "its") varies depending on the position of its token in the sentence [buy block]
b) the distance of the article "the" is 0.275 while the distance of the word "selling" is just 0.404 [buy block]
c) the distance of "selling" is just 0.490 instead of some higher value, e.g. 0.999 [sell block]
d) "selling" is the word closest to "sell" but is also the word closest to "buy" [both blocks]

Considering that in a financial context "sell" is quite the opposite of "buy", having in common only an abstract class "transaction", the outcome is really weird to me.

So these outcomes tell me that I cannot rely on an off-the-shelf FinBERT embedding used this way. So I keep asking myself (and you): for which purposes should such an embedding be used?