generateme / fastmath

Fast primitive-based math library
MIT License

consider adding soft-cosine distance #21

Open behrica opened 2 years ago

behrica commented 2 years ago

Useful for comparing TF-IDF text representations, instead of using plain cosine.

https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure

The similarity function s_ij should be pluggable (passed as an input to the function).
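For reference, the measure from that page, where s_ij is the similarity of features i and j:

soft_cosine(a, b) = sum_ij(s_ij * a_i * b_j) / (sqrt(sum_ij(s_ij * a_i * a_j)) * sqrt(sum_ij(s_ij * b_i * b_j)))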

genmeblog commented 2 years ago

Thanks for the idea! I think I need your support here. I understand the definition, but I have no idea how to build a convenient API for it. A set of examples would be helpful.

behrica commented 2 years ago

I think it should simply allow plugging in any (distance) function which takes 2 values and returns a float.

(soft-cosine [1 2 3 4] [2 3 5 6] (fn [x y] ...)) ;; do something to calculate the distance of x and y
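Roughly like this sketch (hypothetical, not an existing fastmath function; it follows the Wikipedia formula with an implicit weight of 1 per element, so sim is applied to the elements themselves):

(defn soft-cosine
  "Soft cosine of xs and ys; sim is the pluggable pairwise similarity."
  [xs ys sim]
  (let [cross (fn [us vs]
                ;; sum of sim over all element pairs (weights assumed to be 1)
                (reduce + (for [u us, v vs] (sim u v))))]
    (/ (cross xs ys)
       (* (Math/sqrt (cross xs xs))
          (Math/sqrt (cross ys ys))))))

For the numeric example above, the (fn [x y] ...) would then be called for every pair of elements of the two vectors.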

A concrete case comes from NLP:

A language-aware function:

(defn word-dist [token-1 token-2] ...) with this spec:

(word-dist   "I"  "I") = 1
(word-dist   "like"  "like") = 1
(word-dist   "I"  "like") = 
(word-dist   "fruits"  "banana") = 0.5

(soft-cosine ["I" "like" "fruits"] ["I" "like" "banana"] word-dist) = ... > 0.6 (not sure about the concrete number)

It would compare:

"I" -> "I" = 1
"like" -> "like" = 1
"fruits" -> "banana" = 0.5

In practice we would map all tokens to numbers first (this creates the vocabulary), so soft-cosine would be called with vectors of ints (if token frequency is used) or floats (if TF-IDF is used).
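A hypothetical sketch of that mapping step (vocab and counts are illustrative names, not existing API):

;; token -> index is given by position in the vocabulary
(def vocab ["I" "like" "fruits" "banana"])

;; count vector of a token sequence over the vocabulary
(defn counts [tokens]
  (mapv (fn [t] (count (filter #{t} tokens))) vocab))

(counts ["I" "like" "fruits"]) ;; => [1 1 1 0]
(counts ["I" "like" "banana"]) ;; => [1 1 0 1]

A weighted variant of the soft-cosine sketch above would then take these count (or TF-IDF) vectors together with a similarity over vocabulary indices, e.g. (fn [i j] (word-dist (vocab i) (vocab j))).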

behrica commented 2 years ago

I found an old Java implementation which combines TF-IDF and soft-cosine here: https://github.com/TeamCohen/secondstring/blob/master/src/com/wcohen/ss/SoftTFIDF.java

I would prefer to have this separated.

The TF-IDF part we already have here: https://github.com/scicloj/scicloj.ml.smile/blob/main/src/scicloj/ml/smile/nlp.clj#L285

This gives me the two vectors above that I want to get the distance for.

The "classical" way is to use simple cosine distance, but this is then not able to deal with "similarity of tokens". The only way to do hat would be to "normalize" the vocabulary before, and somehow say that "fruits" and "banana" is the same thing, and remove one. But his is a too strict normalisation.

Soft cosine should be better, hopefully.

behrica commented 2 years ago

Another example would be to plug in text embeddings (word2vec). They can also calculate a semantic distance between any 2 words.

There is a Java implementation here, so I would plug in this concrete function: https://javadoc.io/static/org.deeplearning4j/deeplearning4j-nlp/1.0.0-M2.1/org/deeplearning4j/models/embeddings/wordvectors/WordVectors.html#similarity(java.lang.String,java.lang.String)

(just doing the token<->index vocabulary mapping first)
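Hypothetical glue for that, assuming w2v is an already-loaded WordVectors instance (model loading not shown):

(import '(org.deeplearning4j.models.embeddings.wordvectors WordVectors))

;; wrap the dl4j similarity call as the pluggable sim function
(defn w2v-sim [^WordVectors w2v]
  (fn [token-1 token-2]
    (.similarity w2v token-1 token-2)))

;; then, with the soft-cosine sketch from above:
(soft-cosine ["I" "like" "fruits"] ["I" "like" "banana"] (w2v-sim w2v))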