behrica opened this issue 2 years ago
Thanks for the idea! I think I need your support here. I understand the definition, but I have no idea how to build a convenient API for that. A set of examples would be helpful.
I think it should simply allow plugging in any (distance) function which takes 2 values and returns a float.

```clojure
(soft-cosine [1 2 3 4] [2 3 5 6]
             (fn [x y] ...)) ;; do something to calculate the distance of x and y
```
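To make the shape concrete, here is a minimal sketch of what such a function could look like (the name and API are just a proposal, not a final design). It reads the plugged-in function as a pairwise similarity applied to all element pairs, each element carrying weight 1, following the soft cosine formula from the Wikipedia article linked at the bottom:

```clojure
;; Minimal sketch: soft cosine with a pluggable pairwise similarity.
;; `sim` takes two elements and returns a double (1.0 for identical ones).
(defn soft-cosine
  [xs ys sim]
  (let [dot (fn [as bs]
              ;; sum of sim over all element pairs, each with weight 1
              (reduce + (for [a as, b bs] (sim a b))))]
    (/ (dot xs ys)
       (* (Math/sqrt (dot xs xs))
          (Math/sqrt (dot ys ys))))))
```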
A concrete case comes from NLP: a language-aware function

```clojure
(defn word-dist [token-1 token-2] ...)
```

with this spec:

```clojure
(word-dist "I" "I")           ;;=> 1
(word-dist "like" "like")     ;;=> 1
(word-dist "I" "like")        ;;=> a low value
(word-dist "fruits" "banana") ;;=> 0.5
```
so that

```clojure
(soft-cosine ["I" "like" "fruits"] ["I" "like" "banana"] word-dist)
;;=> .... > 0.6 (not sure about the concrete number)
```

It would compare "I" -> "I" = 1, "like" -> "like" = 1 and "fruits" -> "banana" = 0.5.
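Plugging a toy word-dist that hard-codes this spec into the sketch above (the 0.5 pairing is purely for illustration) confirms the ballpark:

```clojure
;; Toy similarity, hard-coded to the spec above (illustration only).
(defn word-dist [t1 t2]
  (cond
    (= t1 t2)                         1.0
    (= #{t1 t2} #{"fruits" "banana"}) 0.5
    :else                             0.0))

(soft-cosine ["I" "like" "fruits"] ["I" "like" "banana"] word-dist)
;;=> 2.5 / (sqrt(3) * sqrt(3)) = ~0.83
```

So indeed > 0.6 with this toy similarity.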
In practice we would map all tokens to numbers first (this builds the vocabulary), so soft-cosine would be called with vectors of ints (if token frequency is used) or floats (if tfidf is used).
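In that representation the similarity applies to vocabulary entries rather than to the weights themselves, so a weighted variant would look up tokens by index. A sketch, assuming a `vocab` vector mapping index -> token (names hypothetical):

```clojure
;; Weighted variant: `a` and `b` are term-frequency or tfidf vectors
;; over a shared vocabulary; `vocab` maps index -> token; `sim`
;; compares two tokens, i.e. s_ij = (sim (vocab i) (vocab j)).
(defn soft-cosine-weighted
  [a b vocab sim]
  (let [idxs (range (count vocab))
        dot  (fn [x y]
               (reduce + (for [i idxs, j idxs]
                           (* (sim (vocab i) (vocab j)) (x i) (y j)))))]
    (/ (dot a b)
       (* (Math/sqrt (dot a a))
          (Math/sqrt (dot b b))))))

;; e.g. with counts over the vocabulary ["I" "like" "fruits" "banana"]:
(soft-cosine-weighted [1 1 1 0] [1 1 0 1]
                      ["I" "like" "fruits" "banana"] word-dist)
;;=> ~0.83, same as the token-sequence version above
```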
I found an old Java implementation which combines tfidf and soft cosine here: https://github.com/TeamCohen/secondstring/blob/master/src/com/wcohen/ss/SoftTFIDF.java
I would prefer to have these two concerns separated.
The tfidf part we already have here: https://github.com/scicloj/scicloj.ml.smile/blob/main/src/scicloj/ml/smile/nlp.clj#L285
This gives me the 2 vectors above that I want to get the distance for.
The "classical" way is to use simple cosine distance, but that is not able to deal with "similarity of tokens". The only way to do that would be to "normalize" the vocabulary beforehand, somehow declaring that "fruits" and "banana" are the same thing and removing one of them. But this is a too strict normalisation. Soft cosine should hopefully do better.
Another example would be to plug in text embeddings (word2vec), which can likewise calculate a semantic distance between any 2 words.
There is a Java implementation here, so I would plug in this concrete function: https://javadoc.io/static/org.deeplearning4j/deeplearning4j-nlp/1.0.0-M2.1/org/deeplearning4j/models/embeddings/wordvectors/WordVectors.html#similarity(java.lang.String,java.lang.String)
(just doing the token<->index mapping against the vocabulary first)
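A sketch of that plug-in, assuming `wv` is an already-loaded WordVectors instance (the similarity method is the one from the javadoc above):

```clojure
;; Wrap a deeplearning4j word2vec model as the pluggable similarity.
;; `wv` is assumed to be a loaded
;; org.deeplearning4j.models.embeddings.wordvectors.WordVectors instance.
(defn embedding-sim [wv]
  (fn [t1 t2]
    (if (= t1 t2)
      1.0
      (.similarity wv t1 t2))))

(soft-cosine ["I" "like" "fruits"] ["I" "like" "banana"]
             (embedding-sim wv))
```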
Useful for comparing TFIDF text representations, instead of using plain cosine:
https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure
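For reference, the soft cosine measure from that article is

$$\text{soft-cosine}(a,b) = \frac{\sum_{i,j}^{N} s_{ij}\, a_i b_j}{\sqrt{\sum_{i,j}^{N} s_{ij}\, a_i a_j}\; \sqrt{\sum_{i,j}^{N} s_{ij}\, b_i b_j}}$$

where $s_{ij}$ is the similarity between features $i$ and $j$. With $s_{ii} = 1$ and $s_{ij} = 0$ for $i \neq j$ it reduces to the ordinary cosine similarity.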
The similarity function s_ij should be pluggable (as an input to the function).