JuliaText / TextAnalysis.jl

Julia package for text analysis

Implementation of cosine similarity? #215

Closed hhaensel closed 1 year ago

hhaensel commented 4 years ago

I needed to calculate cosine similarity. My first attempt was a bare implementation of the formula from the Wikipedia article, but it was not as fast as I wanted (approx. 60 s). I eventually found a way to improve the speed by three orders of magnitude by using a matrix formulation. If I did my maths correctly, the following function does the job:

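using LinearAlgebra  # provides diag

# Rows of tfidf are documents, columns are terms.
function cos_similarity(tfidf::AbstractMatrix)
    # dot products between all pairs of document vectors
    cs = tfidf * tfidf'
    # the diagonal holds the squared norms of the document vectors
    d = sqrt.(diag(cs))
    # prevent division by zero (only occurs for empty documents)
    d[findall(iszero, d)] .= 1
    # divide each entry (i, j) by norm(i) * norm(j)
    cs .= cs ./ (d * d')
end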
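As a quick sanity check, the matrix form can be compared with a naive pairwise loop on a small matrix. This is only a sketch: `cos_similarity_naive` and the toy `tfidf` values are made up for illustration and are not part of TextAnalysis.jl, and the matrix is assumed to hold one document per row.

```julia
using LinearAlgebra

# Matrix form, same as the function above (rows = documents, columns = terms).
function cos_similarity(tfidf::AbstractMatrix)
    cs = tfidf * tfidf'
    d = sqrt.(diag(cs))
    d[findall(iszero, d)] .= 1
    cs .= cs ./ (d * d')
end

# Naive pairwise reference, written only for comparison (hypothetical helper).
function cos_similarity_naive(tfidf::AbstractMatrix)
    n = size(tfidf, 1)
    out = zeros(n, n)
    for i in 1:n, j in 1:n
        a, b = tfidf[i, :], tfidf[j, :]
        na, nb = norm(a), norm(b)
        # define the similarity with an empty document as zero
        out[i, j] = (na == 0 || nb == 0) ? 0.0 : dot(a, b) / (na * nb)
    end
    return out
end

# Toy weights (assumed values): 4 documents x 3 terms; document 3 is empty.
tfidf = [1.0 0.0 2.0;
         0.0 3.0 0.0;
         0.0 0.0 0.0;
         2.0 0.0 4.0]

sim = cos_similarity(tfidf)
ref = cos_similarity_naive(tfidf)
println(sim ≈ ref)
```

Documents 1 and 4 point in the same direction, so their similarity should come out as 1 (up to rounding), while the empty document gets a similarity of 0 with everything, including itself. Whether 0 is the right convention for empty documents is a design choice that a PR would probably want to document.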

If others would find this useful, I'd be happy to submit a PR.

aviks commented 3 years ago

Will be very useful, please submit a PR, preferably with some tests and docs.

hhaensel commented 3 years ago

@aviks Where's the best place to put it: utils.jl, tf_idf.jl, or a new file similarity.jl?

aviks commented 3 years ago

tf_idf.jl would be best, I think.