chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Feature Request: Support Cosine Similarity / Distance #16311

Open buddha314 opened 4 years ago

buddha314 commented 4 years ago

Scipy offers cosine similarity / distance on matrices. This would be useful in both dense and sparse matrices.

bradcray commented 4 years ago

As I understand it, this would be a feature request for the LinearAlgebra module specifically, right @buddha314? Tagging @ben-albrecht on that basis.

buddha314 commented 4 years ago

I'm sure that will work, though I could imagine it being held in a kind of statistical tools module as well. I'm open!

damianmoz commented 4 years ago

Is this between two 2D-arrays or two 1D-arrays? How accurate do you want it to be?

buddha314 commented 4 years ago

As to dimension: if A is a matrix and x is a vector, I often compute d(A, x), which gives a 1D result, in my calculations. I don't know whether scipy does this as a single matrix computation or as a loop over the rows of A. In Chapel, I have written it as a distance between 1D arrays and iterated over the rows.

As to accuracy, I'm guessing you mean the type of the variable. If there is a distinction beyond "float", I don't feel qualified to make it.
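
Roughly, that row-by-row pattern might look something like the following naive sketch (the name and signature are just illustrative; it uses the plain sum-of-products formula, assumes x has as many elements as A has columns, and pays no attention to accuracy or missing values):

```chapel
use Math;

// Naive sketch: cosine distance between each row of a 2D array A and a 1D
// array x, i.e. d[i] = 1 - (A[i,..] . x) / (||A[i,..]|| * ||x||).
proc cosineDistances(A: [?AD] real, x: [?XD] real) {
  var d: [AD.dim(0)] real;                       // one distance per row of A
  const xNorm = sqrt(+ reduce (x * x));
  forall i in AD.dim(0) {
    const row = A[i, ..];                        // 1D slice holding row i
    const num = + reduce (row * x);
    const den = sqrt(+ reduce (row * row)) * xNorm;
    d[i] = 1.0 - num / den;
  }
  return d;
}
```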

damianmoz commented 4 years ago

Inverting my own questions: there are accurate ways to calculate it which do not overflow or underflow, and there are naive ways which may underflow destructively to zero (when they should not) or overflow to infinity (when they should not).
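
One well-known way to get the accurate behaviour is to rescale both vectors by their largest magnitude before squaring anything, so the intermediate sums of squares stay in range; the scale factors cancel in the final ratio. A very rough sketch, which addresses only the range problem and not every accuracy concern:

```chapel
use Math;

// Rough sketch of an overflow/underflow-resistant cosine similarity:
// rescale x and y by their largest magnitudes before forming sums of squares.
proc scaledCosineSimilarity(x: [?D] real, y: [D] real): real {
  const sx = max reduce abs(x),
        sy = max reduce abs(y);     // a zero vector falls through to 0/0 = NaN
  const xs = x / sx, ys = y / sy;   // scaled copies, every |element| <= 1
  const num = + reduce (xs * ys);
  const den = sqrt(+ reduce (xs * xs)) * sqrt(+ reduce (ys * ys));
  return num / den;                 // the scale factors cancel here
}
```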

I assume you meant that d(A, x) returns a 1D array? My limited experience is where both are vectors and it returns a scalar.

And the question is how many of the statistical kernels one should try to implement so that the interrelationships between them are understood. That is a lot of work.

And do we have to cater for missing values? And how do you represent them?

ben-albrecht commented 4 years ago

> I'm sure that will work, though I could imagine it being held in a kind of statistical tools module as well. I'm open!

I'd lean towards a statistics or spatial package (the latter following scipy's hierarchy convention).

buddha314 commented 4 years ago

> Inverting my own questions: there are accurate ways to calculate it which do not overflow or underflow, and there are naive ways which may underflow destructively to zero (when they should not) or overflow to infinity (when they should not).

I don't think I can offer intelligent advice on that.

> I assume you meant that d(A, x) returns a 1D array? My limited experience is where both are vectors and it returns a scalar.

In this example, A is 2D, x is 1D, and the result is a 1D array of distances between x and every row of A.

> And the question is how many of the statistical kernels one should try to implement so that the interrelationships between them are understood. That is a lot of work.

> And do we have to cater for missing values? And how do you represent them?

I would suggest we mimic the behavior of scipy unless it conflicts with HPC best practices.

damianmoz commented 4 years ago

There are at least two types of users: those who work with missing values and use NaN to represent a missing value, and those who do not have missing values and for whom a NaN means a catastrophic screw-up in the program. Both groups use HPC.

How do you want missing values handled? Do you want them ignored in the calculation of the cosine distance, or do you want a cosine similarity or distance to return a NaN whenever a NaN is seen in either your matrix A or your vector x?

Using a NaN for a missing value means that NaN propagation does not happen in the presence of infinity. Sadly, the statisticians have won the often acrimonious argument within the IEEE 754 standard, which makes standard library routines like hypot a waste of space for the second class of users (and means that we have to write our own). Even more useless are the IEEE 754 dot-product routines which fail to propagate NaNs.

So, if this has a statistical focus, it needs to handle missing values properly. If it has a linear algebra focus, it needs to propagate NaNs properly. If it is to be used for both, there need to be two versions of the routine. My guess is SciPy only does it one way. Exactly what constitutes best practice is often a moot point, as best practice is often not best and in some scenarios is even the worst. If we have to follow only one, I agree with you: the Chapel module should follow SciPy, but with a massive disclaimer.
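
To make the two behaviours concrete, the two versions might look roughly like this for 1D real arrays (illustrative names only; isNan is in the Math module, spelled isnan in older Chapel releases):

```chapel
use Math;

// Statistics-flavoured version: treat NaN as "missing" and skip those positions.
proc cosineSimilaritySkipMissing(x: [?D] real, y: [D] real): real {
  var num = 0.0, xx = 0.0, yy = 0.0;
  for i in D {
    if isNan(x[i]) || isNan(y[i]) then continue;  // ignore missing pairs
    num += x[i] * y[i];
    xx  += x[i] * x[i];
    yy  += y[i] * y[i];
  }
  return num / (sqrt(xx) * sqrt(yy));
}

// Linear-algebra-flavoured version: any NaN poisons the sums and so propagates
// to the result, which the naive formula already guarantees.
proc cosineSimilarityPropagate(x: [?D] real, y: [D] real): real {
  return (+ reduce (x * y)) / (sqrt(+ reduce (x * x)) * sqrt(+ reduce (y * y)));
}
```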

buddha314 commented 4 years ago

@damianmoz I have to admit, I cannot match your passion for this topic, but it's rather refreshing!

damianmoz commented 4 years ago

Sorry if it was over the top, but I wanted to make sure we captured the right way to approach your problem and had the complete picture. Have you tried writing a naive implementation of this function in Chapel yourself, using the norm and dot routines from the Norm and LinearAlgebra modules respectively? That should work as a first cut. Do you want similarity or distance?
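
By a naive first cut I mean, very roughly, something like this (untested; the calls are qualified only to sidestep any name clash between the two modules, and it ignores all of the accuracy and NaN issues above):

```chapel
use Norm, LinearAlgebra;

// Naive first cut: cosine distance between two 1D arrays via dot and the
// default (2-norm) norm; no attention to overflow, underflow, or missing values.
proc cosineDistance(x: [?D] real, y: [D] real): real {
  return 1.0 - LinearAlgebra.dot(x, y) / (Norm.norm(x) * Norm.norm(y));
}
```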

Have you ever had any signed or unsigned Infinities in your data?