ashvardanian / SimSIMD

Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

Question on Cosine Similarity Result #112

Closed edwinkys closed 2 months ago

edwinkys commented 2 months ago

First of all, thank you for creating and maintaining this project. It helped a lot with my SIMD implementation of distance functions for vectors.

I ran into an oddity when using f32::cosine in Rust. Compared with a manual cosine similarity calculation, it produces a different result.

use simsimd::SpatialSimilarity;

fn main() {
    let a: Vec<f32> = vec![1.0, 3.0, 5.0];
    let b: Vec<f32> = vec![2.0, 4.0, 6.0];

    // Manual cosine similarity: dot(a, b) / (|a| * |b|)
    let dot = f32::dot(&a, &b).unwrap() as f32;
    let ma = a.iter().map(|x| x.powi(2)).sum::<f32>().sqrt();
    let mb = b.iter().map(|x| x.powi(2)).sum::<f32>().sqrt();
    let cosine = dot / (ma * mb);

    // Compare against the value returned by SimSIMD
    assert_eq!(cosine, f32::cosine(&a, &b).unwrap() as f32);
}

I'm just curious, is there something I'm missing about the implementation?

Note: f32::dot and f32::sqeuclidean do produce the correct results compared to the manual calculation.

ashvardanian commented 2 months ago

Thank you, @edwinkys! What is the result you are getting?

edwinkys commented 2 months ago

Thank you for the fast reply!

For manual calculation, I got: 0.99385864. For SimSIMD calculation, I got: 0.009096146.

I ran it on an Apple M2 chip, if that is relevant.

ashvardanian commented 2 months ago

It returns the cosine distance, not the cosine similarity, so you have to subtract the value from 1 to get the wanted result 🤗
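The relationship between the two quantities can be sketched in plain Rust. This is a minimal illustration that does not call SimSIMD; the function names `cosine_similarity` and `distance_to_similarity` are hypothetical helpers, and the reference similarity is accumulated in f64 by assumption.

```rust
/// Cosine similarity straight from the definition, accumulated in f64.
/// Hypothetical helper for illustration; not part of SimSIMD.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    let dot: f64 = a
        .iter()
        .zip(b)
        .map(|(x, y)| f64::from(*x) * f64::from(*y))
        .sum();
    let norm_a = a.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}

/// Convert a cosine distance into a cosine similarity: similarity = 1 - distance.
fn distance_to_similarity(distance: f64) -> f64 {
    1.0 - distance
}
```

Passing the distance reported in this thread, 0.009096146, to `distance_to_similarity` gives a similarity of about 0.9909, while `cosine_similarity(&[1.0, 3.0, 5.0], &[2.0, 4.0, 6.0])` gives about 0.9939.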

edwinkys commented 2 months ago

Hmm, I'm not quite sure about that. The manual calculation that I provided computes cosine similarity, not cosine distance.

Sc = (x • y) / (||x|| ||y||)
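For the two vectors in the example, x = (1, 3, 5) and y = (2, 4, 6), the formula works out as follows (the notation S_c, x, y follows the line above):

```latex
S_c = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
    = \frac{1 \cdot 2 + 3 \cdot 4 + 5 \cdot 6}
           {\sqrt{1^2 + 3^2 + 5^2} \, \sqrt{2^2 + 4^2 + 6^2}}
    = \frac{44}{\sqrt{35} \, \sqrt{56}}
    = \frac{44}{\sqrt{1960}}
    \approx 0.99386
```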

edwinkys commented 2 months ago

So, I did some digging and found the cosine similarity implementation in the code base: https://github.com/ashvardanian/SimSIMD/blob/18d17686124ddebd9fe55eee56b2e0273a613d4b/include/simsimd/spatial.h#L388-L414

Line 413 seems to imply that the SimSIMD cosine implementation returns cosine distance instead of cosine similarity.

But even then, after I subtract the SimSIMD result from 1 to obtain the cosine similarity, the result is 0.99090385, which still doesn't match the 0.99385864 I got from my calculation.

I also double checked my calculation result against this online calculator: https://www.omnicalculator.com/math/cosine-similarity?c=USD&v=trig:0,a0:1,a1:3,a2:5,b0:2,b1:4,b2:6

I'd love to learn more about why this happens, and I'm willing to help if needed 😁

ashvardanian commented 2 months ago

@edwinkys the difference between 0.99090385 and 0.99385864 is due to numerical error, likely coming from the vrsqrte_f32 operation, which approximates the reciprocal square root. You may get noticeably different results depending on the 1 / sqrt(x) implementation.
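To see how much a coarse 1 / sqrt(x) estimate can move the final cosine, here is a rough sketch in portable Rust. It does not reproduce SimSIMD's kernel or the NEON instruction: `rough_rsqrt` uses the well-known bit-level trick purely as a stand-in for a low-precision hardware estimate, and `refine` and `cosine_with_rsqrt` are hypothetical names. Each Newton-Raphson step roughly squares the relative error of the estimate.

```rust
/// Crude reciprocal square root via the classic bit-level trick.
/// Stand-in for a low-precision hardware estimate; NOT SimSIMD's code.
fn rough_rsqrt(x: f32) -> f32 {
    f32::from_bits(0x5f37_59df_u32.wrapping_sub(x.to_bits() >> 1))
}

/// One Newton-Raphson step that improves an estimate y of 1 / sqrt(x).
fn refine(x: f32, y: f32) -> f32 {
    y * (1.5 - 0.5 * x * y * y)
}

/// Cosine similarity from a dot product and the two squared norms,
/// using the rough estimate followed by `steps` refinement steps.
fn cosine_with_rsqrt(dot: f32, a2: f32, b2: f32, steps: u32) -> f32 {
    let mut ra = rough_rsqrt(a2);
    let mut rb = rough_rsqrt(b2);
    for _ in 0..steps {
        ra = refine(a2, ra);
        rb = refine(b2, rb);
    }
    dot * ra * rb
}
```

For the vectors in this thread the inputs are dot = 44, a2 = 35, and b2 = 56. Calling `cosine_with_rsqrt(44.0, 35.0, 56.0, steps)` with `steps` set to 0, 1, and 2 shows the result moving closer to 44 / sqrt(1960) with each refinement, which is the kind of accuracy versus speed choice an implementation has to make.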

edwinkys commented 2 months ago

Oh, I see. If it's an approximate reciprocal square root calculation, that makes sense. Thank you for clarifying it!

ashvardanian commented 2 months ago

All floating point operations are approximate, but different libraries will have different accuracy/speed/complexity tradeoffs 🤗
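In practice this means results from different implementations are better compared with a tolerance than with exact equality. A minimal sketch follows; `approx_eq` is a hypothetical helper, and any specific tolerance is an assumption chosen by the caller, not a documented SimSIMD accuracy bound.

```rust
/// Returns true when `a` and `b` agree within a relative tolerance.
/// Hypothetical helper; the tolerance is up to the caller.
fn approx_eq(a: f32, b: f32, rel_tol: f32) -> bool {
    // Scale by the larger magnitude; the floor avoids a zero scale.
    let scale = a.abs().max(b.abs()).max(f32::MIN_POSITIVE);
    (a - b).abs() <= rel_tol * scale
}
```

With the two similarity values from this thread, `approx_eq(0.99385864, 0.99090385, 1e-2)` holds, while the same comparison with a tolerance of `1e-6` does not.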