ashvardanian / SimSIMD

Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
797 stars, 42 forks

Jaccard in parallel? simsimd.cdist #65

Closed by norsedrunkensailor 5 months ago

norsedrunkensailor commented 5 months ago

Will it be possible to extend simsimd.cdist to allow for Jaccard distances to be calculated in batch?

ashvardanian commented 5 months ago

It should already be supported. If not, can you provide a snippet or maybe extend the test?

norsedrunkensailor commented 5 months ago

Is this correct usage?

[Screenshot 2024-01-22 at 09 55 31: image not preserved]

ashvardanian commented 5 months ago

That shouldn't work, @norsedrunkensailor, as Jaccard distance is a distance between sets, not continuous vectors. In our case, it's implemented for bitsets. So you may want to compare values against ones and then call np.packbits before passing to SimSIMD. Let me know if that helps 🤗
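
A minimal sketch of that preprocessing, using NumPy only. The helper names `pack_binary` and `jaccard_packed` are hypothetical, and the pure-NumPy distance is just a reference for checking results on small inputs, not SimSIMD's implementation. Treating two empty sets as distance 0 is an assumption as well.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def pack_binary(vectors):
    """Compare values against one and pack 8 bits into each byte."""
    bits = (np.asarray(vectors) == 1).astype(np.uint8)
    return np.packbits(bits, axis=1)  # trailing bits are zero-padded

def jaccard_packed(a, b):
    """Reference all-pairs Jaccard distance between rows of packed bitsets."""
    inter = POPCOUNT[a[:, None, :] & b[None, :, :]].sum(axis=2)
    union = POPCOUNT[a[:, None, :] | b[None, :, :]].sum(axis=2)
    # Assumption: two empty sets are treated as identical (distance 0).
    ratio = np.divide(inter, union, out=np.ones(inter.shape), where=union > 0)
    return 1.0 - ratio

x = pack_binary([[1, 1, 0, 0], [1, 0, 1, 0]])
y = pack_binary([[1, 1, 0, 0], [0, 0, 0, 1]])
distances = jaccard_packed(x, y)
# The packed uint8 arrays `x` and `y` are what would be handed to SimSIMD's
# batch call with a Jaccard metric; the exact arguments depend on the version.
```

The packed arrays have one byte per eight input columns, so a 4-column input becomes a single byte per row.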

norsedrunkensailor commented 5 months ago

Ah yes, of course, sorry. I was trying to implement a method proposed in https://link.springer.com/article/10.1007/s41060-017-0064-z which reduces the number of operations using the related Tanimoto coefficient and some bounding conditions. Using np.packbits works. Is there a way of monitoring progress for large batches (all-pairs similarity search over 10^5 by 10^5 vectors), to get an estimate of how long it will take? Thank you again 😁🍀

ashvardanian commented 5 months ago

@norsedrunkensailor, for progress tracking, please check out the USearch library. It adds multithreading and custom logging functionality, among other things 🤗
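
Independently of USearch, a rough time estimate can also be had by splitting the first matrix into row chunks and timing each one. The sketch below is an illustration under assumptions, not a SimSIMD or USearch feature: `chunked_cdist`, `jaccard_block`, `block_fn`, and `report` are hypothetical names, and the NumPy `jaccard_block` is only a stand-in for whatever batch distance call is actually used.

```python
import time
import numpy as np

def jaccard_block(a, b):
    """Stand-in block kernel: Jaccard distance on packed uint8 bitsets."""
    inter = np.unpackbits(a[:, None, :] & b[None, :, :], axis=2).sum(axis=2)
    union = np.unpackbits(a[:, None, :] | b[None, :, :], axis=2).sum(axis=2)
    # Assumption: two empty sets are treated as identical (distance 0).
    ratio = np.divide(inter, union, out=np.ones(inter.shape), where=union > 0)
    return 1.0 - ratio

def chunked_cdist(a, b, chunk_rows=1024, block_fn=jaccard_block, report=print):
    """Compute all-pairs distances chunk by chunk, reporting a linear ETA."""
    out = np.empty((len(a), len(b)), dtype=np.float32)
    start = time.perf_counter()
    for lo in range(0, len(a), chunk_rows):
        hi = min(lo + chunk_rows, len(a))
        out[lo:hi] = block_fn(a[lo:hi], b)
        elapsed = time.perf_counter() - start
        # Extrapolate from the rows finished so far to the rows remaining.
        remaining = elapsed * (len(a) - hi) / hi
        report(f"{hi}/{len(a)} rows done, ~{remaining:.1f}s remaining")
    return out

rng = np.random.default_rng(0)
packed = np.packbits(rng.integers(0, 2, size=(10, 16), dtype=np.uint8), axis=1)
messages = []
result = chunked_cdist(packed, packed, chunk_rows=4, report=messages.append)
```

At 10^5 by 10^5 a dense float32 result would need about 40 GB, so in practice each block would probably be filtered or reduced before being stored.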