ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
998 stars, 59 forks

Unsupported metric 'h' for two numpy matrices of type np.uint8? #166

Closed stuartatnosible closed 2 months ago

stuartatnosible commented 2 months ago

Hi There,

I was just checking out version 5.1.0 and I noticed that this code no longer works:

import numpy as np
import simsimd

np.random.seed(42)

a_mat = np.ascontiguousarray(np.random.randint(0, 255, (100, 192), dtype=np.uint8))
b_mat = np.ascontiguousarray(np.random.randint(0, 255, (100, 192), dtype=np.uint8))

dist = np.array(simsimd.cdist(a_mat, b_mat, 'hamming'), dtype=np.uint8)

The error message that gets returned when I run the code is the following:

ValueError: Unsupported metric 'h' and datatype combination ('B'/'B' and 'B'/'B')

When using versions 5.0.1, 5.0.0, and 4.4.0 the code works exactly as I would expect.

Ideally given the changes in 5.1.0 I was hoping to be able to rewrite the code like so:

np.random.seed(42)

a_mat = np.ascontiguousarray(np.random.randint(0, 255, (100, 192), dtype=np.uint8))
b_mat = np.ascontiguousarray(np.random.randint(0, 255, (100, 192), dtype=np.uint8))

dist = np.array(simsimd.cdist(a_mat, b_mat, 'hamming', dtype="u8"))

The goal is to reduce the memory footprint when computing distances on very large matrices.

Am I missing something? My setup is Windows 11 + 13th Gen Intel(R) Core(TM) i7-13700HX.

Many Thanks, Stuart Reid

ashvardanian commented 2 months ago

Nice catch! There is a problem with how NumPy propagates binary data to CPython. Does it work if you set dtype="b8"?

PS: Sorry for the inconvenience 🤗
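For context on the 'B' in the error message: it matches the one-character type code that NumPy and the Python buffer protocol use for unsigned 8-bit integers. The short check below shows what a np.uint8 array reports about itself. That SimSIMD derives the 'B' in its error from this format code is an inference from the message, not something confirmed in this thread.

```python
import numpy as np

# Same shape and dtype as the matrices in the report; contents do not matter here.
a_mat = np.zeros((100, 192), dtype=np.uint8)

# NumPy's one-character type code for uint8.
print(a_mat.dtype.char)

# Format string exposed through the Python buffer protocol.
print(memoryview(a_mat).format)
```

Both lines print B. Nothing in that code says whether the bytes are plain integers or packed bits, which may be why an explicit dtype hint is being discussed here.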

stuartatnosible commented 2 months ago

No worries, SimSIMD is epic. The following code:

np.random.seed(42)

a_mat = np.ascontiguousarray(np.random.randint(0, 255, (100, 192), dtype=np.uint8))
b_mat = np.ascontiguousarray(np.random.randint(0, 255, (100, 192), dtype=np.uint8))
dist = np.array(simsimd.cdist(a_mat, b_mat, metric="hamming", dtype="b8"))

Produces the same error using SimSIMD==5.1.0:

ValueError: Unsupported metric 'h' and datatype combination ('B'/'B' and 'B'/'B')

I've tried a bunch of variations but I'm not winning.

ashvardanian commented 2 months ago

Thanks @stuartatnosible! I've added tests covering the described use-case and added documentation to README.md. Please let me know if you notice any other issues 🤗
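For anyone cross-checking results between versions, a NumPy-only reference can help. The sketch below assumes that 'hamming' on bit-packed uint8 rows means the number of differing bits per pair of rows. The function name hamming_cdist_bits is hypothetical and this is not SimSIMD's implementation.

```python
import numpy as np


def hamming_cdist_bits(a, b):
    """Pairwise count of differing bits between rows of two uint8 matrices."""
    a = np.ascontiguousarray(a, dtype=np.uint8)
    b = np.ascontiguousarray(b, dtype=np.uint8)
    # XOR every row of `a` with every row of `b`: shape (n_a, n_b, n_bytes).
    diff = np.bitwise_xor(a[:, None, :], b[None, :, :])
    # Expand each byte into 8 bits, then count the set bits for each pair.
    return np.unpackbits(diff, axis=-1).sum(axis=-1, dtype=np.uint32)


rng = np.random.default_rng(42)
a_mat = rng.integers(0, 256, (100, 192), dtype=np.uint8)
b_mat = rng.integers(0, 256, (100, 192), dtype=np.uint8)

reference = hamming_cdist_bits(a_mat, b_mat)
print(reference.shape, reference.dtype)
```

Under this bit-level interpretation, the largest possible distance for 192-byte rows is 192 * 8 = 1536, which does not fit in an 8-bit output type. The intermediate array grows with n_a * n_b * n_bytes, so this reference is only practical for small inputs.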