Bug: ValueError: Input vectors must be contiguous

ogencoglu commented 1 month ago

Describe the bug

I have a script and replaced scipy cdist with simsimd cdist:

    distance_matrix2 = simsimd.cdist(
        np_array1,
        np_array2,
        metric="sqeuclidean",
    )

and got the following error: ValueError: Input vectors must be contiguous

Any pointer to what this error means?

Steps to reproduce

Basic test such as

import numpy as np
import simsimd
matrix1 = np.random.randn(1106, 2)
matrix2 = np.random.randn(422, 2)
np.array(simsimd.cdist(matrix1, matrix2, metric="sqeuclidean"))

seems to be working. Any pointer to what this error means?

If not, I will try to send the numpy .npy files somehow.

Expected behavior

To execute

SimSIMD version

5.5.0

Operating System

MacOS

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

[X] I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

[X] I have searched the existing issues

Code of Conduct

[X] I agree to follow this project's Code of Conduct

ogencoglu commented 1 month ago

I also saved the NumPy arrays as .npy, loaded them in Jupyter, and tried to run. Still the same error.

Tried to cast to np.float32 before running, still the same error.

They are of shape: (1106, 2) (422, 2)

ashvardanian commented 1 month ago

Sure, @ogencoglu, this means your vectors don't occupy a contiguous buffer in memory and are strided, meaning there is spacing between neighboring rows or matrix cells. Can you please share the output of:

print(matrix1.__array_interface__)
print(matrix2.__array_interface__)

There is also a workaround:

matrix1 = np.ascontiguousarray(matrix1)
matrix2 = np.ascontiguousarray(matrix2)
distance_matrix2 = simsimd.cdist(matrix1, matrix2, metric="sqeuclidean")

Please let me know what the output is and if the workaround helps. I can probably extend the interface to support at least row strides 🤗
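For readers following along, here is a minimal sketch of how a strided array can arise and how to check for it with NumPy alone. The arrays are hypothetical stand-ins rather than the reporter's data, and slicing columns out of a wider matrix is only one of several ways to end up with gaps between rows.

```python
import numpy as np

# Hypothetical example: take 2 of 4 columns from a wider float32 matrix.
# The result is a view, so each row still starts 16 bytes after the last.
big = np.random.randn(1106, 4).astype(np.float32)
matrix1 = big[:, :2]

print(matrix1.shape)                  # (1106, 2)
print(matrix1.strides)                # (16, 4): rows are 16 bytes apart
print(matrix1.flags["C_CONTIGUOUS"])  # False

# np.ascontiguousarray packs the rows into a fresh row-major buffer.
fixed = np.ascontiguousarray(matrix1)
print(fixed.strides)                  # (8, 4)
print(fixed.flags["C_CONTIGUOUS"])    # True
```

The `flags` and `strides` attributes carry the same layout information as `__array_interface__`, in a form that is easier to read at a glance.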

ogencoglu commented 1 month ago

Thanks for the swift reply.

Here is the output:

{'data': (4866344960, False), 'strides': (4, 3904), 'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (976, 2), 'version': 3}
{'data': (105553123344480, False), 'strides': (4, 24), 'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (6, 2), 'version': 3}

and for some other case:

{'data': (6049844224, False), 'strides': (4, 4424), 'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (1106, 2), 'version': 3}
{'data': (6049985024, False), 'strides': (4, 1688), 'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (422, 2), 'version': 3}

When I tried the np.ascontiguousarray trick, I get:

TypeError: Input tensors must have matching datatypes, check with `X.__array_interface__`

Then I explicitly cast with .astype(float).

Now it works, but all this makes it slower than scipy cdist. I don't have very large matrices, but my code calls this distance calculation hundreds of times. I guess the ascontiguousarray and casting overheads are the bottleneck.

Anyway, I ended up implementing it in pure Numba, which gives results several times faster than scipy cdist. So I got my problem solved. But I will be following SimSIMD closely for sure.
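One possible way to trim the conversion overhead described above is to fix the dtype and the layout in a single call, ideally once where the arrays are created rather than on every distance computation. The sketch below assumes the TypeError came from the two inputs ending up with different dtypes. The helper name `as_cdist_input` is hypothetical and not part of SimSIMD, and whether this actually closes the gap to scipy is not something the thread establishes.

```python
import numpy as np

def as_cdist_input(a, dtype=np.float32):
    """Hypothetical helper: return `a` as a C-contiguous array of `dtype`.

    np.ascontiguousarray handles the cast and the layout fix together,
    copying only when the input does not already conform.
    """
    return np.ascontiguousarray(a, dtype=dtype)

# Stand-in inputs: one transposed float32 view, one contiguous float64 array.
a = np.random.randn(2, 1106).astype(np.float32).T
b = np.random.randn(422, 2)

a_ready = as_cdist_input(a)
b_ready = as_cdist_input(b)

print(a_ready.dtype, a_ready.flags["C_CONTIGUOUS"])
print(b_ready.dtype, b_ready.flags["C_CONTIGUOUS"])

# Input that already conforms is reused rather than copied again.
print(np.shares_memory(as_cdist_input(a_ready), a_ready))
```

Both prepared arrays could then be passed to the cdist call in place of the originals.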

ashvardanian commented 1 month ago

Hey @ogencoglu! I've found the issue!

In your logs:

{'data': (4866344960, False), 'strides': (4, 3904), 'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (976, 2), 'version': 3}
{'data': (105553123344480, False), 'strides': (4, 24), 'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (6, 2), 'version': 3}

Both matrices really do have a non-contiguous layout. The first stride being smaller than the second indicates that you may have transposed the matrix. Overall, I recommend using a row-major layout. It will give better results with practically every framework, including Numba implementations. SimSIMD explicitly discourages passing transposed or strided inputs.
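The transposition hypothesis is easy to check with NumPy alone. The sketch below assumes the data was first built with shape (2, 976) and then transposed. That is a guess about the reporter's pipeline, but it reproduces the strides from the first log line exactly.

```python
import numpy as np

# Assumption: the points were stored as 2 rows of 976 float32 values.
points = np.zeros((2, 976), dtype=np.float32)
transposed = points.T  # a view, no data is moved

print(transposed.shape)                  # (976, 2)
print(transposed.strides)                # (4, 3904), as in the log above
print(transposed.flags["F_CONTIGUOUS"])  # True: column-major layout

# Copying into row-major order gives the layout recommended above.
row_major = np.ascontiguousarray(transposed)
print(row_major.strides)                 # (8, 4)
```

The second stride, 3904, is 976 rows times 4 bytes, which is what a column-major (976, 2) float32 array looks like.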

ashvardanian / SimSIMD