ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖
https://ashvardanian.com/posts/stringzilla/
Apache License 2.0
1.92k stars 64 forks source link

Bug: sz_find incorrectly finds the substring with length=5 #154

Open vasily-v-ryabov opened 2 months ago

vasily-v-ryabov commented 2 months ago

Describe the bug

StringZilla incorrectly finds this case "Hello, world!".find("world", 0, 11) which must return -1.

Function sz_equal_serial doesn't handle correctly the suffix of substring "world" (suffix == "d"). It goes exactly to last statement return (sz_bool)(a_end == a); where a_end == "!" and a == "!" which is very strange because at the first line of the function there is an assignment sz_ptr_t const a_end = a + length; where length == 1 in debugger.

It looks like a while loop before return statement is written incorrectly for this case. b++ (b was empty "\0" string before entering the loop) remains empty randomly because subsequent memory is also clean (zeroes).

I'm aware about #72 though it is just a wish for optimization.

Steps to reproduce

Simple to test it in Python:

pip install stringzilla

and

>>> print("Hello, world!".find("world", 0, 11)) # original CPython algorithm
-1
>>> from stringzilla import Str
>>> print(Str("Hello, world!").find("world", 0, 11)) # StringZilla algorithm
7

Expected behavior

>>> print("Hello, world!".find("world", 0, 11)) # original CPython algorithm
-1

StringZilla version

3.8.4

Operating System

Ubuntu 22.04 and Windows 10/11 64-bit

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

Is there an existing issue for this?

Code of Conduct

vasily-v-ryabov commented 2 months ago

OK, maybe the root cause is at higher level, not exactly in this function. I couldn't figure out the root cause yet.

ashvardanian commented 2 months ago

Interesting, which CPU model are you running on?

ashvardanian commented 2 months ago

Oh, it must be a binding issue. When I avoid passing the extra arguments I get:

>>> print("Hello, world!".find("world")) # original CPython algorithm
7
>>> print(Str("Hello, world!").find("world")) # StringZilla algorithm
7

So should be coming from python/lib.c, @vasily-v-ryabov.

vasily-v-ryabov commented 1 month ago

I run on different Intel CPUs: Core i7-8700 or Core i7-12700H.

vasily-v-ryabov commented 1 month ago

Also I see similar issue in C code.

ashvardanian commented 1 month ago

Wow, that's dangerous! Any chance there is an existing test you can extend in scripts/test.cpp to highlight that? PRs always welcomed!