Optimization when trying to find the longest substring match?

JobLeonard commented 3 years ago

Context: for the Lempel-Ziv 77 family of compression algorithms, one has to search through a sliding window for the longest possible substring match. I figured that there were SIMD algorithms for that these days. Your page on SIMD-friendly algorithms for substring searching was one of the first things to show up. However, while a beautiful write-up of wonderful research and code, the algorithm described there appears to serve a subtly different purpose: finding a substring of known length. In the LZ77 scenario we don't know ahead of time what the longest substring we are looking for is. Instead we have a very long string as our input (the sliding window of LZ77), and ask the question "taking a substring at starting index i, what is the longest substring match inside the sliding window behind it?"

However, conceptually it seems that there is an obvious way to optimize the algorithm you described on your page.

Taking your algorithm as a starting point, I just came up with the following: first we start with a 2-length string. We can skip the memcmp part (since any 2-length first+last matches are guaranteed to be complete matches). Then in a loop:

increase the string length,
test the last char of the new length
instead of a memcmp, we AND the result with the mask previous result
- if the new mask is non-zero, then basic induction tells use still have a substring match at the new length - we checked all characters so far after all.
- if we have no values left, then we know that the previous mask contained however many substring matches occurred within the text

So we basically loop until the "new mask" is empty, then use the second-to-last mask to determine where all the substring matches.

This is just something I came up with on the spot while reading your article, but it feels like the most straightforward, naive way to do this efficiently. Are you aware of whether or not this is a sensible approach when trying to find the longest substring match?

WojciechMula commented 3 years ago

Hi, thank you. To be honest, I was never thinking about the longest matches. Your approach seems interesting, I'd love to see how it performs in the real life. If I remember correctly, gzip uses hashes for short prefixes (3 letters?). It would be great to compare these.

rendello commented 2 years ago

Thanks for the article, it was mind-expanding and very useful.

To @JobLeonard, I was also using the article as a basis for LZ77 substring matching. While I looked at your method, I'm not sure how it's too different from just comparing individual bytes, as it seems to be expanding on the first substring match but isn't handling any latter matches. In any case, we could continue the conversation over email if you would like, I don't mean to spam these Github issues.

I just figured you might be interested in my alternate approach, unfortunately it's encapsulated in a giant Zig hell-function. It's not very battle-tested yet, but it seems to work from what I see. I want to diagram the function but I haven't been able to get a grip on diagramming tools so far! If you have any questions, feel free to email me, gaven@rendello.ca.

https://github.com/rendello/compressor/blob/431a4b14fcb097aefe785441b7318b8d5f328ec3/src/str_match.zig#L63

WojciechMula / sse4-strstr

Optimization when trying to find the longest substring match? #14