One thing that I would like to add is that the hyperscan bitap-based SIMD method seems oversold in their article. I made a similar implementation in ugrep to test as a prototype, but it runs slightly slower than my non-SIMD serial version, not faster! I've used standard bitap (a.k.a. shift-or) for years now for multi-string (multi-pattern) search. It is an approximate method that is not unique to hyperscan. Their article mistakenly states that bitap is only for single pattern search, which is not true.
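For readers unfamiliar with the method: shift-or superimposes all patterns into a single table of bitmasks, which is exactly what makes it an approximate multi-pattern filter whose candidates must be verified. A minimal serial sketch of the idea (my own illustration, not ugrep's code; it assumes a non-empty word list and a minimal word length of at most 32):

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Minimal multi-pattern shift-or (bitap) sketch, not ugrep's implementation.
// All words are superimposed in one table, so a zero bit at position m-1 only
// signals a candidate that must still be verified against the actual words.
std::vector<size_t> shift_or_candidates(const std::vector<std::string>& words, const std::string& text)
{
  size_t m = words[0].size();
  for (const std::string& w : words)
    m = std::min(m, w.size());                 // bitap uses the shortest word length m (m <= 32 here)
  uint32_t table[256];
  std::memset(table, 0xff, sizeof(table));     // all bits set: "no word has this char at this position"
  for (const std::string& w : words)
    for (size_t i = 0; i < m; ++i)
      table[static_cast<uint8_t>(w[i])] &= ~(1u << i); // clear bit i if any word has this char at i
  std::vector<size_t> candidates;
  uint32_t state = ~0u;
  for (size_t k = 0; k < text.size(); ++k)
  {
    state = (state << 1) | table[static_cast<uint8_t>(text[k])]; // shift-or step
    if ((state & (1u << (m - 1))) == 0)
      candidates.push_back(k - m + 1);         // candidate starting position, verify separately
  }
  return candidates;
}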
My AVX2 hashed bitap implementation for testing and benchmarking performs a 4-way simultaneous bitap step in the fastest possible way with a minimal number of SIMD instructions I could think of (this code is not copied from hyperscan). There is a bit of magic going on here, which is actually just bit-bashing to align the bitap bit vectors and masks. The pat_->vtp_[] contains 4 bitap tables that are shifted for the 4-way vectorized bitap step.
/// Minimal 4 long patterns (MIN>=4) using bitap hashed pairs with AVX2
template <uint8_t MIN>
bool Matcher::simd_advance_pattern_min4_avx2(size_t loc)
{
  const uint32_t btap = Pattern::Const::BTAP;
  const __m128i vmod = _mm_set1_epi32(btap - 1);
  const __m128i vselect = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 13, 9, 5, 1);
  const __m128i voffset = _mm_set_epi32(0, btap, 2 * btap, 3 * btap);
  uint32_t state0 = ~0U << (8 - (MIN - 1));
  uint32_t state1 = ~0U << (8 - (MIN - 2));
  uint32_t state2 = ~0U << (8 - (MIN - 3));
  uint32_t state3 = ~0U << (8 - (MIN - 4));
  if (MIN <= 6)
    state3 = state2;
  if (MIN <= 5)
    state2 = state1;
  if (MIN <= 4)
    state1 = state0;
  __m128i vstate = _mm_set_epi32(state0, state1, state2, state3);
  __m128i vc0 = _mm_set1_epi32(buf_[loc++]);
  while (true)
  {
    const char *s = buf_ + loc;
    const char *e = buf_ + end_ - 3;
    while (s < e)
    {
      __m128i vc1 = _mm_cvtepu8_epi32(_mm_loadu_si32(reinterpret_cast<const uint32_t*>(s)));
      vc0 = _mm_alignr_epi8(vc1, vc0, 12);
      // hash
      __m128i vh = _mm_and_si128(_mm_xor_si128(vc0, _mm_slli_epi32(vc1, 6)), vmod);
      // gather bitap hashed bits
      __m128i vb = _mm_i32gather_epi32(reinterpret_cast<const int32_t*>(pat_->vtp_), _mm_or_si128(vh, voffset), 2);
      // shift-or
      vstate = _mm_or_si128(_mm_slli_epi32(vstate, 4), vb);
      // pass last char to the next iteration
      vc0 = vc1;
      // get most significant bit of each byte, check each 2nd byte of the 4x32 bit words
      uint32_t mask = _mm_extract_epi32(_mm_shuffle_epi8(vstate, vselect), 0);
      if ((mask & 0x00000008) == 0 && pat_->predict_match(s - MIN + 0, MIN))
      {
        loc = s - buf_ - MIN + 0;
        set_current(loc);
        return true;
      }
      if ((mask & 0x00000404) == 0 && pat_->predict_match(s - MIN + 1, MIN))
      {
        loc = s - buf_ - MIN + 1;
        set_current(loc);
        return true;
      }
      if ((mask & 0x00020202) == 0 && pat_->predict_match(s - MIN + 2, MIN))
      {
        loc = s - buf_ - MIN + 2;
        set_current(loc);
        return true;
      }
      if ((mask & 0x01010101) == 0 && pat_->predict_match(s - MIN + 3, MIN))
      {
        loc = s - buf_ - MIN + 3;
        set_current(loc);
        return true;
      }
      // butterfly-or:
      // a       b       c       d       input vec
      // c       d       a       b       swizzle
      // a|c     b|d     c|a     d|b     or
      // b|d     a|c     d|b     c|a     swizzle
      // a|c|b|d b|d|a|c c|a|d|b d|b|c|a or
      vstate = _mm_or_si128(vstate, _mm_shuffle_epi32(vstate, 0x4e)); // = 01 00 11 10 = 1 0 3 2
      vstate = _mm_or_si128(vstate, _mm_shuffle_epi32(vstate, 0xb1)); // = 10 11 00 01 = 2 3 0 1
      s += 4;
    }
    loc = s - buf_;
    set_current_and_peek_more(loc - 1);
    loc = cur_ + 1;
    if (loc + 3 >= end_)
      return advance_pattern_min4<MIN>(loc - MIN);
  }
}
Bitap performance is memory-bound, not CPU-bound, so we can't expect miracles with AVX2. Indeed, my optimized serial version runs slightly faster on average in ugrep:
/// Minimal 4 char pattern using bitap hashed pairs
template <uint8_t MIN>
bool Matcher::advance_pattern_min4(size_t loc)
{
  const Pattern::Pred *tap = pat_->tap_;
  uint32_t state1 = ~0;
  uint32_t state2 = ~0;
  uint32_t mask = 1 << (MIN - 1);
  while (true)
  {
    const char *s = buf_ + loc;
    const char *e = buf_ + end_ - 2;
    uint8_t c0 = static_cast<uint8_t>(*s);
    while (s < e)
    {
      uint8_t c1 = static_cast<uint8_t>(*++s);
      state2 = (state1 << 1) | tap[Pattern::bihash(c0, c1)];
      c0 = static_cast<uint8_t>(*++s);
      state1 = (state2 << 1) | tap[Pattern::bihash(c1, c0)];
      if ((state2 & mask) == 0 && pat_->predict_match(s - MIN - 1, MIN))
      {
        loc = s - buf_ - MIN - 1;
        set_current(loc);
        return true;
      }
      if ((state1 & mask) == 0 && pat_->predict_match(s - MIN, MIN))
      {
        loc = s - buf_ - MIN;
        set_current(loc);
        return true;
      }
    }
    loc = s - buf_;
    set_current_and_peek_more(loc - 1);
    loc = cur_ + 1;
    if (loc + 2 >= end_)
    {
      if (loc + 1 >= end_)
        return false;
      --loc;
      state1 = state2;
    }
  }
}
I've also checked the false positive rate for this method, which impacts performance a lot: false positives never produce a match, but they frequently break out of this vectorized loop. However, with hashed bitap pairs I was able to get the false positive rate below 5% by using the right bitap table sizes and hash functions, which is great and much better than I had expected. It only gets worse when searching for a lot of patterns, like over 1000 words in a file, for which PM4 and Bloom filter hashing run faster.
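For reference, a hashed character-pair table of this kind can be filled along the following lines. This is only a sketch: the table size and the pair hash here mirror the shape used in the AVX2 code above ((c0 ^ (c1 << 6)) masked to the table size), but the actual Pattern::bihash, Pattern::Pred type, and bit layout of tap_ in ugrep may differ.

#include <cstdint>
#include <string>
#include <vector>

// Sketch of a hashed character-pair bitap table. A larger table and a hash
// that spreads pairs well reduce collisions and thus the false positive rate.
const uint32_t BTAP = 1 << 14;  // assumed table size, a power of two

inline uint32_t pair_hash(uint8_t c0, uint8_t c1)
{
  return (c0 ^ (static_cast<uint32_t>(c1) << 6)) & (BTAP - 1);
}

// Clear bit i of tap[h] when some word has the character pair (w[i], w[i+1])
// hashing to h; collisions leave extra zero bits, which become false positives
// that the verification step has to reject.
std::vector<uint16_t> build_pair_table(const std::vector<std::string>& words, size_t min_len)
{
  std::vector<uint16_t> tap(BTAP, 0xffff);
  for (const std::string& w : words)
    for (size_t i = 0; i + 1 < min_len; ++i)
      tap[pair_hash(static_cast<uint8_t>(w[i]), static_cast<uint8_t>(w[i + 1]))] &= static_cast<uint16_t>(~(1u << i));
  return tap;
}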
The AVX2 bitap SIMD logic is enabled by defining WITH_BITAP_AVX2 when compiling the source code on an AVX2-capable CPU.
Updated ugrep 7.0 benchmarks.
Update v7 overhauls the search engine internals to further increase ugrep's search speeds. This update includes new SIMD logic and improved pattern analysis to collect metrics for the heuristic selection of search algorithms. To evaluate the optimizations, I have been using a benchmark setup for arm64 and x64 for a while now that measures the speed of multi-word searches of 1 to 1024 words with minimal lengths 1 to 8 in a 100MB text file. The performance results collected are also a good indication of the improved regex search speed, because the internal optimizations are identical.
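For illustration, the multi-word test patterns for such a run can be generated along these lines (a sketch only; the word list file name and selection details are assumptions, not the actual benchmark scripts):

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Sketch: build a "word1|word2|...|wordN" pattern of N randomly picked words
// with at least the given minimal length. "words.txt" is a hypothetical word
// list with one word per line; the actual benchmark scripts are not shown here.
int main(int argc, char **argv)
{
  size_t n = argc > 1 ? std::strtoul(argv[1], nullptr, 10) : 32;       // number of words
  size_t min_len = argc > 2 ? std::strtoul(argv[2], nullptr, 10) : 3;  // minimal word length
  std::ifstream file("words.txt");
  std::vector<std::string> words;
  std::string w;
  while (file >> w)
    if (w.size() >= min_len)
      words.push_back(w);
  std::mt19937 rng(std::random_device{}());
  std::shuffle(words.begin(), words.end(), rng);
  std::string pattern;
  for (size_t i = 0; i < n && i < words.size(); ++i)
  {
    if (i > 0)
      pattern += '|';
    pattern += words[i];
  }
  std::cout << pattern << '\n';  // e.g. time: ugrep -c 'word1|word2|...' enwik8
  return 0;
}

Each generated pattern is then timed against the 100MB file; repeating this with fresh random words and averaging gives the points in the graphs below.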
The v7 update runs faster overall. I will show one example; there are many more, but they are similar, so I won't bore you with the details. Searching 1 to 32 words (patterns like word1|word2|...|word32) with randomly picked words of length >=3 in a 100MB file (enwik8) is more consistently faster in v7 (new version) than in v6 (old version) on arm64 CPUs. Each point in the graph is an average of 100 runs with the same number of words, but different words, of length >=3.
For x64 CPUs:
Furthermore, the DFA cut optimization algorithm is improved. This optimization speeds up regex searches of patterns with leading repetitions and other forms that tend to be expensive to find, such as [a-zA-Z]*z, which matches words that end in a z. When the DFA is cut, the DFA-based regex match first looks for a z and then looks back to include the leading part. It is not perfect (looking back using a reversed regex would be best), but it works very well in general for a wide range of cases in the benchmarks I've created specifically to evaluate and optimize DFA cuts; a simplified sketch of the idea is shown below. Perhaps I will publish the details in an article at a later date.

The v7 update also corrects minor regex anchor and word boundary matching issues I found in more specialized and extensive testing setups, which may occur when multiple anchors/boundaries are used that may require backtracking. Backtracking is very costly, so ugrep backtracks only in a limited way to avoid slowdowns. Normally this does not matter when one anchor/boundary is used in a regex, but when multiple are used in the same pattern then things get complicated. Update v7 strikes a better balance to support regex anchor and word boundary placements within a regex pattern, but it does not backtrack without limit. As a result of limited backtracking, there may be special cases of patterns with multiple anchors and word boundaries for which the pattern match results could differ from Perl (PCRE) matching, such as showing shorter pattern matches.
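To make the DFA cut idea above concrete, here is a deliberately simplified illustration for the [a-zA-Z]*z example: search for the cut point z, then look back over the leading repetition. This is my sketch of the general idea only, not ugrep's implementation, which is more general and handles the corner cases this ignores (e.g. picking the longest match):

#include <cctype>
#include <cstddef>
#include <cstring>

// Simplified illustration of a DFA cut for the pattern [a-zA-Z]*z:
// search for the cut point 'z' (cheap to find), then look back over the
// leading [a-zA-Z]* repetition to recover the match start.
bool find_cut_match(const char *buf, size_t len, size_t &pos, size_t &match_len)
{
  const char *s = static_cast<const char*>(std::memchr(buf, 'z', len));
  if (s == nullptr)
    return false;                       // no 'z' means no match at all
  const char *b = s;
  while (b > buf && std::isalpha(static_cast<unsigned char>(b[-1])))
    --b;                                // look back over the leading [a-zA-Z]*
  pos = static_cast<size_t>(b - buf);
  match_len = static_cast<size_t>(s - b) + 1;
  return true;
}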
Verification with code inspections, correctness testing, and benchmarking is critical, and it takes time, often more time than coming up with new ideas and implementations in the first place.
There is always more that can be done. Making this project more perfect is an endless pursuit. I will update below if other things come to mind for update v7 that are insightful to share.