Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
529 stars 86 forks source link

Question about Re/flex usecase #211

Closed mkoushan closed 2 months ago

mkoushan commented 3 months ago

Hi. Does RE/flex regex beat hyperscan in multi-pattern matching for processing network traffics?

genivia-inc commented 3 months ago

The performance comparison table shown in the README is for scanning the input to tokenize. This requires an efficient regex matcher. The table does not show how efficient searching is. Hyperscan is fast for searching, but not for matching. RE/flex is fast for searching and matching. See the ugrep benchmarks that use RE/flex. I've not included search benchmarks in the RE/flex repo, because that's already done in the ugrep project. Ugrep will continue to get faster which is also benefitting RE/flex. The latest speed up with SIMD (SSE2/AVX2/NEON/AArch64) is for UTF-8 checking and validation in ugrep.

Both matching and searching are fast with RE/flex. Hope this helps.

genivia-inc commented 3 months ago

I can't comment on the speed of pattern matching for processing network traffic. But why wouldn't it be faster if it is faster than other regex matchers?

Note that the RE/flex regex matcher only supports group captures of the whole pattern, not sub-patterns. For example, (foo)|(bar) captures two patterns, so these are distinguishable. But ba((r)|(z)) does not capture anything. I plan to add capturing groups later using work I did in the past combined with work on TRFD.

mkoushan commented 2 months ago

I wrote a benchmark for both libraries. At searching in huge data RE/flex is almost twice the efficient as hyperscan, but in small data repeatedly coming through hyperscan is a little faster than RE/flex. How can I enhance my RE/flex to achieve best possible performance? I already tried making the pattern global.

genivia-inc commented 2 months ago

What do you mean by "small data", is this the input size searched?

The RE/flex matcher uses an internal buffer of 256K which is probably why it is a bit slower for short input. The 256K buffer can be set to a smaller size by setting the global macro REFLEX_BUFSZ to compile the RE/flex source code and your source code (must be at least 4K and defined globally for all code, it's declared and used in include/reflex/absmatcher.h).

Even better, you can specify your own memory region to search, thus avoiding the use of a buffer, which will speed this up (this approach is called zero copy overhead).

See: input methods

Method Result
buffer(b, n) use buffer of n bytes at address b with to a string of n-1 bytes (zero copy)

Note that the buffer must have an extra byte after the string of bytes of data to search/match that may be read by the matcher.

genivia-inc commented 2 months ago

I have updated the online RE/flex manual to explain "zero copy" more clearly. It was mentioned in the introduction towards the end, but I've added a section to the "Tips, Tricks and Gotchas": https://www.genivia.com/doc/reflex/html/index.html#zerocopy

mkoushan commented 2 months ago

I found that the match() method is much faster in this case because we only want to label the packets if the pattern matches. But something has concerned which the buffer() method only supports char* type and not const char* as the input. It would be great if you implement this. But after the suggestions you made and const casting the input data with match() method, the speed got x3 - x7 faster than hyperscan, even with small data.

Also about patterns, does RE/flex optimize the regex pattern it self at compile time? Also is there a way to cache the compiled pattern data so that we can load and unload it at runtime (because of the memory overhead.)

EDIT: How can I stop find() method from changing the input character? I don't need the text which matches I only want to know if pattern has at least 1 match in the input or not and the data must not change

genivia-inc commented 2 months ago

Please note that the buffer(b,n) method supports const char* data/text as input, but you have to cast it to const char* in that case, because normally it expects a writable buffer (just to be safe so to speak, for those who didn't read the details about it when it is safe to do so). You can pass a constant non-mutable region of data as long as you don't use text() (and unput(c), wunput(), rest() or span()) to fetch the match (because it needs to return a 0-terminated char* string). Use something else, like str() to fetch the match, which copies the match to a std::string. This way the specified input buffer and data is never modified. See the documentation notes about this technique.

Perhaps I should take a look at C++ std::string_view which is even more efficient to use as a possible future update to RE/flex to fetch matches that way too.

About patterns: the fastest method is to generate C++ code, see the documentation and the examples/fastsearch.cpp example that explains this. It involves two compilation steps, the first to produce the code and the second to compile that code into the fast searcher. Otherwise, a FSM table of opcode words is generated at runtime for the reflex::Pattern object. This does not take a lot of time, but it can be avoided by making it global/static so it is constructed just once. The thing to avoid at all cost is constructing reflex::Pattern objects in a loop. Always construct them just once, if possible.

Alternatively, you could also use the scanner generator to generate a "search engine" using %option find and write code associated with actions for each pattern to fire it, like a lexer/scanner, but it searches instead. See the examples/fastfind.l example.

mkoushan commented 2 months ago

Thanks for the information. I checked the source code and there's no buffer method which accepts const char* , also I made a mistake that the match() method is the thing I need, I actually need find() method. As far as I know from testing and source code, find() method changes the buffer data and I don't want it to do that. Also it returns all the matches but I just need it to check that if the pattern exist at least once in some sub-string of the data.

EDIT: Sorry I made a mistake again about find() method. Then I just need it to not change the buffer data and that's it.

mkoushan commented 2 months ago

well I tested with actual packets and patterns and RE/flex is extremely slow. It toke 43s to finish but with hyperscan 0.7s

mkoushan commented 2 months ago

thanks for your effort, closing issue.

genivia-inc commented 2 months ago

The find method uses SIMD SSE2/AVX2/NEON/AArch64 when enabled using the proper build steps. Otherwise it is much slower. We tested against rg and hyperscan and it showed excellent results.

Second, the buffer accepts a constant pointer to constant data, it is documented how and why in detail. Why don't you believe it doesn't? Just don't use text() or the other methods that change the buffer by writing a zero byte.

Strange way to close this question.

genivia-inc commented 2 months ago

I'm pretty sure the performance difference is due to searching a lot of shorter regex patterns or patterns that allow short matches, for which hyperscan is more optimized but RE/flex uses a standard non-SIMD optimized method. So ignore my last comment.

I don't know this use case. Where can I find this set of regex patterns?

This should not be difficult to optimize in RE/flex, because hyperscan uses SIMD bitap "buckets" to search. I read their paper a while ago. The method described in their paper is interesting, but isn't necessary faster for grepping text or with tokenization.

mkoushan commented 2 months ago

Second, the buffer accepts a constant pointer to constant data, it is documented how and why in detail. Why don't you believe it doesn't? Just don't use text() or the other methods that change the buffer by writing a zero byte.

I did not use any other methods except find(), and it does not compile if I pass const char* and when I pass non-const char* it crashes due to SEGFAULT.

I don't know this use case. Where can I find this set of regex patterns?

There are plenty of sources which provide free PCAP files. You can download them and open it in Wireshark, then copy it's data as C-String then test some patterns like DNS matching and DPI patterns.

If you're interested on optimizing it for this use-case, I'll open a new issue for it and I'll try to help in development too.

genivia-inc commented 2 months ago

OK, I understand your case. But I don't understand the segfault. I verified it works fine.

Here is a simple example using a Flex-compatible lexer (linked with code generated with reflex to tokenize input, e.g. see documentation on %option fast flex):

yyFlexLexer lexer;
std::ifstream file("<somfile>");
std::string str((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
file.close();
lexer.buffer(const_cast<char*>(str.c_str()), str.size()+1);
while (lexer.yylex())
  std::cout << lexer.str() << std::endl;

This grabs a file in a string, then passes the string data with its terminating \0 to the buffer. The str() tokenization does not modify the buffer. In the upcoming update I will also introduce a lexer.strview() method that is faster (zero copy) than lexer.str().

For the performance bottleneck, please note that I am still evolving the ugrep tool to be faster. This effort affects the reflex project, because ugrep uses reflex. I am currently working on a SIMD version of the predict-match algorithm. Initial results that I have show that it speeds up searching, which is promising. Also, it could be that the find() method uses a suboptimal choice of method for the regex patterns you are using. The choice involves heuristics and methods to try to make the regex match quicker by predicting matches. This is relatively new in the way I wrote that logic, so it will likely evolve and improve over time

Once I have the main work done I am currently focussed on to write SIMD code, I will take a look at the DPI patterns.

genivia-inc commented 1 month ago

DPI is a broad area that includes specialized search methods that may or may not be based on regex/DFA and may require multiple matching (i.e. find matches that overlap) depending on the use case. RE/flex does not support multiple matching (i.e. no overlap) natively as it finds successive matches using SIMD SSE/AVX/Neon/AArch64 acceleration. Acceleration is used when possible. In some complex cases it may fail to pick a suitable acceleration method and use a slower default method instead. This can (or will) be improved in the future, but that depends on running RE/flex against a set of regex patterns to determine optimization opportunities that are missed in the current implementation. I am working on a ugrep and RE/flex update that is unrelated to this, but that will further optimize the search speed.

It is not clear to me what use cases and what regex patterns you have used and tested on your end.

Some sources: