Closed: MichaelBrazier closed this issue 3 years ago.
Performance testing in the past with RE/flex clearly showed that a block size of at least 64K is needed to ensure good performance. However, on most modern systems with solid state disks, 512K is optimal, as was tested on Linux and Windows machines. It has been a while since I worked on this, so I don't recall the online resources I consulted, but my tests showed that those sources are correct.
This performance boost is relevant when using RE/flex as a search engine, but not necessarily when tokenizing input such as source code, which is usually smaller.
It's the performance for tokenizing input that I'm concerned with, yes. (Though the files I'm tokenizing can be larger than the buffer size.) My profiler is reporting a high level of virtual memory being paged to disk, and the input buffer is the largest block of allocated memory by far.
Can the block size be made configurable per scanner, so users can choose their own tradeoff point?
Hm, 512K is not that large and is unlikely to cause swapping unless your RAM is a few MB instead of one or more GB. Also, the buffer size will not change during the tokenization process even if your input file is much larger. The buffer is a window on the input and does not buffer all input.
> Can the block size be made configurable per scanner, so users can choose their own tradeoff point?
Sure, we could do this like so:
```cpp
#ifndef REFLEX_BLOCK_SIZE
static const size_t BLOCK = (256*1024); ///< buffer size and growth; buffer is initially 2*BLOCK bytes, at least 4096 bytes
#else
static const size_t BLOCK = REFLEX_BLOCK_SIZE;
#endif
```
I wrote the following section for the manual to update with the next release, perhaps this helps:
Use reflex option `--full` to create a table DFA for the scanner's regular expression patterns, or option `--fast` to generate a direct-coded DFA. Without one of these options, by default a DFA is created at runtime and stored in heap space.

Compile the generated source code with `-DREFLEX_BLOCK_SIZE=4096` to override the internal `reflex::AbstractMatcher::Const::BLOCK` buffer size. By default, the `reflex::AbstractMatcher::Const::BLOCK` size is 256K, for a large 512K buffer optimized for high-performance file searching and tokenization. The buffer is a sliding buffered window over the input, i.e. input files may be much larger than the buffer size. Furthermore, a small buffer expands to accommodate larger pattern matches. However, when using the `line()` and `wline()` methods, lines longer than `REFLEX_BLOCK_SIZE` may not fit, and the string values returned by `line()` and `wline()` may be truncated. A reasonably small `REFLEX_BLOCK_SIZE` is 8192, for a 16K buffer. Small buffer sizes increase processing time, since the buffered window must be moved along a file more frequently, and they increase the cost of decoding UTF-16/32 into UTF-8 multibyte sequences.

@warning The value of `REFLEX_BLOCK_SIZE` should not be less than 4096.
The v3.0.10 update includes `REFLEX_BLOCK_SIZE` and a documentation update.
@MichaelBrazier Can this be closed?
At line 105 of absmatcher.h, the constant BLOCK is set to 256*1024 = 2^18. The input buffer allocates twice this number of bytes, or 2^19 bytes. Surely that's much more than anyone would need?
BLOCK is also the largest number of bytes to be read in one chunk, but again, a chunk of 2^18 bytes seems extreme. A size of 1K, 2K or 4K would be more reasonable.