Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License

Default input buffer is 2^19 bytes? #111

Closed MichaelBrazier closed 3 years ago

MichaelBrazier commented 3 years ago

At line 105 of absmatcher.h, the constant BLOCK is set to 256*1024 = 2^18. The input buffer allocates twice this number of bytes, or 2^19 bytes. Surely that's much more than anyone would need?

BLOCK is also the largest number of bytes to be read in one chunk, but again, a chunk of 2^18 bytes seems extreme. A size of 1K, 2K or 4K would be more reasonable.

genivia-inc commented 3 years ago

Performance testing in the past with RE/flex clearly showed that a block of at least 64K is needed to ensure good performance. However, on most modern systems with solid-state disks, 512K is optimal, as tested on Linux and Windows machines. It has been a while since I worked on this, so I don't recall the online resources I consulted, but my tests showed that those sources are correct.

genivia-inc commented 3 years ago

This performance boost is relevant when using RE/flex as a search engine, but not necessarily to tokenize input such as source code that is usually smaller.

MichaelBrazier commented 3 years ago

It's the performance for tokenizing input that I'm concerned with, yes. (Though the files I'm tokenizing can be larger than the buffer size.) My profiler is reporting a high level of virtual memory being paged to disk, and the input buffer is the largest block of allocated memory by far.

Can the block size be made configurable per scanner, so users can choose their own tradeoff point?

genivia-inc commented 3 years ago

Hm, 512K is not that large and is unlikely to cause swapping unless your RAM is a few MB instead of one or more GB. Also, the buffer size does not change during tokenization even if your input file is much larger: the buffer is a sliding window on the input and does not hold all of it.

genivia-inc commented 3 years ago

> Can the block size be made configurable per scanner, so users can choose their own tradeoff point?

Sure, we could do this like so:

```cpp
#ifndef REFLEX_BLOCK_SIZE
    static const size_t BLOCK = (256*1024); ///< buffer size and growth, buffer is initially 2*BLOCK size, at least 4096 bytes
#else
    static const size_t BLOCK = REFLEX_BLOCK_SIZE;
#endif
```
genivia-inc commented 3 years ago

I wrote the following section for the manual to update with the next release, perhaps this helps:

How to minimize runtime memory usage

genivia-inc commented 3 years ago

The v3.0.10 update includes REFLEX_BLOCK_SIZE and the corresponding documentation update.

genivia-inc commented 3 years ago

@MichaelBrazier Can this be closed?