Closed: MichaelBrazier closed this issue 3 years ago.
Performance testing in the past with RE/flex clearly showed that a block size of at least 64K is needed to ensure good performance. However, on most modern systems with solid state disks, 512K is optimal, as was tested on Linux and Windows machines. It has been a while since I worked on this, so I don't recall the online resources I consulted, but my tests showed that those sources are correct.
This performance boost is relevant when using RE/flex as a search engine, but not necessarily when tokenizing input such as source code, which is usually smaller.
It's the performance for tokenizing input that I'm concerned with, yes. (Though the files I'm tokenizing can be larger than the buffer size.) My profiler is reporting a high level of virtual memory being paged to disk, and the input buffer is the largest block of allocated memory by far.
Can the block size be made configurable per scanner, so users can choose their own tradeoff point?
Hm, 512K is not that large and is unlikely to cause swapping unless your RAM is a few MB instead of one or more GB. Also, the buffer size will not change during the tokenization process even if your input file is much larger. The buffer is a window on the input and does not buffer all input.
> Can the block size be made configurable per scanner, so users can choose their own tradeoff point?
Sure, we could do this like so:
```cpp
#ifndef REFLEX_BLOCK_SIZE
static const size_t BLOCK = (256*1024); ///< buffer size and growth; buffer is initially 2*BLOCK bytes, at least 4096 bytes
#else
static const size_t BLOCK = REFLEX_BLOCK_SIZE;
#endif
```
I wrote the following section for the manual to update with the next release, perhaps this helps:
Use reflex option `--full` to create a table DFA for the scanner's regular expression patterns, or option `--fast` to generate a direct-coded DFA. Without one of these options, by default a DFA is created at runtime and stored in heap space.

Compile the generated source code with `-DREFLEX_BLOCK_SIZE=4096` to override the internal `reflex::AbstractMatcher::Const::BLOCK` buffer size. By default, the `reflex::AbstractMatcher::Const::BLOCK` size is 256K, for a large 512K buffer optimized for high-performance file searching and tokenization. The buffer is a sliding buffered window over the input, i.e. input files may be much larger than the buffer size. Furthermore, a small buffer expands to accommodate larger pattern matches. However, when using the `line()` and `wline()` methods, lines longer than `REFLEX_BLOCK_SIZE` may not fit, and the string values returned by `line()` and `wline()` may be truncated. A reasonably small `REFLEX_BLOCK_SIZE` is 8192, for a 16K buffer. Small buffer sizes increase processing time, since the buffered window must be moved along a file more frequently, and they increase the cost of decoding UTF-16/32 into UTF-8 multibyte sequences.

@warning The value of `REFLEX_BLOCK_SIZE` should not be less than 4096.
The v3.0.10 update includes `REFLEX_BLOCK_SIZE` and a documentation update.
@MichaelBrazier Can this be closed?
At line 105 of absmatcher.h, the constant BLOCK is set to 256*1024 = 2^18. The input buffer allocates twice this number of bytes, or 2^19 bytes. Surely that's much more than anyone would need?
BLOCK is also the largest number of bytes to be read in one chunk, but again, a chunk of 2^18 bytes seems extreme. A size of 1K, 2K or 4K would be more reasonable.