Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
520 stars 85 forks source link

iterator support (begin(), end()) #187

Closed mgood7123 closed 1 year ago

mgood7123 commented 1 year ago

is there support for iterators?

such as supported by std regex?

std::regex_search(data.begin(), data.end(), ... );

genivia-inc commented 1 year ago

The following type of iterators are provided with the reflex library to search:

for (auto& iter : some_matcher.find)
  ...

For example, with PCRE2:

#include <reflex/pcre2matcher.h>
for (auto& match : reflex::PCRE2Matcher("\\w+", "How now brown cow.").find)
  std::cout << match.text() << std::endl;

displays

How
now
brown
cow

You can bind the input "data" to the matcher with the input() method or pass it as an argument to the constructor as shown above with the literal string. Input can be string, wide string or an open file/stream handle such as string stream.

See https://www.genivia.com/doc/reflex/html/index.html#regex-methods-find

mgood7123 commented 1 year ago

what about for the input itself? eg

#include <reflex/pcre2matcher.h>

// a.txt : How now brown cow.

// obtain a bidirectional iterator to a.txt, from position 0 ('H') to EOF
// underlying interface could be mmap/MappedFile based, or std::ifstream based
auto it = BidirIt("a.txt");

// invoke the matcher with the bidirectional iterator, this avoids needing to read back the file if it is a large vile (size > 2GB)
for (auto& match : reflex::PCRE2Matcher("\\w+", it.begin(), it.end()).find)
  std::cout << match.text() << std::endl;

additionally how would we obtain non-matching input?

for example

// "abc14FOOre\n"
[NON_MATCH: abc] [MATCH: 14FOO] [NON_MATCH: re
]
genivia-inc commented 1 year ago

Non-matching input can be matched with a scanner scan instead of searching with find.

  for (auto& match : reflex::PCRE2Matcher("(abc)|(.|\\n)", <INPUT>).scan)
    if (match.accept() == 1)
      std::cout << "match: " << match.text() << std::endl;
    else
      std::cout << "not a match: " << match.text() << std::endl;

With this approach a non-match is a single character, not a string though.

To get all non-matches without matches, use split to split up the input between the patterns specified.

There is another more efficient way using a callback function registered as a handle. This handle is passed the buffer contents that are "shifted" out when the matcher advances. This way you get the input before a match even when that input part is very large and won't fit in a buffer and is shifted out. It's the most efficient way because we do not need a large buffer (as large as a whole file), which is used by ugrep. The part that fits in a buffer is always after the last newline (unless a line is several KB long). If you want to buffer the entire input, then use the buffer() method of the matcher. This means that all text (matches and non-matches) are in a single buffer space. You can also use first() - border() to point to border() bytes (non-0-terminated) before the text match from the start of a line.

mgood7123 commented 1 year ago

alright, but this still implies we can store the entire input in RAM for the scanner to scan, which might not be possible

for example, we may need to scan a file who's size is larger than our available free RAM

genivia-inc commented 1 year ago

Storing the entire input in RAM is not the approach I'd advocate. But you can define a regex that consumes the input between the matches you want and distinguish them by group number. If a match is very large, this has to be buffered though. So buffer size depends on the pattern matched. So spitting out a single mismatching character at a time as I've suggested is not a bad idea for performance.

Otherwise, you may want to think about defining a buffer shift handler as I've hinted at. The documentation doesn't mention this handler. It mentions other type of handlers. It's the reflex::AbstractMatcher::Handler abstract class that is a functor you can declare with behavior you want. Then set if with reflex::AbstractMatcher::set_handler(handler). The functor is passed the buffer buf, the amount shifted out of the buffer gap, and the position in the input num of the begin of line ("character count of the input till bol_").

mgood7123 commented 1 year ago

hmm alright

genivia-inc commented 1 year ago

It's often a battle of tradeoffs in technology and algorithms. Either it's a simple way, which is less efficient and eats memory, or the hard way, which is fast and uses memory smartly. Don't ask for miracles :)