Closed Adda0 closed 1 day ago
What does increment_current_state
do? It just ignores the flag?
Can you clarify on what do you want to better explain? RE2 can increment its internal state multiple times while still operating on a single Mata NFA state. Therefore, we need to omit incrementing the current Mata NFA state when, for example, RE2 increments its state for ^
symbol (the beginning of the line). This bool flag just allows us to specify that when we get such a state, we should not increment the current Mata NFA state.
If I understand correctly, it just ignores the flags ^
, $
, \b
, etc., right?
As of right now, yes. But it is meant to be generally usable if something like this appears again. It is a mechanism to make the NFA state independent of RE2 state.
I would have some discussion whether it is not better to throw an error for some of the flags, that we cannot handle them or something, but I am still ok with this.
Definitely. We have not tested \b
etc. And even more, RE2 seems to throw states with state kind EndOfText
or BeginOfText
for $
and ^
even though the kind should be EndOfLine
and BeginOfLine
. I do not understand it, but as of now, these non-printable characters are skipped.
This will play a role when we open a discussion about regex interpretation, that is, a{2}b
should match aab
, but also aab
inside fffaabfff
. We have two matching approaches: the first is just an automaton matching a{2}b
precisely, the other is .*a{2}b.*
, which is what normal regex matchers do. We should have a flag (by default, set to the first approach), where the user can define which matching approach they want (what kind of NFA they get from the regex). Then, the ^
and $
will play a role. In the first approach, they are irrelevant.
The EndOfLine
vs EndOfText
could be related with whether multi-line mode is enabled or not, by default I think it is disabled.
Good point. I believe this is correct. I will add the comment to the linked issue.
This PR attempts to fix parsing
^
and$
in regexes parsed by RE2. There are a lot of missing features that remain unimplemented and will have to be resolved in the future.The PR is supposed to fix issues from #457, #437, and #450. Whether it actually fixes the issues, that remains to be seen.