VeriFIT / mata

A fast and simple automata library
MIT License
23 stars 13 forks source link

RE2Parser bug with regex begin line and end line markers #459

Closed Adda0 closed 1 day ago

Adda0 commented 4 days ago

This PR attempts to fix parsing ^ and $ in regexes parsed by RE2. There are a lot of missing features that remain unimplemented and will have to be resolved in the future.

The PR is supposed to fix issues from #457, #437, and #450. Whether it actually fixes the issues, that remains to be seen.

jurajsic commented 1 day ago

What does increment_current_state do? It just ignores the flag?

Adda0 commented 1 day ago

Can you clarify on what do you want to better explain? RE2 can increment its internal state multiple times while still operating on a single Mata NFA state. Therefore, we need to omit incrementing the current Mata NFA state when, for example, RE2 increments its state for ^ symbol (the beginning of the line). This bool flag just allows us to specify that when we get such a state, we should not increment the current Mata NFA state.

jurajsic commented 1 day ago

If I understand correctly, it just ignores the flags ^, $, \b, etc., right?

Adda0 commented 1 day ago

As of right now, yes. But it is meant to be generally usable if something like this appears again. It is a mechanism to make the NFA state independent of RE2 state.

jurajsic commented 1 day ago

I would have some discussion whether it is not better to throw an error for some of the flags, that we cannot handle them or something, but I am still ok with this.

Adda0 commented 1 day ago

Definitely. We have not tested \b etc. And even more, RE2 seems to throw states with state kind EndOfText or BeginOfText for $ and ^ even though the kind should be EndOfLine and BeginOfLine. I do not understand it, but as of now, these non-printable characters are skipped.

This will play a role when we open a discussion about regex interpretation, that is, a{2}b should match aab, but also aab inside fffaabfff. We have two matching approaches: the first is just an automaton matching a{2}b precisely, the other is .*a{2}b.*, which is what normal regex matchers do. We should have a flag (by default, set to the first approach), where the user can define which matching approach they want (what kind of NFA they get from the regex). Then, the ^ and $ will play a role. In the first approach, they are irrelevant.

jurajsic commented 1 day ago

The EndOfLine vs EndOfText could be related with whether multi-line mode is enabled or not, by default I think it is disabled.

Adda0 commented 1 day ago

Good point. I believe this is correct. I will add the comment to the linked issue.