Closed jakobnissen closed 3 years ago
Simplified to
machine = let
C = re.rep(re"A" * re"B")
D = C * re"A"
C.actions[:enter] = [:enter]
C.actions[:exit] = [:exit]
Automa.compile(D)
end
Note in particular that the C pattern is entered once, and then can be exited an arbitrary amount of times with an input like "ABABAB"
.
Right, so I think I've figured out the problem. Let's have a simple regex C
:
A = re"XZ"
B = re"XY"
C = A | B
A.actions[:enter] = [:entering_A]
B.actions[:enter] = [:entering_B]
A.actions[:exit] = [:exiting_A]
B.actions[:exit] = [:exiting_B]
Now, create an NFA using Automa.re2nfa(C)
and visualize it. This NFA represents those regexes and their actions perfectly. Note that this NFA is truly nondeterministic; when encountering an "X"
byte, it is impossible to know if we are entering the A
regex or the B
regex. In the NFA, this is visualized by two different epsilon edges, one that leads to A
and one that leads to B
. Only the one that leads to A
have the action entering_A
associated with it, and vice versa for B.
Now convert the NFA to a DFA. The resulting DFA is equivalent to the NFA only in the sense that it recognizes the same inputs. The edges are NOT the same, and consequently, the two epsilon edges, labeled with entering_A
and entering_B
are collapsed, and important information is lost. The result is that when we encounter an "X"
byte, the regex enters BOTH A and B. Notably, at end of file, only ONE of these regexes are exited (which is clearly absurd)
This is impossible to solve, I think, but it would really be nice if there was some sort of warning or even an error if attempting to convert an NFA that had actions associated to ambiguous transitions to a DFA.
Edit: This can be done by modifying the epsilon_closure
function. If you can go multiple paths from one node to another node only moving through epsilon edges, and not all these paths have the same labeled edges, then the machine is ambiguous and should not be compiled.
Solved by #49
Consider the following, highly simplified FASTA format:
By definition of
record
, arecord
MUST contain aheader
, which contains a>
. However see the flowchart produced for this machine by:The state 5 corresponds to
seqline
, and a following newline brings it to state 6. In this state transition, the actionrecord
is executed. But clearly, from the defition, we did not necessarily exitrecord
! This creates a loop between state 5 and 6, such thatrecord
is executed at every newline.This means that
record
can be repeatedly exited, without it being entered, and allows for e.g. arecord
only containingseqline
, against the definition above.