Closed jakobnissen closed 2 years ago
Ooh, execute_debug looks very nice. I wonder if it would be worth it to have some kind of companion AutomaUtils.jl package that would be useful for development, with some macros for doing stuff like this a bit more easily. Then you could not worry about load times.
My ideal workflow would be, say I have, Oh, I don't know, this in a string:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
KEYWORDS .
SOURCE Saccharomyces cerevisiae (baker's yeast)
ORGANISM Saccharomyces cerevisiae
Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE 1 (bases 1 to 5028)
...
I'd like to be able to work it out section by section. So I'd start with
locus_line = "LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999"
def_line = """
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds."""
# ... etc
Then write some regexes (I don't remember re syntax off the top of my head, pretend those could be regular regex strings):
sp = r"\s+"
loc = r"LOCUS"
acc = r"[A-Z]{3}\d{5}"
base_count = r"\d+"
bp = r"bp"
seq_type = r"(DNA)|(RNA)" # there are more of these
division = r"(PLN)|(ROD)|(MAM)" #36
date = r"\d{2}-[A-Z]{3}-\d{4}"
Then start testing.
@test_actions begin
machine = loc * sp * acc
test_string = "LOCUS SCU49845"
end
action: loc[:enter]
action: loc[:exit]
action: sp[:enter]
action: sp[:exit]
action: acc[:enter]
action: acc[:exit]
Then be able to build that up with combos. The most amazing would be if I have regexes A, B, and C = A+B+, some output like
action: C[:enter]
action: A[:enter]
action: A[:exit]
action: A[:enter]
action: A[:exit]
action: B[:enter]
action: B[:exit]
action: C[:exit]
But that's all for if things go right. Then if I've screwed up and nothing matches, I can go in and check the sub components or something.
I guess maybe what I should do is start to build that genbank parser and then ping you when I get stuck. Then, you're not allowed to help with the parser itself, you're only allowed to build the tools that will let me help myself :laughing:
That's probably a good idea, actually. I can create the @test_action stuff - but I can't make it do the nested stuff (like the C = A * B example).
Fair enough, that was just wild dreaming, any output of what actions are hit would be helpful I think.
I think it's time for me to revisit this soon.
On Slack, there were some complaints about Readers generated with Automa raising errors without enough information to understand what's going on. For example:
julia> FASTA.Record(">A\nTAG_A")
ERROR: ArgumentError: malformed FASTA file at line 2
In fact, the only reason line 2 is even shown is because it was added explicitly to the FASTA reader's code to ease debugging.
But when the machine fails in this particular example, we actually have access to much more information. We have the machine, the machine state, the input byte, and the current state of the input buffer. With these it should be possible to print this error message:
Malformed FASTA at line 2:
TAG_A
   ^
Observed byte: '_' at state 9. Outgoing edges:
* EOF (action: :letters, :record)
* [*\-A-Za-z]
* \r (action: :letters)
* \n (action: :countlines, :letters)
* [\t\v ] (action: :letters)
Input byte is not in any outgoing edge, and machine therefore stopped.
Which would be much easier to debug.
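To make the idea concrete, here is a rough sketch of how such a message could be assembled from the pieces listed above (input buffer, failing position, machine state, outgoing edges). It uses only Base Julia. `EdgeInfo` and `describe_failure` are hypothetical names invented for this illustration and are not part of Automa, and the state number and edge list are simply copied from the mock-up message above.

```julia
# Hypothetical description of one outgoing edge: a printable label
# plus the actions attached to that edge. Not an Automa type.
struct EdgeInfo
    label::String
    actions::Vector{Symbol}
end

# Build a human-readable report for a machine that stopped at byte index `pos`.
function describe_failure(data::String, pos::Int, state::Int, edges::Vector{EdgeInfo})
    bytes = Vector{UInt8}(data)
    # Walk outwards from `pos` to the surrounding newlines to isolate the offending line.
    start = pos
    while start > 1 && bytes[start - 1] != UInt8('\n')
        start -= 1
    end
    stop = pos - 1
    while stop < length(bytes) && bytes[stop + 1] != UInt8('\n')
        stop += 1
    end
    # Line number = newlines seen before `pos`, plus one.
    lineno = count(==(UInt8('\n')), view(bytes, 1:pos-1)) + 1
    io = IOBuffer()
    println(io, "Malformed input at line ", lineno, ":")
    println(io, String(bytes[start:stop]))
    println(io, " " ^ (pos - start), "^")  # caret under the offending byte
    println(io, "Observed byte: ", repr(Char(bytes[pos])), " at state ", state, ". Outgoing edges:")
    for edge in edges
        print(io, " * ", edge.label)
        isempty(edge.actions) || print(io, " (action: ", join(repr.(edge.actions), ", "), ")")
        println(io)
    end
    print(io, "Input byte is not in any outgoing edge, and machine therefore stopped.")
    return String(take!(io))
end

# Illustrative values only, mirroring the mock-up message.
edges = [
    EdgeInfo("EOF", [:letters, :record]),
    EdgeInfo(raw"[*\-A-Za-z]", Symbol[]),
    EdgeInfo(raw"\r", [:letters]),
    EdgeInfo(raw"\n", [:countlines, :letters]),
    EdgeInfo(raw"[\t\v ]", [:letters]),
]
msg = describe_failure(">A\nTAG_A", 7, 9, edges)
println(msg)
```

In a real implementation the edges would come from the compiled machine rather than a hand-written list; the sketch only shows that the buffer and position are enough to recover the line, the column, and the caret.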
Merging #64 (30b1d00) into master (39b4c3c) will not change coverage. The diff coverage is 100.00%.
:exclamation: Current head 30b1d00 differs from pull request most recent head f9de5c0. Consider uploading reports for the commit f9de5c0 to get more accurate results.
@@ Coverage Diff @@
## master #64 +/- ##
=======================================
Coverage 93.33% 93.33%
=======================================
Files 16 16
Lines 1756 1756
=======================================
Hits 1639 1639
Misses 117 117
Flag | Coverage Δ | |
---|---|---|
unittests | 93.33% <100.00%> (ø) |
Impacted Files | Coverage Δ | |
---|---|---|
src/dfa.jl | 88.40% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Downstream tests may fail here, because I rebased on master, in which the ambiguity check is enabled. I will move towards 1.0, and re-enable the ambiguity check. It's crucial to have the machines do as you expect. With the current changes, the error message is a little easier to understand. Before, the error message would look like:
ERROR: Ambiguous DFA: Input 0x58 from NFA node(s) 2 & 7 can lead to actions nothing or [:entering_A]
Now it looks like this (for a different error, but you get the idea):
ERROR: LoadError: Ambiguous NFA. After inputs ">\n", observing '\t' lead to conflicting action sets [:record] and [:mark]
Here, the reported input string is the minimal one needed to trigger the ambiguity error. It is typically shockingly short, less than a dozen characters.
I found that while the previous error technically gave more information, knowing that the ambiguity happened in NFA nodes X and Y didn't help much because I had to dump the NFA and convert to SVG anyway, then manually check the possible inputs that led to the states.
Here is a minimal example. If you have any comments about the readability, I'd love to hear them.
julia> using Automa; const re = Automa.RegExp; import Automa.RegExp: @re_str
julia> machine = let
           a = re"x"
           a.actions[:exit] = [:a]
           b = re"xy?"
           Automa.compile(a | b)
       end
ERROR: Ambiguous NFA. After inputs "x", observing EOF lead to conflicting action sets nothing and [:a]
This is awesome - huge QOL improvement! I haven't had the opportunity to poke at Automa machines in a while, sorry :-(
Latest commit adds code generation of a reasonable error message when a bad input is observed. It requires opting in for now, but will be the default in version 1.
Here's what it looks like when hacking FASTX.jl to use this functionality:
julia> using FASTX
julia> data = "
>good_record
ATAGA
>bad_record
>next
TAG"
"\n>good_record\nATAGA\n>bad_record\n>next\nTAG"
julia> collect(FASTA.Reader(IOBuffer(data)))
ERROR: Error during FSM execution at buffer position 33.
Last 33 bytes were:
"\n>good_record\nATAGA\n>bad_record\n>"
Observed input: '>' at state 4. Outgoing edges:
* [*\-A-Za-z]/mark
* [\t\v ]
* '\n'/countline
* '\r'
* EOF/record
Input is not in any outgoing edge, and machine therefore errored.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[...]
Going to merge #98, then rebase this on master, then merge, then work towards v1. @kescobo there might still be more non-breaking work to do on the usability front - and if so, please say so - but I'd rather merge this now and then add more stuff in a separate PR.
Personally, I'd prefer to get to 1.0 and then do non-breaking stuff.
This PR adds some better debugging capacities to Automa.
I'll keep adding to this PR until we feel it's getting really useful.
Currently implemented
By using Automa.nfa2dot, one can visualize the NFA, locate the two nodes (node 2 and 7 in this case) and pin down exactly where the conflict happens.
execute_debug function. You use it like this:
It returns a tuple, with the first element being the stopping state, and the second element being a vector of (input_byte, machine_state, actions) tuples. If ascii is true, the input bytes are Chars instead of UInt8 to make it easier to read. Demonstration:
julia> eval(create_debug_function(machine; ascii=true))
execute_debug (generic function with 1 method)
julia> execute_debug("XY")
(0, [('X', 2, Symbol[]), ('Y', 3, [:entering_A])])
julia> (cs, events) = debug_execute(my_regex, ">foo", ascii=true);
julia> events
5-element Vector{Tuple{Union{Nothing, Char}, Int64, Vector{Symbol}}}:
 ('>', 2, [:foo!])
 ('f', 3, [])
 ('o', 4, [])
 ('o', 4, [])
 (nothing, 0, [])
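Since the events are plain (input, state, actions) tuples, they are easy to post-process. As a small sketch in Base Julia only, the helper below turns such a vector into one readable line per step. `Event` and `format_events` are hypothetical names made up for this illustration, and the sample data is copied by hand from the demonstration above rather than produced by running a machine.

```julia
# Same element type as the vector shown in the demonstration; `nothing` marks EOF.
const Event = Tuple{Union{Nothing, Char}, Int, Vector{Symbol}}

# Render each step as "<input> -> state <n>  <actions>".
function format_events(events::Vector{Event})
    lines = String[]
    for (input, state, actions) in events
        shown = input === nothing ? "EOF" : repr(input)
        acts = isempty(actions) ? "(no actions)" : join(repr.(actions), ", ")
        push!(lines, string(rpad(shown, 5), " -> state ", state, "  ", acts))
    end
    return lines
end

# Hand-copied sample data, not the result of executing a real machine.
events = Event[
    ('>', 2, [:foo!]),
    ('f', 3, Symbol[]),
    ('o', 4, Symbol[]),
    ('o', 4, Symbol[]),
    (nothing, 0, Symbol[]),
]
lines = format_events(events)
foreach(println, lines)
```

A listing like this is close to the "which actions were hit" output asked for earlier in the thread, without needing any nesting information from the regex.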