BioJulia / Automa.jl

A Julia code generator for regular expressions

Improve debugging of Automa #64

Closed jakobnissen closed 2 years ago

jakobnissen commented 3 years ago

This PR adds some better debugging capabilities to Automa.

I'll keep adding to this PR until we feel it's getting really useful.

Currently implemented

julia> eval(create_debug_function(machine; ascii=true))
execute_debug (generic function with 1 method)

julia> execute_debug("XY")
(0, [('X', 2, Symbol[]), ('Y', 3, [:entering_A])])

### debug_execute
Added a function to make testing regexes easier:

julia> (cs, events) = debug_execute(my_regex, ">foo", ascii=true);

julia> events
5-element Vector{Tuple{Union{Nothing, Char}, Int64, Vector{Symbol}}}:
 ('>', 2, [:foo!])
 ('f', 3, [])
 ('o', 4, [])
 ('o', 4, [])
 (nothing, 0, [])
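To pull out only the steps where an action fired, the returned events can be filtered with plain Julia. This is a minimal sketch, assuming each tuple is (input, state, actions) as the output above suggests. The events vector is copied by hand from that output rather than produced by Automa, and reading the second field as the machine state is an assumption.

```julia
# Events copied by hand from the REPL output above; not generated here.
# Assumed layout of each tuple: (input char or nothing at EOF, state, actions).
events = Tuple{Union{Nothing,Char},Int,Vector{Symbol}}[
    ('>', 2, [:foo!]),
    ('f', 3, Symbol[]),
    ('o', 4, Symbol[]),
    ('o', 4, Symbol[]),
    (nothing, 0, Symbol[]),
]

# Keep only the steps where at least one action fired, with their positions.
fired = [(i, input, actions)
         for (i, (input, _, actions)) in enumerate(events)
         if !isempty(actions)]

for (i, input, actions) in fired
    println("step ", i, ": input ", repr(input), " fired ", actions)
end
```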



@kescobo Please suggest other things to add that would make working with Automa more pleasant. :)

kescobo commented 3 years ago

Ooh, execute_debug looks very nice. I wonder if it would be worth it to have some kind of companion AutomaUtils.jl package that would be useful for development, with some macros for doing stuff like this a bit more easily. Then you wouldn't have to worry about load times.

My ideal workflow would be, say I have, Oh, I don't know, this in a string:

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
...

I'd like to be able to work out section by section. So I'd start with

locus_line = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
def_line = """
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds."""
# ... etc

Then write some regexes (I don't remember re syntax off the top of my head, so pretend these could be regular regex strings):

sp = r"\s+"
loc = r"LOCUS"
acc = r"[A-Z]{3}\d{5}"
base_count = r"\d+"
bp = r"bp"
seq_type = r"(DNA)|(RNA)" # there are more of these
division = r"(PLN)|(ROD)|(MAM)" #36 
date = r"\d{2}-[A-Z]{3}-\d{4}"

Then start testing.

@test_actions begin
    machine = loc * sp * acc
    test_string = "LOCUS       SCU49845"
end
action: loc[:enter]
action: loc[:exit]
action: sp[:enter]
action: sp[:exit]
action: acc[:enter]
action: acc[:exit]

Then be able to build that up with combos. The most amazing would be if I have regexes A, B, and C = A+B+, and get some output like

action: C[:enter]
    action: A[:enter]
    action: A[:exit]
    action: A[:enter]
    action: A[:exit]
    action: B[:enter]
    action: B[:exit]
action: C[:exit]
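The indentation in that output is only a presentation step once balanced enter/exit events exist. This is a minimal sketch in plain Julia, assuming a hypothetical event format of (name, :enter or :exit) pairs in firing order. `format_trace` is an illustrative name, not an Automa function, and the sketch says nothing about how such nested events would be obtained from a machine.

```julia
# Hypothetical helper: indent a flat list of (name, kind) events by nesting depth.
# Assumes every :exit is preceded by a matching :enter.
function format_trace(events)
    io = IOBuffer()
    depth = 0
    for (name, kind) in events
        kind === :exit && (depth -= 1)      # an exit closes the current level
        println(io, "    "^depth, "action: ", name, "[:", kind, "]")
        kind === :enter && (depth += 1)     # an enter opens a new level
    end
    return String(take!(io))
end

# The C = A+B+ example from above, written out by hand.
events = [(:C, :enter), (:A, :enter), (:A, :exit), (:A, :enter), (:A, :exit),
          (:B, :enter), (:B, :exit), (:C, :exit)]
print(format_trace(events))
```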

But that's all for if things go right. Then if I've screwed up and nothing matches, I can go in and check the subcomponents or something.
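Checking the subcomponents could look roughly like this. It is a sketch that uses Base (PCRE) regexes as stand-ins for Automa regexes, as suggested above. `matched_prefix` is a hypothetical helper, and concatenating regexes with `*` assumes Julia 1.3 or later.

```julia
# Stand-ins for Automa regexes: Base (PCRE) regexes, not Automa's re"..." syntax.
loc = r"LOCUS"
sp = r"\s+"
acc = r"[A-Z]{3}\d{5}"

# Hypothetical helper, not an Automa function: number of leading components
# whose concatenation still matches at the start of `input`.
function matched_prefix(components::Vector{Regex}, input::AbstractString)
    n = 0
    for k in eachindex(components)
        pattern = reduce(*, components[1:k])  # Regex * Regex needs Julia 1.3+
        startswith(input, pattern) || break
        n = k
    end
    return n
end

locus_line = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
matched_prefix([loc, sp, acc], locus_line)               # all three components match
matched_prefix([loc, sp, r"[A-Z]{4}\d{5}"], locus_line)  # stops after `sp`
```

The return value points at the first component worth inspecting: a result of 2 out of 3 means the third piece is the one that fails on this input.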

kescobo commented 3 years ago

I guess maybe what I should do is start to build that genbank parser and then ping you when I get stuck. Then, you're not allowed to help with the parser itself, you're only allowed to build the tools that will let me help myself :laughing:

jakobnissen commented 3 years ago

That's probably a good idea, actually. I can create the @test_actions stuff, but I can't make it do the nested stuff (like the C = A+B+ example).

kescobo commented 3 years ago

Fair enough, that was just wild dreaming. Any output of which actions are hit would be helpful, I think.

jakobnissen commented 2 years ago

I think it's time for me to revisit this soon. On Slack, there were some complaints about Readers generated with Automa creating errors without enough information to understand what's going on. For example:

julia> FASTA.Record(">A\nTAG_A")
ERROR: ArgumentError: malformed FASTA file at line 2

In fact, the only reason line 2 is even shown is because it was added explicitly to the FASTA reader's code to ease debugging. But when the machine fails in this particular example, we actually have access to much more information. We have the machine, the machine state, the input byte, and the current state of the input buffer. With these it should be possible to print this error message:

Malformed FASTA at line 2:

   TAG_A
      ^
   Observed byte: '_' at state 9. Outgoing edges:
     *  EOF (action: :letters, :record)
     *  [*\-A-Za-z]
     *  \r (action: :letters)
     *  \n (action: :countlines, :letters)
     *  [\t\v ] (action: :letters)

    Input byte is not in any outgoing edge, and machine therefore stopped.

Which would be much easier to debug.
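The pointer line in the mock-up above needs only the input buffer and the failing byte position. This is a minimal sketch in plain Julia, assuming a 1-based byte index and one display column per byte (ASCII input). `point_at_byte` is a hypothetical name, not an Automa function, and the edge listing is left out because it needs the machine itself.

```julia
# Hypothetical helper, not part of Automa: show the line containing the byte at
# 1-based index `pos`, with a caret underneath it.
# Assumes ASCII input, so that one byte occupies one display column.
function point_at_byte(data::AbstractString, pos::Integer)
    bytes = collect(codeunits(data))
    1 <= pos <= length(bytes) || throw(ArgumentError("pos must index a byte of data"))
    newline = UInt8('\n')
    prev = findprev(==(newline), bytes, pos - 1)  # newline before the failing byte
    next = findnext(==(newline), bytes, pos)      # newline at or after it
    start = prev === nothing ? 1 : prev + 1
    stop = next === nothing ? length(bytes) : next - 1
    line_number = count(==(newline), bytes[1:pos-1]) + 1
    line = String(bytes[start:stop])
    caret = " "^(pos - start) * "^"
    return "Malformed input at line $line_number:\n\n   $line\n   $caret\n"
end

# The '_' in the example record is byte 7 of the input.
print(point_at_byte(">A\nTAG_A", 7))
```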

codecov[bot] commented 2 years ago

Codecov Report

Merging #64 (30b1d00) into master (39b4c3c) will not change coverage. The diff coverage is 100.00%.

:exclamation: Current head 30b1d00 differs from pull request most recent head f9de5c0. Consider uploading reports for the commit f9de5c0 to get more accurate results

@@           Coverage Diff           @@
##           master      #64   +/-   ##
=======================================
  Coverage   93.33%   93.33%           
=======================================
  Files          16       16           
  Lines        1756     1756           
=======================================
  Hits         1639     1639           
  Misses        117      117           
Flag Coverage Δ
unittests 93.33% <100.00%> (ø)


Impacted Files Coverage Δ
src/dfa.jl 88.40% <100.00%> (ø)


jakobnissen commented 2 years ago

Downstream tests may fail here, because I rebased on master, in which the ambiguity check is enabled. I will move towards 1.0, and re-enable the ambiguity check. It's crucial to have the machines do as you expect. With the current changes, the error message is a little easier to understand. Before, the error message would look like:

ERROR: Ambiguous DFA: Input 0x58 from NFA node(s) 2 & 7 can lead to actions nothing or [:entering_A]

Now it looks like (for a different error, but you get the idea)

ERROR: LoadError: Ambiguous NFA. After inputs ">\n", observing '\t' lead to conflicting action sets [:record] and [:mark]

Here, the reported input is the minimal string needed to trigger the ambiguity error. The string is typically shockingly short, less than a dozen characters.

I found that while the previous error technically gave more information, knowing that the ambiguity happened in NFA nodes X and Y didn't help much because I had to dump the NFA and convert to SVG anyway, then manually check the possible inputs that led to the states.

Here is a minimal example. If you have any comments about the readability, I'd love to hear them.

julia> using Automa; const re = Automa.RegExp; import Automa.RegExp: @re_str

julia> machine = let
           a = re"x"
           a.actions[:exit] = [:a]
           b = re"xy?"
           Automa.compile(a | b)
       end
ERROR: Ambiguous NFA. After inputs "x", observing EOF lead to conflicting action sets nothing and [:a]

kescobo commented 2 years ago

This is awesome - huge QOL improvement! I haven't had the opportunity to poke at Automa machines in a while, sorry :-(

jakobnissen commented 2 years ago

Latest commit adds code generation of a reasonable error message when a bad input is observed. It requires opting in for now, but will be the default in version 1.

Here's what it looks like when hacking FASTX.jl to use this functionality:

julia> using FASTX

julia> data = "
       >good_record
       ATAGA
       >bad_record
       >next
       TAG"
"\n>good_record\nATAGA\n>bad_record\n>next\nTAG"

julia> collect(FASTA.Reader(IOBuffer(data)))
ERROR: Error during FSM execution at buffer position 33.
Last 33 bytes were:

"\n>good_record\nATAGA\n>bad_record\n>"

Observed input: '>' at state 4. Outgoing edges:
 * [*\-A-Za-z]/mark
 * [\t\v ]
 * '\n'/countline
 * '\r'
 * EOF/record

Input is not in any outgoing edge, and machine therefore errored.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
[...]

jakobnissen commented 2 years ago

Going to merge #98, then rebase this on master, then merge, then work towards v1. @kescobo there might still be more non-breaking work to do on the usability front (and if so, please say so), but I'd rather merge this now and then add more stuff in a separate PR.

kescobo commented 2 years ago

Personally, I'd prefer to get to 1.0 and then do non-breaking stuff.