BioJulia / Automa.jl

A julia code generator for regular expressions
Other
188 stars 15 forks source link

Pushdown automata #59

Closed jakobnissen closed 2 years ago

jakobnissen commented 3 years ago

So @BenJWard recently mentioned it might be a good idea to implement pushdown automata (PDA) in order to broaden in the types of formats that Automa can handle. At the moment, Automa's finite state automata (FSA) can't handle any type of recursively defined grammars. Unfortunately, these are very common.

I wholeheartedly agree with this vision. I also believe that Automa already have large parts of the code needed to implement PDAs, and that the code is probably not that big of a change (though definitely a larger project with hundreds of lines of code).

I'd like to keep this issue open for a while and discuss/brainstorm approaches to do this. I think it's crucial that Automa retains its high performance we expect from its FSAs. That implies that we need to have a mechanism for "compiling" the PDAs to have the smallest possible amount of states and symbols in its stack.

As an example, I propose we use "SimpleNewick", a simplified Newick format. This format is inherently recursive and cannot be recognized by any existing automaton of Automa. The grammar is defined as:

@pushdown begin
    tree = subtree * ";"
    subtee = name | internal
    name = "" | ("A-Z" * name)
    internal = "(" * branchset * ")" * name
    branchset = subtree | (subtree * "," * branchset)
end

(as an aside, I think having a @pushdown macro work like this would be pretty cool, and not too hard to implement).

This grammar can be represented with an Automa nondeterministic finite automaton (NFA), which can be represented like this .dot file using ordinary Automa notation (where arrow down mean push onto the stack and arrow up means pop):

``` digraph { graph [ rankdir = LR ]; start [ shape = point ]; tree_s [ shape = circle ]; tree_e [ shape = doublecircle ]; tree_1 [ shape = circle ]; subtree_s [ shape = circle ]; subtree_e [ shape = circle ]; name_s [ shape = circle ]; name_1 [ shape = circle ]; name_e [ shape = circle ]; internal_s [ shape = circle ]; internal_1 [ shape = circle ]; internal_2 [ shape = circle ]; internal_3 [ shape = circle ]; internal_e [ shape = circle ]; branchset_s [ shape = circle ]; branchset_e [ shape = circle ]; branchset_1 [ shape = circle ]; branchset_2 [ shape = circle ]; start -> tree_s [ label = "ϵ/⬇tree" ]; tree_s -> subtree_s [ label = "ϵ/⬇subtree" ]; tree_1 -> tree_e [ label = "\";\"/⬆tree" ]; subtree_s -> name_s [ label = "ϵ/⬇name" ]; subtree_s -> internal_s [ label = "ϵ/⬇internal" ]; subtree_e -> branchset_e [ label = "ϵ/⬆branchset" ]; subtree_e -> branchset_1 [ label = "ϵ"]; subtree_e -> tree_1 [ label = "ϵ"]; name_s -> name_e [ label = "ϵ/⬆name" ]; name_s -> name_1 [ label = "\"A-Z\"" ]; name_1 -> name_s [ label = "ϵ/⬇name" ]; name_e -> name_e [ label = "ϵ/⬆name" ]; name_e -> subtree_e [ label = "ϵ/⬆subtree" ]; name_e -> internal_e [ label = "ϵ/⬆internal" ]; internal_s -> internal_1 [ label = "\"(\"" ]; internal_1 -> branchset_s [ label = "ϵ/⬇branchset" ]; internal_2 -> internal_3 [ label = "\")\"" ]; internal_3 -> name_s [ label = "ϵ/⬇name" ]; internal_e -> subtree_e [ label = "ϵ/⬆subtree" ]; branchset_s -> subtree_s [ label = "ϵ/⬇subtree" ]; branchset_1 -> branchset_2 [ label = "\",\"" ]; branchset_2 -> branchset_s [ label = "ϵ/⬇branchset" ]; branchset_e -> internal_2 [ label = "ϵ"]; branchset_e -> branchset_e [ label = "ϵ/⬆branchset" ]; } ```

It's pretty easy to create this NFA programatically from the grammar above. In fact, that's what I did (by hand, though). We can use existing Automa code to compile it first to an optimized deterministic finite automaton (DFA), and then to a new PDA type. The symbols on the stack can be encoded as simply Int32 for efficiency, or even as a Julia enum if we really want.

Challenges

Right now, I see a few challenges:

jakobnissen commented 3 years ago

Here is the hand-optimized DFA version of the above NFA. The algorithm I used was quite simple: Begin at the start node. Move along all possible epsilon-paths until you find an edge with some bytes on it. Keep track of what you push and pop on the stack. You are not allowed to move through an edge that pops a symbol that is not at the top of the stack (can be resolved at compile time). If the stack is empty, you only know for sure the stack is truly empty if you began at the start node. If you didn't you can move through any edges that pop symbols off the stack when the stack is empty.

Then, for each node you land on, jump past the byte-edge and repeat the process until you have visited all edges.

Finally, there might be some nodes that turn out to be identical: In the NFA in the previous comment, compare name_1 and internal_3. They have the exact same edges - namely one eps edge to name_s. So they are identical. Same with branchset_2 and internal_1. Collapse these to a single node.

In the notation below, ⬆+tree means "pop off as many tree symbols as possible, at least one", and ⬆?tree means "pop off one tree symbol if possible".

If you move to an e-edge with a self-eps edge that pops e.g, that's gives a ⬆+-action.

The only "magic" (non-algorithmic thinking I did) was that, in the NFA diagram, one can move from name_e to subtree_e through two paths. This should yield ⬆?internal,⬆subtree. Not sure how to make an algorithm that realizes that.

``` digraph { graph [ rankdir = LR ]; start [ shape = point ]; STATE1 [ shape = circle ]; INT1 [ shape = circle ]; INT3 [ shape = doublecircle ]; EOF_START [ shape = point ]; EOF_INT3 [ shape = point ]; start -> STATE1; STATE1 -> INT1 [ label = "\"(\"/⬇tree,⬇subtree,⬇internal" ]; STATE1 -> INT3 [ label = "\"[A-Z]\"/⬇tree,⬇subtree,⬇name" ]; STATE1 -> EOF_START [ label = "\";\"/EOF", style = dashed ]; INT1 -> INT1 [ label = "\",\"/⬇branchset" ]; INT1 -> INT1 [ label = "\"(\"/⬇branchset,⬇subtree,⬇internal" ]; INT1 -> INT3 [ label = "\"[A-Z]\"/⬇branchset,⬇subtree,⬇name" ]; INT1 -> INT3 [ label = "\")\"/⬆+branchset" ]; INT3 -> INT1 [ label = "\",\"/⬆+name,⬆?internal,⬆subtree" ] INT3 -> EOF_INT3 [ label = "\";\"/⬆+name,⬆?internal,⬆subtree,⬆+tree,EOF", style = dashed ]; INT3 -> INT3 [ label = "\"[A-Z]\"/⬇name" ]; INT3 -> INT3 [ label = "\")\"/⬆+name,⬆?internal,⬆subtree,⬆+branchset" ]; } ```
richardreeve commented 3 years ago

I've just seen this, and it would be great if recursive grammars were possible in Automa. I've got a handwritten newick and nexus tree parser in Phylo.jl which is a complete mess to be honest, and I'd love to be able to use Automa instead!

jakobnissen commented 3 years ago

I'd love to add this functionality, but I don't have the skills to do it. I think we need someone who really understand automata. Unless someone new comes around and makes a PR, this probably is not going to happen, unfortunately.

richardreeve commented 3 years ago

That's a shame. I used to use yacc and bison back in the day and got very spoiled. Crafting all this boilerplate by hand is not a happy thing at all. I may still adopt your tokeniser, which I take it I can call without needing to use the finite state machinery?

jakobnissen commented 3 years ago

The tokenizer is a finite state machine in disguise. You can also build a FSM with custom code where you manage some extra state - but that's probably going to be a terrible experience to write (although extremely efficient) 😅.

jakobnissen commented 3 years ago

@ Azzaare Regarding your question about PDA in the other issue - on a user-interface level, PDA and FSMs should be separate. First because the user needs to be sure they actually get an FSM when they asked for one, and second because FSMs and PDAs need to be constructed from different things. FSMs can be constructed from regular expressions, whereas PDAs need to be constructed from a context-free grammar. See the top post.

kescobo commented 3 years ago

Not sure you actually tagged @Azzaare

Azzaare commented 3 years ago

Thanks!

I also got used to Yacc and Bison when I was a student ... I guess I was spoiled too. I just finished writing a bibtex parser manually ... it was a pain. And that was just because of one small part of the grammar with recursion...

The good point is that if there is no direct syntax changes in FSM when eventually PDA would be implemented in Automa.jl, v1.0 can be tagged, and PDA can be added in some 1.x (and not force any v2.0). That was my main concern.

Maybe we need Yacc.jl and Bison.jl :p

Kolaru commented 3 years ago

I implemented a LaTeX math mode parser using Automa to build a PDA. Managing the stack manually was not actually too bad:

https://github.com/Kolaru/MathTeXParser.jl/blob/master/src/parser.jl

If there is interest for an example of a PDA implemented with Automa I can slim it down and make a PR.

jakobnissen commented 2 years ago

I'm going to close this as out of scope. It would be fantastic if someone would implement a pushdown automata generator, but it is probably not going to happen any time soon, and it might not even fit in this package, if someone was to implement it.