andrewcooke / ParserCombinator.jl

A parser combinator library for Julia
Other
107 stars 20 forks source link

Build status Codecov

ParserCombinator

A parser combinator library for Julia, similar to those in other languages, like Haskell's Parsec or Python's pyparsing. It can parse any iterable type (not just strings) (except for regexp matchers, of course).

ParserCombinator's main advantage is its flexible design, which separates the matchers from the evaluation strategy. This makes it easy to "plug in" memoization, or debug traces, or to restrict backtracking in a similar way to Parsec - all while using the same grammar.

It also contains pre-built parsers for Graph Modelling Language and DOT.

Example

using ParserCombinator

# the AST nodes we will construct, with evaluation via calc()

abstract type Node end
Base.:(==)(n1::Node, n2::Node) = n1.val == n2.val
calc(n::Float64) = n
struct Inv<:Node val end
calc(i::Inv) = 1.0/calc(i.val)
struct Prd<:Node val end
calc(p::Prd) = Base.prod(map(calc, p.val))
struct Neg<:Node val end
calc(n::Neg) = -calc(n.val)
struct Sum<:Node val end
calc(s::Sum) = Base.sum(map(calc, s.val))

# the grammar (the combinators!)

sum = Delayed()
val = E"(" + sum + E")" | PFloat64()

neg = Delayed()       # allow multiple (or no) negations (eg ---3)
neg.matcher = val | (E"-" + neg > Neg)

mul = E"*" + neg
div = E"/" + neg > Inv
prd = neg + (mul | div)[0:end] |> Prd

add = E"+" + prd
sub = E"-" + prd > Neg
sum.matcher = prd + (add | sub)[0:end] |> Sum

all = sum + Eos()

# and test

# this prints 2.5
calc(parse_one("1+2*3/4", all)[1])

# this prints [Sum([Prd([1.0]),Prd([2.0])])]
parse_one("1+2", all)

Some explanation of the above:

And it supports packrat parsing too (more exactly, it can memoize results to avoid repeating matches).

Still, for large parsing tasks (eg parsing source code for a compiler) it would probably be better to use a wrapper around an external parser generator, like Anltr.

Note: There's an issue with the Compat library which means the code above (the assignment to Delayed.matcher) doesn't work with 0.3. See calc.jl for the uglier, hopefully temporary, 0.3 version.

Install

julia> Pkg.add("ParserCombinator")

Manual

Evaluation

Once you have a grammar (see below) you can evaluate it against some input in various ways:

These are all implemented by providing different Config subtypes. For more information see Design, types.jl and parsers.jl.

Basic Matchers

In what follows, remember that the power of parser combinators comes from how you combine these. They can all be nested, refer to each other, etc etc.

Equality

julia> parse_one("abc", Equal("ab"))
1-element Array{Any,1}:
 "ab"

julia> parse_one("abc", Equal("abx"))
ERROR: ParserCombinator.ParserException("cannot parse")

This is so common that there's a corresponding string literal (it's "e" for `Equal(), the corresponding matcher).

julia> parse_one("abc", e"ab")
1-element Array{Any,1}:
 "ab"

Sequences

Matchers return lists of values. Multiple matchers can return lists of lists, or the results can be "flattened" a level (usually more useful):

julia> parse_one("abc", Series(Equal("a"), Equal("b")))
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", Series(Equal("a"), Equal("b"); flatten=false))
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> parse_one("abc", Seq(Equal("a"), Equal("b")))
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", And(Equal("a"), Equal("b")))
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> parse_one("abc", e"a" + e"b")
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", e"a" & e"b")
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

Where Series() is implemented as Seq() or And(), depending on the value of flatten (default true).

Warning - The sugared syntax has to follow standard operator precedence, where | binds more tightly that +. This means that

   matcher1 + matcher2 | matcher3

is almost always an error because it means:

   matcher1 + (matcher2 | matcher3)

while what was intended was:

   (matcher1 + matcher2) | matcher3

Empty Values

Often, you want to match something but then discard it. An empty (or discarded) value is an empty list. This may help explain why I said flattening lists was useful above.

julia> parse_one("abc", And(Drop(Equal("a")), Equal("b")))
2-element Array{Any,1}:
 Any[]
 Any["b"]

julia> parse_one("abc", Seq(Drop(Equal("a")), Equal("b")))
1-element Array{Any,1}:
 "b"

julia> parse_one("abc", ~e"a" + e"b")
1-element Array{Any,1}:
 "b"

julia> parse_one("abc", E"a" + e"b")
1-element Array{Any,1}:
 "b"

Note the ~ (tilde / home directory) and capital E in the last two examples, respectively.

Alternates

julia> parse_one("abc", Alt(e"x", e"a"))
1-element Array{Any,1}:
 "a"

julia> parse_one("abc", e"x" | e"a")
1-element Array{Any,1}:
 "a"

Warning - The sugared syntax has to follow standard operator precedence, where | binds more tightly that +. This means that

   matcher1 + matcher2 | matcher3

is almost always an error because it means:

   matcher1 + (matcher2 | matcher3)

while what was intended was:

   (matcher1 + matcher2) | matcher3

Regular Expressions

julia> parse_one("abc", Pattern(r".b."))
1-element Array{Any,1}:
 "abc"

julia> parse_one("abc", p".b.")
1-element Array{Any,1}:
 "abc"

julia> parse_one("abc", P"." + p"b.")
1-element Array{Any,1}:
 "bc"

As with equality, a capital prefix to the string literal ("p" for "pattern" by the way) implies that the value is dropped.

Note that regular expresions do not backtrack. A typical, greedy, regular expression will match as much of the input as possible, every time that it is used. Backtracking only exists within the library matchers (which can duplicate regular expression functionality, when needed).

Repetition

julia> parse_one("abc", Repeat(p"."))
3-element Array{Any,1}:
 "a"
 "b"
 "c"

julia> parse_one("abc", Repeat(p".", 2))
2-element Array{Any,1}:
 "a"
 "b"

julia> collect(parse_all("abc", Repeat(p".", 2, 3)))
2-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]

julia> parse_one("abc", Repeat(p".", 2; flatten=false))
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> collect(parse_all("abc", Repeat(p".", 0, 3)))
4-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]
 Any["a"]
 Any[]

julia> collect(parse_all("abc", Repeat(p".", 0, 3; greedy=false)))
4-element Array{Any,1}:
 Any[]
 Any["a"]
 Any["a","b"]
 Any["a","b","c"]

You can also use Depth() and Breadth() for greedy and non-greedy repeats directly (but Repeat() is more readable, I think).

The sugared version looks like this:

julia> parse_one("abc", p"."[1:2])
2-element Array{Any,1}:
 "a"
 "b"

julia> parse_one("abc", p"."[1:2,:?])
1-element Array{Any,1}:
 "a"

julia> parse_one("abc", p"."[1:2,:&])
2-element Array{Any,1}:
 Any["a"]
 Any["b"]

julia> parse_one("abc", p"."[1:2,:&,:?])
1-element Array{Any,1}:
 Any["a"]

Where the :? symbol is equivalent to greedy=false and :& to flatten=false (compare with + and & above).

There are also some well-known special cases:

julia> collect(parse_all("abc", Plus(p".")))
3-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]
 Any["a"]

julia> collect(parse_all("abc", Star(p".")))
4-element Array{Any,1}:
 Any["a","b","c"]
 Any["a","b"]
 Any["a"]
 Any[]

Full Match

To ensure that all the input is matched, add Eos() to the end of the grammar.

julia> parse_one("abc", Equal("abc") + Eos())
1-element Array{Any,1}:
 "abc"

julia> parse_one("abc", Equal("ab") + Eos())
ERROR: ParserCombinator.ParserException("cannot parse")

Transforms

Use App() or > to pass the current results to a function (or datatype constructor) as individual values.

julia> parse_one("abc", App(Star(p"."), tuple))
1-element Array{Any,1}:
 ("a","b","c")

julia> parse_one("abc", Star(p".") > string)
1-element Array{Any,1}:
 "abc"

The action of Appl() and |> is similar, but everything is passed as a single argument (a list).

julia> type Node children end

julia> parse_one("abc", Appl(Star(p"."), Node))
1-element Array{Any,1}:
 Node(Any["a","b","c"])

julia> parse_one("abc", Star(p".") |> x -> map(uppercase, x))
3-element Array{Any,1}:
 "A"
 "B"
 "C"

Lookahead And Negation

Sometimes you can't write a clean grammar that just consumes data: you need to check ahead to avoid something. Or you need to check ahead to make sure something works a certain way.

julia> parse_one("12c", Lookahead(p"\d") + PInt() + Dot())
2-element Array{Any,1}:
 12
   'c'

julia> parse_one("12c", Not(Lookahead(p"[a-z]")) + PInt() + Dot())
2-element Array{Any,1}:
 12
   'c'

More generally, Not() replaces any match with failure, and failure with an empty match (ie the empty list).

Other

Backtracking

By default, matchers will backtrack as necessary.

In some (unusual) cases, it is useful to disable backtracking. For example, see PCRE's "possessive" matching. This can be done here on a case-by-case basis by adding backtrack=false to Repeat(), Alternatives() and Series(), or by appending ! to the matchers that those functions generate: Depth!, Breadth!, Alt!, Seq! and And!.

For example,

collect(parse_all("123abc", Seq!(p"\d"[0:end], p"[a-z]"[0:end])))

will give just a single match, because Seq! (with trailing !) does not backtrack the Repeat() child matchers.

However, since regular expressions do not backtrack, it would have been simpler, and faster, to write the above as

collect(parse_all("123abc", p"\d+[a-z]+"))

Using backtrack=false only disables backtracking of the direct children of those matchers. To disable all backtracking, then the change must be made to all matchers in the grammar. For example, in theory, the following two grammars have different backtracking behaviour:

Series(Repeat(e"a", 0, 3), e"b"; backtrack=false)
Series(Repeat(e"a", 0, 3; backtrack=false), e"b"; backtrack=false)

(although, in practice, they are identical, in this contrived example, because e"a" doesn't backtrack anyway).

This makes a grammar more efficient, but also more specific. It can reduce the memory consumed by the parser, but does not guarantee that resources will be released - see the next section for a better approach to reducing memory use.

Controlling Memory Use

Haskell's Parsec, if I understand correctly, does not backtrack by default. More exactly, it does not allow input that has been consumed (matched) to be read again. This reduces memory consumption (at least when parsing files, since read data can be discarded), but only accepts LL(1) grammars.

To allow parsing of a wider range of grammars, Parsec then introduces the Try combinator, which enables backtracking in some (generally small) portion of the grammar.

The same approach can be used with this library, using parse_try.

file1.txt:
abcdefghijklmnopqrstuvwxyz
0123456789
open("test1.txt", "r") do io
    # this throws an execption because it requires backtracking
    parse_try(io, p"[a-z]"[0:end] + e"m" > string)
end

open("test1.txt", "r") do io
    # this (with Try(...)) works fine
    parse_try(io, Try(p"[a-z]"[0:end] + e"m" > string))
end

Without backtracking, error messages using the Error() matcher are much more useful (this is why Parsec can provide good error messages):

julia> try
         parse_try("?", Alt!(p"[a-z]", p"\d", Error("not letter or number")))
       catch x
         println(x.msg)
       end
not letter or number at (1,1)
?
^

where the (1,1) is line number and column - so this failed on the first character of the first line.

Finally, note that this is implemented at the source level, by restricting what text is visible to the matchers. Matchers that could backtrack will still make the attempt. So you should also disable backtracking in the matchers, where you do not need it, for an efficient grammar.

Spaces - Pre And Post-Fixes

The lack of a lexer can complicate the handling of whitespace when using parser combinators. This library includes the ability to add arbitrary matchers before or after named matchers in the grammar - something that can be useful for matching and discarding whitespace.

For example,

spc = Drop(Star(Space()))

@with_pre spc begin

    sum = Delayed()
    val = E"(" + spc + sum + spc + E")" | PFloat64()

    neg = Delayed()             # allow multiple negations (eg ---3)
    neg.matcher = Nullable{Matcher}(val | (E"-" + neg > Neg))

    mul = E"*" + neg
    div = E"/" + neg > Inv
    prd = neg + (mul | div)[0:end] |> Prd

    add = E"+" + prd
    sub = E"-" + prd > Neg
    sum.matcher = Nullable{Matcher}(prd + (add | sub)[0:end] |> Sum)

    all = sum + spc + Eos()

end

extends the parser given earlier to discard whitespace between numbers and symbols. The automatc addition of spc as a prefix to named matchers (those assigned to a variable: sum, val, neg, etc) means that it only needs to be added explicitly in a few places.

Locating Errors

Sometimes it is useful to report to the user where the input text is "wrong". For a recursive descent parser one useful indicator is the maximum depth reached in the source.

This can be retrieved using the Debug config. Here is a simple example that delegates to NoCache (the default confg for parse_one()):

grammar = p"\d+" + Eos()
source = "123abc"
             # make the parser task
debug, task = make(Debug, source, grammar; delegate=NoCache)
once(task)   # this does the parsing and throws an exception
             # the debug config now contains max_iter
println(source[debug.max_iter:end])   # show the error location "abc"

This is a little complex because I don't pre-define a function for this case (cf parse_one()). Please email me if you think I should (currently it's unclear what features to support directly, and what to leave for "advanced" users).

For more information see parsers.jl and debug.jl.

An alternative approach to error messages is to use parse_try() with the Error() matcher - see here.

Coding Style

Don't go reinventing regexps. The built-in regexp engine is way, way more efficient than this library could ever be. So call out to regexps liberally. The p"..." syntax makes it easy.

But don't use regular expressions if you need to backtrack what is being matched.

Drop stuff you don't need.

Transform things into containers so that your result has nice types. Look at how the example works.

Understand the format you are parsing. What motivated the person who designed the format? Compare the GML and DOT parsers - they return different results because the format authors cared about different things. GML is an elegant, general data format, while DOT is a sequential description - a program, almost - that encodes graph layouts.

Adding Matchers

First, are you sure you need to add a matcher? You can do a lot with transforms.

If you do, here are some places to get started:

Debugging

Debugging a grammar can be a frustrating experience - there are times when it really helps to have a simple view "inside" what is happening. This is supported by parse_dbg which will print a record of all messages (execute and response - see design) for matchers inside a Trace() matcher.

In addition, if the grammar is defined inside a @with_names macro, the symbols used to identify various parts of the grammar (the variable names) are displayed when appropriate.

Here's a full example (you can view less by applying Trace() to only the matchers you care about):

@with_names begin

    sum = Delayed()
    val = E"(" + sum + E")" | PFloat64()

    neg = Delayed()             # allow multiple negations (eg ---3)
    neg.matcher = val | (E"-" + neg > Neg)

    mul = E"*" + neg
    div = E"/" + neg > Inv
    prd = neg + (mul | div)[0:end] |> Prd

    add = E"+" + prd
    sub = E"-" + prd > Neg
    sum.matcher = prd + (add | sub)[0:end] |> Sum

    all = sum + Eos()
end

parse_dbg("1+2*3/4", Trace(all))

which gives:

  1:1+2*3/4    00 Trace->all
  1:1+2*3/4    01  all->sum
  1:1+2*3/4    02   Transform->Seq
  1:1+2*3/4    03    Seq->prd
  1:1+2*3/4    04     prd->Seq
  1:1+2*3/4    05      Seq->neg
  1:1+2*3/4    06       Alt->Seq
  1:1+2*3/4    07        Seq->Drop
  1:1+2*3/4    08         Drop->Equal
   :           08         Drop<-!!!
   :           07        Seq<-!!!
   :           06       Alt<-!!!
  1:1+2*3/4    06       Alt->Transform
  1:1+2*3/4    07        Transform->Pattern
  2:+2*3/4     07        Transform<-{"1"}
  2:+2*3/4     06       Alt<-{1.0}
  2:+2*3/4     05      Seq<-{1.0}
  2:+2*3/4     05      Seq->Depth
  2:+2*3/4     06       Depth->Alt
  2:+2*3/4     07        Alt->mul
  2:+2*3/4     08         mul->Drop
  2:+2*3/4     09          Drop->Equal
   :           09          Drop<-!!!
   :           08         mul<-!!!
   :           07        Alt<-!!!
  2:+2*3/4     07        Alt->div
  2:+2*3/4     08         div->Seq
  2:+2*3/4     09          Seq->Drop
  2:+2*3/4     10 Drop->Equal
   :           10 Drop<-!!!
   :           09          Seq<-!!!
   :           08         div<-!!!
   :           07        Alt<-!!!
   :           06       Depth<-!!!
  2:+2*3/4     05      Seq<-{}
  2:+2*3/4     04     prd<-{1.0}
  2:+2*3/4     03    Seq<-{Prd({1.0})}
  2:+2*3/4     03    Seq->Depth
  2:+2*3/4     04     Depth->Alt
  2:+2*3/4     05      Alt->add
  2:+2*3/4     06       add->Drop
  2:+2*3/4     07        Drop->Equal
  3:2*3/4      07        Drop<-{"+"}
  3:2*3/4      06       add<-{}
  3:2*3/4      06       add->prd
  3:2*3/4      07        prd->Seq
  3:2*3/4      08         Seq->neg
  3:2*3/4      09          Alt->Seq
  3:2*3/4      10 Seq->Drop
  3:2*3/4      11  Drop->Equal
   :           11  Drop<-!!!
   :           10 Seq<-!!!
   :           09          Alt<-!!!
  3:2*3/4      09          Alt->Transform
  3:2*3/4      10 Transform->Pattern
  4:*3/4       10 Transform<-{"2"}
  4:*3/4       09          Alt<-{2.0}
  4:*3/4       08         Seq<-{2.0}
  4:*3/4       08         Seq->Depth
  4:*3/4       09          Depth->Alt
  4:*3/4       10 Alt->mul
  4:*3/4       11  mul->Drop
  4:*3/4       12   Drop->Equal
  5:3/4        12   Drop<-{"*"}
  5:3/4        11  mul<-{}
  5:3/4        11  mul->neg
  5:3/4        12   Alt->Seq
  5:3/4        13    Seq->Drop
  5:3/4        14     Drop->Equal
   :           14     Drop<-!!!
   :           13    Seq<-!!!
   :           12   Alt<-!!!
  5:3/4        12   Alt->Transform
  5:3/4        13    Transform->Pattern
  6:/4         13    Transform<-{"3"}
  6:/4         12   Alt<-{3.0}
  6:/4         11  mul<-{3.0}
  6:/4         10 Alt<-{3.0}
  6:/4         09          Depth<-{3.0}
  6:/4         09          Depth->Alt
  6:/4         10 Alt->mul
  6:/4         11  mul->Drop
  6:/4         12   Drop->Equal
   :           12   Drop<-!!!
   :           11  mul<-!!!
   :           10 Alt<-!!!
  6:/4         10 Alt->div
  6:/4         11  div->Seq
  6:/4         12   Seq->Drop
  6:/4         13    Drop->Equal
  7:4          13    Drop<-{"/"}
  7:4          12   Seq<-{}
  7:4          12   Seq->neg
  7:4          13    Alt->Seq
  7:4          14     Seq->Drop
  7:4          15      Drop->Equal
   :           15      Drop<-!!!
   :           14     Seq<-!!!
   :           13    Alt<-!!!
  7:4          13    Alt->Transform
  7:4          14     Transform->Pattern
  8:           14     Transform<-{"4"}
  8:           13    Alt<-{4.0}
  8:           12   Seq<-{4.0}
  8:           11  div<-{4.0}
  8:           10 Alt<-{Inv(4.0)}
  8:           09          Depth<-{Inv(4.0)}
  8:           09          Depth->Alt
  8:           10 Alt->mul
  8:           11  mul->Drop
  8:           12   Drop->Equal
   :           12   Drop<-!!!
   :           11  mul<-!!!
   :           10 Alt<-!!!
  8:           10 Alt->div
  8:           11  div->Seq
  8:           12   Seq->Drop
  8:           13    Drop->Equal
   :           13    Drop<-!!!
   :           12   Seq<-!!!
   :           11  div<-!!!
   :           10 Alt<-!!!
   :           09          Depth<-!!!
  8:           08         Seq<-{3.0,Inv(4.0)}
  8:           07        prd<-{2.0,3.0,Inv(4.0)}
  8:           06       add<-{Prd({2.0,3.0,Inv(4.0)})}
  8:           05      Alt<-{Prd({2.0,3.0,Inv(4.0)})}
  8:           04     Depth<-{Prd({2.0,3.0,Inv(4.0)})}
  8:           04     Depth->Alt
  8:           05      Alt->add
  8:           06       add->Drop
  8:           07        Drop->Equal
   :           07        Drop<-!!!
   :           06       add<-!!!
   :           05      Alt<-!!!
  8:           05      Alt->sub
  8:           06       sub->Seq
  8:           07        Seq->Drop
  8:           08         Drop->Equal
   :           08         Drop<-!!!
   :           07        Seq<-!!!
   :           06       sub<-!!!
   :           05      Alt<-!!!
   :           04     Depth<-!!!
  8:           03    Seq<-{Prd({2.0,3.0,Inv(4.0)})}
  8:           02   Transform<-{Prd({1.0}),Prd({2.0,3.0,Inv(4.0)})}
  8:           01  all<-{Sum({Prd({1.0}),Prd({2.0,3.0,Inv(4.0)})})}
  8:           01  all->Eos
  8:           01  all<-{}
  8:           00 Trace<-{Sum({Prd({1.0}),Prd({2.0,3.0,Inv(4.0)})})}

Some things to note here:

Finally, printing a matcher gives a useful tree view of the grammar. Loops are elided with ...:

println(all)

gives

all
+-[1]:sum
| `-TransSuccess
|   +-Seq
|   | +-[1]:prd
|   | | +-Seq
|   | | | +-[1]:neg
|   | | | | `-Alt
|   | | | |   +-[1]:Seq
|   | | | |   | +-[1]:Drop
|   | | | |   | | `-Equal
|   | | | |   | |   `-"("
|   | | | |   | +-[2]:sum...
|   | | | |   | `-[3]:Drop
|   | | | |   |   `-Equal
|   | | | |   |     `-")"
|   | | | |   +-[2]:TransSuccess
|   | | | |   | +-Pattern
|   | | | |   | | `-r"-?(\d*\.?\d+|\d+\.\d*)([eE]\d+)?"
|   | | | |   | `-f
|   | | | |   `-[3]:TransSuccess
|   | | | |     +-Seq
|   | | | |     | +-[1]:Drop
|   | | | |     | | `-Equal
|   | | | |     | |   `-"-"
|   | | | |     | `-[2]:neg...
|   | | | |     `-f
|   | | | `-[2]:Depth
|   | | |   +-Alt
|   | | |   | +-[1]:mul
|   | | |   | | +-[1]:Drop
|   | | |   | | | `-Equal
|   | | |   | | |   `-"*"
|   | | |   | | `-[2]:neg...
|   | | |   | `-[2]:div
|   | | |   |   +-Seq
|   | | |   |   | +-[1]:Drop
|   | | |   |   | | `-Equal
|   | | |   |   | |   `-"/"
|   | | |   |   | `-[2]:neg...
|   | | |   |   `-f
|   | | |   +-lo=0
|   | | |   +-hi=9223372036854775807
|   | | |   `-flatten=true
|   | | `-f
|   | `-[2]:Depth
|   |   +-Alt
|   |   | +-[1]:add
|   |   | | +-[1]:Drop
|   |   | | | `-Equal
|   |   | | |   `-"+"
|   |   | | `-[2]:prd...
|   |   | `-[2]:sub
|   |   |   +-Seq
|   |   |   | +-[1]:Drop
|   |   |   | | `-Equal
|   |   |   | |   `-"-"
|   |   |   | `-[2]:prd...
|   |   |   `-f
|   |   +-lo=0
|   |   +-hi=9223372036854775807
|   |   `-flatten=true
|   `-f
`-[2]:Eos

Also, parse_XXX(...., debug=true) will show a strack trace from within the main parse loop (which gives more information on the source of any error).

More Information

For more details, I'm afraid your best bet is the source code:

Parsers

Graph Modelling Language

GML describes a graph using a general dict / list format (something like JSON).

Note that the parser does not conform fully to the specifications: ISO 8859-1 entities are not decoded (the parser should accept UTF 8); integers and floats are 64bit; strings can be any length; no check is made for required fields.

For example, to print node IDs and edge connections in a graph

using ParserCombinator.Parsers.GML

my_graph = "graph [
  node [id 1]
  node [id 2]
  node [id 3]
  edge [source 1 target 2]
  edge [source 2 target 3]
  edge [source 3 target 1]
]"

root = parse_dict(my_graph)

for graph in root[:graph]  # there could be multiple graphs
    for node in graph[:node]
        println("node $(node[:id])")
    end
    for edge in graph[:edge]
        println("edge $(edge[:source]) - $(edge[:target])")
    end
end

giving

node 1
node 2
node 3
edge 1 - 2
edge 2 - 3
edge 3 - 1

For further details, please read GML.jl.

DOT

DOT describes a graph using a complex format that resembles a program (with mutable state) more than a specification (see comments in source).

For example, to print node IDs and edge connections in a graph

using ParserCombinator.Parsers.DOT

my_graph = "graph {
  1 -- 2
  2 -- 3
  3 -- 1
}"

root = parse_dot(my_graph)

for node in nodes(root)
    println("node $(node)")
end
for (node1, node2) in edges(root)
    println("edge $(node1) - $(node2)")
end

giving

node 2
node 3
edge 2 - 3
edge 1 - 3
edge 1 - 2

Nodes and edges are unordered (returned as a Set). The graph specification is undirected (cf digraph {...}) and so the order of nodes in an edge is in canonical (sorted) form.

Design

For a longer discussion of the design of ParserCombinator.jl, please see this blog post, also available here.

Overview

Parser combinators were first written (afaik) in functional languages where tail calls do not consume stack. Also, packrat parsers are easiest to implement in lazy languages, since shared, cached results are "free".

Julia has neither tail recursion optimisation nor lazy evaluation.

On the other hand, tail call optimisation is not much use when you want to support backtracking or combine results from child parsers. And it is possible to implement combinators for repeated matches using iteration rather than recursion.

In short, life is complicated. Different parser features have different costs and any particular implementation choice needs to be defended with detailed analysis. Ideally we want an approach that supports features with low overhead by default, but which can be extended to accomodate more expensive features when necessary.

This library defines the grammar in a static graph, which is then "interpreted" using an explicit trampoline (described in more detail below). The main advantages are:

The main disadvantages are:

Trampoline Protocol

A matcher is invoked by a call to

execute(k::Config, m::Matcher, s::State, i) :: Message

where k must include, at a minimum, the field k.source that follows the iterator protocol when used with i. So, for example, next(k.source, i) returns the next value from the source, plus a new iter.

The initial call (ie the first time a given value of i is used, before any backtracking) will have s equal to CLEAN.

A matcher returns a Message which indicates to the trampoline how processing should continue:

The FAILURE and Success messages are processed by the trampoline and (typically, although a trampoline implementation may also use cached values) result in calls to

failure(k::Config, m::Matcher, s::State) :: Message

success(k::Config, m::Matcher, s::State, t::State, i, r::Value) :: Message

where the parent matcher (m) can do any clean-up work, resulting in a new Message.

Note that the child's state, t, is returned to the parent. It is the responsibility of the parent to save this (in its own state) if it wants to re-call the child.

Source (Input Text) Protocol

The source text is read using the standard Julia iterator protocol, extended with several methods defined in sources.jl.

The use of iterators means that Dot() returns characters, not strings. But in practice that matcher is rarely used (particularly since, with strings, you can use regular expressions - p"pattern" for example), and you can construct a string from multiple characters using > string.

Releases

1.7.0 - 2015-10-13 - Added DOT parser.

1.6.0 - 2015-07-26 - Changed from s" to e"; added support for fast regex patch.

1.5.0 - 2015-07-25 - Clarified source handling; improved GML speed.

1.4.0 - 2015-07-18 - Added GML parser; related parse_try fixes.

1.3.0 - 2015-06-27 - Added parse_try.

1.2.0 - 2015-06-28 - Trampoline side rewritten; more execution modes.

1.1.0 - 2015-06-07 - Fixed calc example; debug mode; much rewriting.

1.0.0 - ~2015-06-03 - More or less feature complete.