adrian-thurston / ragel

Ragel State Machine Compiler
MIT License
532 stars 46 forks source link

[question]: composing parsers based recursive grammars with the scanner model (RDF/Turtle) #60

Closed kortschak closed 3 years ago

kortschak commented 3 years ago

I'm working on an implementation of an RDF Turtle parser using ragel targetting Go.

The RDF/Turtle grammar is recursive with subjects and objects being defined in terms of collections of objects. The approach that I have take is to define this section of the grammar like so in ragel (complete grammar below in details):

    collection                       = '(' WS** >EnterCollection ;

    blankNodePropertyList            = '[' WS** >EnterBlankNodePropertyList ;

    subject                          = iri | BlankNode | collection ;

    object                           = iri
                                     | BlankNode
                                     | collection
                                     | blankNodePropertyList
                                     | literal
                                     ;

    objects                          = (object WS*)* ')' ;

    objectList                       = object WS* (',' WS* object)* ;
    predicateObjectList              = verb WS* objectList WS* (';' WS* comment* (verb WS* objectList)?)* ;

    blankNodeProperties              = predicateObjectList WS* ']' ;

Here the intention is to have collection and blankNodePropertyList call host language functions or make fcalls that use ragel machines parsing objects and blankNodeProperties.

Originally I was using a standard parser, but found that the code generation did not complete, being killed by the OOM killer. Changing over to a scanner model fixed that (and simplified the host support code). However, this precludes fcall as far as I can see; the scanner is defined as

    statement := |*
        WS* ;
        comment* ;
        base ;
        prefixID ;
        sparqlBase ;
        sparqlPrefix ;
        triples '.' ;
    *|;

My question is whether it is safe/correct to pass p etc to the EnterCollection and EnterBlankNodePropertyList-called host language functions to allow them to parse the objects and blankNodeProperties, and then return the ragel state vars for the main machine to update with? If so which state vars are appropriate to change and what should be done with cs and act?

(In the longer term, the intention is to have all this wrapped up into a struct to construct a streaming RDF/Turtle parser where a call to an Unmarshal method returns a single RDF statement).

Grammar (stil WIP, so actions are incomplete):

    %%{
        machine turtle;

        alphtype rune;
        include "parse_actions.rl";

        PN_CHARS_BASE                    = [A-Za-z]
                                         | 0x00c0 .. 0x00d6
                                         | 0x00d8 .. 0x00f6
                                         | 0x00f8 .. 0x02ff
                                         | 0x0370 .. 0x037d
                                         | 0x037f .. 0x1fff
                                         | 0x200c .. 0x200d
                                         | 0x2070 .. 0x218f
                                         | 0x2c00 .. 0x2fef
                                         | 0x3001 .. 0xd7ff
                                         | 0xf900 .. 0xfdcf
                                         | 0xfdf0 .. 0xfffd
                                         | 0x10000 .. 0xeffff
                                         ;

        PN_CHARS_U                       = PN_CHARS_BASE | '_' ;

        PN_CHARS                         = PN_CHARS_U
                                         | '-'
                                         | [0-9]
                                         | 0xb7
                                         | 0x0300 .. 0x036f
                                         | 0x203f .. 0x2040
                                         ;

        PERCENT                          = '%' xdigit {2} ;

        PN_LOCAL_ESC                     = '\\'
                                         ( '_'
                                         | '~'
                                         | '.' 
                                         | '-' 
                                         | '!' 
                                         | '$' 
                                         | '&' 
                                         | "'" 
                                         | '(' 
                                         | ')' 
                                         | '*' 
                                         | '+' 
                                         | ',' 
                                         | ';' 
                                         | '=' 
                                         | '/' 
                                         | '?' 
                                         | '#' 
                                         | '@' 
                                         | '%'
                                         )
                                         ;

        PLX                              = PERCENT | PN_LOCAL_ESC ;

        PN_LOCAL                         = (PN_CHARS_U | ':' | [0-9] | PLX) ((PN_CHARS | '.' | ':' | PLX)** (PN_CHARS | ':' | PLX))? ;

        PN_PREFIX                        = PN_CHARS_BASE ((PN_CHARS | '.')* PN_CHARS)? ;

        PNAME_NS                         = PN_PREFIX? ':' ;

        PNAME_LN                         = PNAME_NS PN_LOCAL ;

        BLANK_NODE_LABEL                 = (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')* PN_CHARS)? ;

        BLANK_NODE                       = '_:' BLANK_NODE_LABEL ;

        ECHAR                            = ('\\' [tbnrf"'\\]) ;

        UCHAR                            = ('\\u' xdigit {4}
                                         | '\\U' xdigit {8})
                                         ;

        WS                               = ( 0x09 | 0x0a | 0x0d | 0x20 ) ;

        STRING_LITERAL_CHAR              = (
                                           0x00 .. 0x09
                                         | 0x0b .. 0x0c
                                         | 0x0e .. '['
                                         | ']' .. 0x10ffff
                                         | ECHAR
                                         | UCHAR)
                                         ;

        STRING_LITERAL_QUOTE             = '"'
                                           (STRING_LITERAL_CHAR - '"')*
                                           '"'
                                         ;

        STRING_LITERAL_SINGLE_QUOTE      = "'"
                                           (STRING_LITERAL_CHAR - "'")*
                                           "'"
                                         ;

        LITERAL_LONG_QUOTE_CHAR          = (any - [\\"] | ECHAR | UCHAR) ;

        STRING_LITERAL_LONG_QUOTE        = '"""'
                                           (('"' | '""')? LITERAL_LONG_QUOTE_CHAR)*
                                           '"""'
                                         ;

        LITERAL_LONG_SINGLE_QUOTE_CHAR   = (any - [\\'] | ECHAR | UCHAR) ;

        STRING_LITERAL_LONG_SINGLE_QUOTE = "'''"
                                           (("'" | "''")? LITERAL_LONG_SINGLE_QUOTE_CHAR)*
                                           "'''"
                                         ;

        String                           = (
                                           STRING_LITERAL_QUOTE
                                         | STRING_LITERAL_SINGLE_QUOTE
                                         | STRING_LITERAL_LONG_SINGLE_QUOTE
                                         | STRING_LITERAL_LONG_QUOTE
                                         )
                                         ;

        IRI                              = (
                                           '!' .. ';'
                                         | '='
                                         | '?' .. '['
                                         | ']'
                                         | '_'
                                         | 'a' .. 'z'
                                         | '~'
                                         | 0x80 .. 0x10ffff
                                         | UCHAR)*
                                         ;

        INTEGER                          = [+\-]? [0-9]+ ;
        DECIMAL                          = [+\-]? [0-9]* '.' [0-9]+ ;
        EXPONENT                         = [eE] [+\-]? [0-9]+ ;
        DOUBLE                           = [+\-]? ([0-9]+ '.' [0-9]* EXPONENT
                                         | '.' [0-9]+ EXPONENT
                                         | [0-9]+ EXPONENT)
                                         ;

        IRIREF                           = '<' IRI '>' ;
        PrefixedName                     = PNAME_LN | PNAME_NS ;
        iri                              = IRIREF | PrefixedName ;

        LANGTAG                          = '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)* ;
        NumericLiteral                   = INTEGER | DECIMAL | DOUBLE ;
        RDFLiteral                       = String (LANGTAG | '^^' iri)? ;
        BooleanLiteral                   = 'true' | 'false' ;
        literal                          = RDFLiteral | NumericLiteral | BooleanLiteral ;

        comment                          = '#' >StartComment (any - [\n\r])* %EndComment [\n\r] WS* ;

        ANON                             = '[' WS* ']';
        BlankNode                        = BLANK_NODE_LABEL | ANON ;

        predicate                        = iri ;
        verb                             = ( predicate | 'a' ) ;

        # recursive mess -- implement these with actions that call 
        collection                       = '(' WS** >EnterCollection ;

        blankNodePropertyList            = '[' WS** >EnterBlankNodePropertyList ;

        subject                          = iri | BlankNode | collection ;

        object                           = iri
                                         | BlankNode
                                         | collection
                                         | blankNodePropertyList
                                         | literal
                                         ;

        objects                          = (object WS*)* ')' ;

        objectList                       = object WS* (',' WS* object)* ;
        predicateObjectList              = verb WS* objectList WS* (';' WS* comment* (verb WS* objectList)?)* ;

        blankNodeProperties              = predicateObjectList WS* ']' ;
        # end recursive mess

        triples                          = subject
                                           WS*
                                         ( predicateObjectList
                                         | blankNodePropertyList
                                         ) WS*
                                           predicateObjectList?
                                         ;

        base                             = '@base' WS* IRIREF >StartBase %EndBase WS* '.' ;
        prefixID                         = '@prefix' WS+ PNAME_NS >StartPrefixID %EndPrefixID WS* IRIREF >StartPrefixIRI %EndPrefixIRI WS* '.' ;

        sparqlBase                       = 'base'i WS* IRIREF >StartBase %EndBase ;
        sparqlPrefix                     = 'prefix'i WS+ PNAME_NS >StartPrefixID %EndPrefixID WS* IRIREF >StartPrefixIRI %EndPrefixIRI ;
    }%%

Machine:

%%{
    machine turtle;

    include "turtle.rl";

    statement := |*
        WS* ;
        comment* ;
        base ;
        prefixID ;
        sparqlBase ;
        sparqlPrefix ;
        triples '.' ;
    *|;

    write data;
}%%

adrian-thurston commented 3 years ago

Hmmm, I wouldn't recommend adding recursion inside of a scanner. It's okay to jump to a recursive thing, then re-enter the scanner from the start, but returning back to a scanner is trouble. Maybe you want scanning pass that is sending tokens to a parser (ragel or otherwise)?

kortschak commented 3 years ago

Yes, that makes sense. The model I was thinking of using is essentially a stack of scanner parsers, but it gets complicated fairly quickly and you need to break the ragel abstraction in ways that aren't intended to be done.

Would it make sense to put this simple, line-base parts of the grammar into the initial scanner (all the parts that aren't triples) and tokenise the triples handing the tokens to the second pass? Given that triples are terminated by '.' and this appears nowhere else except within other tokens (and so is distinguishable) a complete triples statement set can be handed to the second pass. Or will that result in an ambiguity by matching '.' anywhere?

adrian-thurston commented 3 years ago

@kortschak You could do that. I'ts up to you, but personally I would use a single paradigm across the entire grammar though. Are you going for speed?

kortschak commented 3 years ago

Thanks. Yes, I'm going for speed; the input streams tend to be long in the domain that I'm targetting. When you say a single paradigm do you mean handling all of the scanner's parts the same way?

adrian-thurston commented 3 years ago

Yeah, I would go for a uniform approach to scanning and a separate uniform approach to parsing. Only if it's absolutely necessary would I deviate from that approach. But that's just me! More than one way to solve problems in this space, which is part of why it's fun!

kortschak commented 3 years ago

Thanks for your help. I plan to attack this more fully in the coming weeks. I may ask more.

adrian-thurston commented 3 years ago

No problem, will close this issue. Fee free to open another. I have my reputation with google restored, but haven't re-enabled the mailing list yet. Need configure the list to reject the messages that were getting bounced and then reported as spam.