aig-upf / tarski

Tarski - An AI Planning Modeling Framework
Apache License 2.0

Implement PDDL-like parsing #17

Closed by miquelramirez 5 years ago

miquelramirez commented 6 years ago

From the e-mail:

Parsing: a boring but essential point - parse PDDL / FSTRIPS problems into an intermediate format. I hope to get this done this week, adapting the grammar you already had in the FS+ parser. That will be our standard grammar, and we will be able to get rid of all the legacy code inherited from FD, the parser of the ASP preprocessor, etc. etc.

TODO list:

gfrances commented 6 years ago

I'm making progress on this. It turns out, however, that I spent a good half a day debugging the ANTLR grammar that we have (well, my version of it). The problem I had was that the lexer was not able to recognize a simple add effect, (at) for instance. After much testing (let me say the debugging tools for ANTLR grammars are not too good...), I found out that the problem was caused by the several 'at' string literals that appear in grammar rules related to durative actions, etc.

With the original grammar, I feed (at) to e.g. the effect rule, and get something like:

mismatched input 'at' expecting NAME

where

NAME:    LETTER ANY_CHAR* ;
fragment ANY_CHAR: '0'..'9' | 'A'..'Z' | '_' | '-';

and where NAME is actually the first "TOKEN rule" in the grammar file. So I assumed that (at) would be tokenized as the sequence of tokens:

'(' NAME ')'

with the NAME token carrying the text at, but it wasn't. Once I replaced the 'at' string literals in the rules that I mentioned above with some dummy 'XXX' literal, however, everything worked.

This is quite baffling, as I was assuming that the lexer tokenizes the whole input stream before the parser has anything to say at all? Perhaps the parser is silently converting all string literals in the rules into tokens? Or, similarly, preventing those string literals in the rules from being recognized as other tokens??

Any light you can shed out of your wisdom, @miquelramirez ??

BTW I discovered that ANTLR lets you compose grammars (see "Importing Grammars" here). I'm thinking it might be a good option to have a small set of grammars, e.g. a base one for basic FSTRIPS, and then others adding extra features? It would be great in any case to be able to experiment with new language constructs without messing with the core grammar...

gfrances commented 6 years ago

I think the problem is related to this FAQ entry, although the entry is a bit unclear.

gfrances commented 6 years ago

Ok, finally found it, my intuition was kind of correct. Copy-pasting from Sam Harwell, one of the core contributors of ANTLR:

In ANTLR, a single token has exactly one token type. By using a string literal in a parser rule, you have implicitly defined a token type (anonymous tokens in this case, since no lexer rule matches the specific literals).

This suggests to me that we should try to avoid using literals in parser rules (ATM the grammar is full of them). But I'm not sure this will avoid the problem: if we have lexer rules:

NAME:    LETTER ANY_CHAR* ;
AT_KEYWORD: 'at' ;

we'll still have the same precedence problem: at the lexer level, all 'at's will go out as NAMEs; and if we invert the order of the rules, valid PDDL predicates named at will go out as AT_KEYWORDs instead of NAMEs, right?
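
For the record, here is a minimal self-contained grammar (made-up rule names, not our actual one) that reproduces the original clash: the 'at' literal inside timedEffect implicitly defines an anonymous token, and ANTLR places such implicit rules before the explicit lexer rules, so the input at can never be tokenized as a NAME:

grammar AtClash;                             // hypothetical toy grammar, not ours

effect      : '(' NAME ')' ;                 // feeding "(at)" here fails with
                                             // "mismatched input 'at' expecting NAME"
timedEffect : '(' 'at' NUMBER effect ')' ;   // the 'at' literal creates the implicit token

NAME   : LETTER ANY_CHAR* ;
NUMBER : ('0'..'9')+ ;
fragment LETTER   : 'a'..'z' | 'A'..'Z' ;
fragment ANY_CHAR : '0'..'9' | 'A'..'Z' | 'a'..'z' | '_' | '-' ;
WS     : (' ' | '\t' | '\r' | '\n')+ -> skip ;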

gfrances commented 6 years ago

(Note that Sam Harwell's answer provides a way out of this by using semantic predicates, but it's quite an ugly solution...)

gfrances commented 6 years ago

There's a better solution, described in Ch. 5 of The Definitive ANTLR 4 Reference, 2nd Edition (paywalled). C&P:

Treating Keywords As Identifiers

Lots of languages, both old and new, allow keywords as identifiers, depending on the context. In Fortran, we could say things like end = goto + if/while. C# supports SQL queries with its Language-Integrated Query (LINQ) feature. Queries begin with keyword from, but we can also use from as a variable: x = from + where;. That’s unambiguously an expression, not a query syntactically, so the lexer should treat from as an identifier, not a keyword. The problem is that the lexer doesn’t parse the input. It doesn’t know whether to send, say, KEYWORD_FROM or ID to the parser. There are two approaches to allowing keywords to act as identifiers in some syntactic contexts. The first approach has the lexer pass all keywords to the parser as keyword token types, and then we create a parser id rule that matches ID and any of the keywords. The second approach has the lexer pass keywords as identifiers, and then we use predicates to test identifier names in the parser like this:

   keyIF : {_input.LT(1).getText().equals("if")}? ID ;

That’s pretty ugly, and likely slow, so we’ll stick with the first approach [...] To illustrate the recommended approach, here’s a simple grammar that matches nutty statements like if if then call call;:

prog: stat+ ;

stat: 'if' expr 'then' stat | 'call' id ';' | ';' ;

expr: id ;

id : 'if' | 'call' | 'then' | ID ;

ID : [a-z]+ ; WS : [ \r\n]+ -> skip ;



This would pretty much solve our issue, at the cost of having to revise and rewrite parts of the grammar. That seems like a good weekend plan to me :-) But I'd tackle a core of the language first and then progressively expand towards more expressive constructs, etc.
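
As a rough sketch of how that pattern could map onto our 'at' problem (rule names here are made up, this is not the actual grammar), something along these lines:

name          : 'at' | NAME ;                          // keyword token allowed as a plain name
atomicFormula : '(' name term* ')' ;                   // so (at ?x ?y) parses as a regular atom
timedLiteral  : '(' 'at' NUMBER atomicFormula ')' ;    // while 'at' still works as the keyword
                                                       // (term, NUMBER assumed defined elsewhere)

The lexer still emits the implicit 'at' token, but the parser accepts it wherever a name is expected, which is exactly what the book's id rule does for 'if' and 'call'.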
miquelramirez commented 6 years ago

Hello @gfrances,

at, as you discovered already, is a reserved keyword in PDDL 2.1. There's no provision for the kind of "context-sensitive" identifier vs. keyword feature you think would be good to have; Tarski would be the very first PDDL-like front end supporting it. Looks like you had some "fun" researching this.

I am not sure this is a "great" feature if it allows constructs like this:

if if then call call;

or its PDDL counterpart

(assign (assign assign) assign)

that is just plain bad language design imho.

The reason for having all that stuff we don't use in the grammar... well, at some point I wanted to provide some sort of reasonable compatibility with the quite vast corpus of legacy models you can find in the IPC competitions. Do we need to do that now? No, not anymore.

As a general rule, if there's something in the grammar which doesn't have a callback associated with it, then chances are that it is some part of PDDL 3.1 we incorporated which is a candidate for removal. So excising the stuff we're not using at the moment (like the timed literals, the trajectory constraints, etc.) would, I think, "solve" the problem, which really is that somebody thought that using very generic words as reserved keywords was a good idea...

gfrances commented 6 years ago

Yup, the alternative will be to provide very clear error messages of the kind The keyword "at" cannot be used in context[...]; although that requires some work in ANTLR. I have been "hacking" (I say hacking because the documentation of the Python runtime is almost non-existent) the error handling system for testing purposes, and that should be doable. But we need to address the fact that 50% of the available PDDL examples probably have an at predicate... which conflicts with the keyword.
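
For reference, the kind of hack I mean looks roughly like this (just a sketch; the listener name is made up, and ParsingError is assumed to be our own exception, the one used in the tests below):

from antlr4.error.ErrorListener import ErrorListener

class RaisingErrorListener(ErrorListener):
    # Sketch only: convert ANTLR syntax errors into our own ParsingError so
    # callers (and tests) can assert on them; ParsingError is assumed to be in scope.
    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        raise ParsingError("line {}:{} {}".format(line, column, msg))

The listener then gets registered on the generated lexer/parser with removeErrorListeners() / addErrorListener(), which is how the Python runtime expects custom error handling to be hooked in.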

BTW the testing thing looks quite interesting. I've put some effort into that, and now we can have unit tests like the following:

import pytest

def test_domain_name_parsing():
    r = reader()

    # Test a few names expected to be valid:
    for domain_name in ["BLOCKS", "blocS-woRlD", "blocks_world"]:
        tag = "(domain {})".format(domain_name)
        _ = r.parse_string(tag, rule_names["domain"])

    # And a few ones expected to be invalid
    for domain_name in ["BL#OCKS", "@mydomain", "2ndblocksworld", "blocks2.0"]:
        tag = "(domain {})".format(domain_name)

        with pytest.raises(ParsingError):
            _ = r.parse_string(tag, rule_names["domain"])

Basically this will allow us to unit test and even write regression tests for those elements in the syntax / parsing component that have always haunted the FS parser, such as hyphenated names, upper/lower case confusions, and so on.
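
For instance, a regression test for the hyphenation / case issues could look like this (a sketch in the same style, reusing the reader() and rule_names helpers from the snippet above):

def test_hyphenated_and_mixed_case_domain_names():
    r = reader()

    # Names of this shape have historically tripped up the FS parser
    for domain_name in ["sokoban-sequential", "Depots-Time", "blocks_world-2"]:
        tag = "(domain {})".format(domain_name)
        _ = r.parse_string(tag, rule_names["domain"])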

miquelramirez commented 6 years ago

Yup, the alternative will be to provide very clear error messages of the kind The keyword "at" cannot be used in context[...];

Yes

although that requires some work in ANTLR. I have been "hacking" (I say hacking because the documentation of the Python runtime is almost non-existent) the error handling system for testing purposes, and that should be doable. But we need to address the fact that 50% of the available PDDL examples probably have an at predicate... which conflicts with the keyword.

I would suggest avoiding the notion of timed literals and timed conditions entirely. It's flawed by design: beyond "breaking" a huge number of PDDL benchmarks, it has semantic problems as well. As I said above, I put them in the grammar three years ago, back in Canberra, in the expectation that we would eventually do some temporal planning. But I don't think we need PDDL 2.1 or 2.2 to do temporal planning, to be honest.

Basically this will allow us to unit test and even write regression tests for those elements in the syntax / parsing component that have always haunted the FS parser, such as hyphenated names, upper/lower case confusions, and so on.

That is pretty great and a very useful stepping stone to ease the development of better language tools. There are literally too many papers in planning where the only reason for the research is to take a detour around a major hole in that department.

gfrances commented 6 years ago

Thanks Miquel, you're right again, let's first get rid of this 2.1/2.2 nonsense, and probably the need for fancy lexer things will disappear.

BTW, is this free_functions construct something of yours, or does it come from some PDDL version?

gfrances commented 6 years ago

Ok, it turns out that the "constraints" extension to PDDL also uses at: namely, there is an "at end" construct. I have magically converted that into "at-end". We can take it as a provisional solution, or deal with it later. I've documented it properly, but I wouldn't waste more time on this nasty issue just because of that.

miquelramirez commented 6 years ago

So many great ideas piling on top of each other like pancakes...

Getting a dash there in the middle (or an underscore, or an ampersand, or whatever except a semicolon or a colon) would be an improvement on the state of the PDDL spec...

gfrances commented 5 years ago

This has been working for a long time now. The ANTLR grammar for FSTRIPS, which includes PDDL, now supports most basic features. I suggest we close this issue and open separate issues for those particular features that are not yet supported but that we'd like to support (at the moment I can only think of derived predicates).

miquelramirez commented 5 years ago

Yep, 100% agreed
