Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Some suggestions for the filter grammar #58

Closed rartino closed 5 years ago

rartino commented 5 years ago

First, a big thank you to (I think primarily) @sauliusg for the work with the formal grammar for the filtering language. Now, that I have gone through actually implementing these filters based on the grammar, I have a few thoughts.

I realize it may at first seem a quite major thing to propose changes to the filtering grammar at this point. However, I stress that none of the changes proposed below actually changes any behavior of presently working API implementations. These aren't changes to how API implementations should interpret filtering strings; they are changes in how to best define the present behavior in a way that is as unambiguous, consistent, and as useful for implementors as possible.

The 'filter=' keyword

I propose 'filter=' should not formally be part of the filtering language grammar at all. To me, 'filter=' is the delivery mechanism of the filter in the URL query, which is outside the filtering language itself. My two primary motivations:

Hence, my proposal is to change the top of the grammar tree to:

OPTIMaDeFilter = Expression ; 

and completely remove <Filter> and <Keyword> from the grammar.

Spaces

Is this way of handling whitespace with [Spaces] a good idea? Is any other 'standard' published EBNF using anything resembling this? It seems rare when looking around. The Wikipedia article says "Whitespaces and comments are typically ignored in EBNF grammars".

Isn't the common way to deal with this rather to defer whitespace handling to the tokenizer (unless whitespace really is an integral part of the syntax of a language)? I believe we could just say something in the specification along the lines of: Except for strings, tokens do not span whitespace. All other whitespace (space, tab and newline) should be discarded during tokenizing. As the specification presently stands, in my implementation that uses a lexer that handles whitespace "in the normal way", I'm very tempted to fetch the official grammar and then just do: grammar = grammar.replace("[Spaces]", ""). Is that what we want implementors to do?
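To illustrate, here is a minimal sketch (not from the specification; the token names and operator set are my own simplification, and keywords like AND/OR/NOT are ignored) of a lexer that discards whitespace during tokenizing rather than encoding [Spaces] in the grammar:

```python
import re

# Simplified, hypothetical token set; whitespace is matched as SKIP
# and then dropped, so the grammar itself never mentions spaces.
TOKEN_RES = [
    ("STRING",     r'"([^\\"]|\\.)*"'),
    ("NUMBER",     r'[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?'),
    ("IDENTIFIER", r'[a-zA-Z_][a-zA-Z_0-9]*'),
    ("OPERATOR",   r'<=|>=|!=|=|<|>'),
    ("SKIP",       r'[ \t\n]+'),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_RES))

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if m.lastgroup != "SKIP":  # whitespace is discarded here
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this, only the STRING token can span whitespace, exactly as proposed above.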

The non-standard definition of UnicodeHighChar

I think it is not such a good idea to insert Grammatica-specific syntax in the middle of the otherwise standard EBNF. IF we could create a completely resolved standard EBNF I would be all for including that in the specification, but I do not see how it can be done given the choice of allowing arbitrary Unicode in strings.

That means we need to go to some "non-standard EBNF" way of defining the <String> token anyway. And since that must be done, I suggest splitting the present EBNF into two machine-separable parts. One would be the formal standard EBNF grammar of non-terminals. The other would define all the suggested tokens to use in the lexer in a format useful for implementors; I propose POSIX-Extended Regular Expressions.

Below follow the POSIX-Extended Regular Expression token definitions I presently use in my implementation, which I believe are equivalent to the present specification and which I suggest we incorporate into the specification:

  Identifier: [a-zA-Z_][a-zA-Z_0-9]*
  String: "([^\"]|\\.)*"
  Number: (\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)((e|E)(\+|-)?[0-9]+)?

(Note that due to differences in escaping backslashes between regex flavors, the String one is edited from what I use in Python; the one above should be right for POSIX ERE.) These definitions then technically obsolete all of <UnicodeHighChar>, <EscapedChar>, <UnescapedChar>, <Punctuator>, <Exponent>, <Sign>, <Digits>, <Digit>, which would be removed from the formal standard EBNF grammar part of the specification.
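For reference, the same three definitions can be sanity-checked quickly in Python (Python's re is not strict POSIX ERE, but these particular patterns mean the same thing in both; note [^\\"] is written with a doubled backslash because Python, unlike ERE bracket expressions, treats \ as an escape there):

```python
import re

# Python renderings of the proposed ERE token definitions.
IDENTIFIER = re.compile(r'[a-zA-Z_][a-zA-Z_0-9]*')
STRING     = re.compile(r'"([^\\"]|\\.)*"')
NUMBER     = re.compile(r'(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)((e|E)(\+|-)?[0-9]+)?')

def is_token(regex, text):
    # fullmatch: the whole text must be a single token
    return regex.fullmatch(text) is not None
```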

But I'm certainly not opposed to ALSO including a 'grammatica' definition of the tokens, which would be the present EBNF-like version of those definitions with the Grammatica extension.

EDIT 2018-03-22: (To keep everything up here, I've appended another issue)

Allowing value=value, value=identifier, and identifier=identifier

Arguably, the most commonly expected construct in what is meant to be a somewhat straightforward filtering language is of the form identifier <operator> value. But the grammar also explicitly allows the following constructs: value <operator> value, value <operator> identifier, and identifier <operator> identifier. As I try to implement the handling of these, I run into some difficulties, because OPTIMaDe doesn't (yet) properly define types for its fields. (I brought that up at the last CECAM meeting, but I couldn't find an issue filed for it; I need to look more, and if it is not there, file an issue.)

But, the one that truly baffles me how to implement correctly is this one:

So, in summary: going forward we absolutely need to think about the typing system for OPTIMaDe. In my opinion we need a type system where types (including the semantics for comparisons) are clearly derivable from the value expression. Then one can confidently either carry out a comparison or throw a type error if the types do not match. With that, value <operator> value and identifier <operator> identifier become well-defined. This means that if we want to keep the particular comparison semantics for, e.g., the elements property, we need to define that as a "set" type and give it a form of expression that is recognizable, preferably down at the token level.
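As a sketch of what such type derivation could look like (entirely hypothetical; a per-database field-type table is not part of the current specification), a comparison would first resolve each operand to a type and fail loudly on a mismatch:

```python
# Hypothetical type check: literals carry their type in their token kind,
# identifiers are looked up in a (hypothetical) field-type table.
def check_comparison(lhs, rhs, field_types):
    def operand_type(kind, text):
        if kind == "NUMBER":
            return "number"
        if kind == "STRING":
            return "string"
        return field_types[text]  # IDENTIFIER
    t1, t2 = operand_type(*lhs), operand_type(*rhs)
    if t1 != t2:
        raise TypeError(f"cannot compare {t1} with {t2}")
    return t1
```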

Until we have sorted that out, would it be better to disallow all other forms than identifier <operator> value?

sauliusg commented 5 years ago

Hi,

thanks, Rickard, for the suggestion.

On 2019-03-20 12:13, Rickard Armiento wrote:

I realize it may at first seem a quite major thing to propose changes to the filtering grammar at this point. However, I stress that none of the changes proposed below actually changes any behavior of presently working API implementations. These aren't changes to how API implementations should interpret filtering strings; they are changes in how to best define the present behavior in a way that is as unambiguous, consistent, and as useful for implementors as possible.

  The 'filter=' keyword

I propose 'filter=' should not formally be part of the filtering language grammar at all. To me, 'filter=' is the delivery mechanism of the filter in the URL query, which is outside the filtering language itself. My two primary motivations:

*

I now want to refer to 'OPTIMaDe filter strings' in various
contexts, not just as delivered in the query API. It seems awkward
to keep 'filter=' as a prefix in those contexts when it has no
function there, and it seems equally awkward to talk about, e.g.,
"the standard OPTIMaDe grammar but starting from the <Expression> node".

Agreed; the grammar might be useful in a broader context, and we should strive for that.

*

It introduces an obstructive keyword that may be in the way of relevant queries. It is fairly easy to understand that you cannot name your Identifiers AND, OR, or NOT, but 'filter=' will down the line surely get in the way of someone's attempt to query on a field named 'filter', with potentially weird error messages (since if an Expression contains "filter=test" it will be tokenized as (Keyword:'filter=', Identifier:'test') instead of (Identifier:'filter', Operator:'=', Identifier:'test') as expected).

Hence, my proposal is to change the top of the grammar tree to:

OPTIMaDeFilter = Expression ;

and completely remove <Filter> and <Keyword> from the grammar.

Your suggestion is sound. It would actually be more compliant with the query string syntax which specifies

"The query string is composed of a series of field-value pairs. Within each pair, the field name and value are separated by an equals sign, "=". The series of pairs is separated by the ampersand, "&"" (https://en.wikipedia.org/wiki/Query_string#Web_forms).

Unfortunately, I could not quickly find a relevant RFC for this (https://tools.ietf.org/html/rfc1738 seems to be silent on this point). If you find a Web standard for the QS description, please forward me a link.

The only worry is to make sure that the query string is unambiguously parsed. Also, I would not like to require the basic syntax to be URL-encoded, since this ends up in human-unreadable strings. When this is unavoidable, then yes, but the most common queries should be presentable without the encoding. Since we use '=' in the grammar as an 'equals' token, the first instance of '=' in the 'filter=' token is special. I believe this was the main reason to include it in the grammar.

Let's experiment a bit with the grammar; please give me a couple of days (I'm very busy at the moment with my grant); it seems to me that your suggestion should work.

Regards, Saulius

-- Dr. Saulius Gražulis Vilnius University Institute of Biotechnology, Saulėtekio al. 7 LT-10257 Vilnius, Lietuva (Lithuania) fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353 mobile: (+370-684)-49802, (+370-614)-36366

sauliusg commented 5 years ago

On 2019-03-20 12:13, Rickard Armiento wrote:

Below follows the POSIX-Extended Regular Expressions token definitions I presently use in my implementation, which I believe are equivalent with the present specification. I suggest we incorporate in the specification:

  Identifier: [a-zA-Z_][a-zA-Z_0-9]*
  String: "([^\"]|\\.)*"
  Number: (\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)((e|E)(\+|-)?[0-9]+)?

The numbers are in the specification as separate strings.

(Note that due to differences in escaping backslash in regex flavors, the String one is edited from what I use in Python; the one above should be right for POSIX ERE.)

Tests? :)

These definitions then technically obsolete all of <UnicodeHighChar>, <EscapedChar>, <UnescapedChar>, <Punctuator>, <Exponent>, <Sign>, <Digits>, <Digit>, which would be removed from the formal standard EBNF grammar part of the specification.

I am in favour of using as many standard features as possible. POSIX seems a good choice. However:

Two problems:

1) The EBNF standard does not specify means to describe high characters or tokens; the only means I found to represent these Unicode characters is: "e) Extensions. A user may wish to extend Extended BNF. A special-sequence is provided for this purpose, the format and meaning of which are not defined in the standard except to ensure that the start and end of an extension can always be seen easily." (p. 8, ISO/IEC 14977, First edition 1996). I currently use Grammatica codes to ease parser generation.

2) POSIX ERE, although a standard in its own right, has no connection to EBNF. That is, we cannot say in EBNF "for this non-terminal, use ERE". All grammar rules need to go down to terminals, or characters.

Thus, in order to have EBNF that can be parsed and checked without manual intervention, we need all the above-mentioned rules for <Digits>, <Digit> and so on, so that real numbers are defined in terms of digits and other terminal characters, not in terms of REs.

This is not to say that a production parser must use such a description; a standard split into an RE-based lexer and an LALR(k)-based or similar syntax parser is for sure appropriate. This is, however, a technical detail of implementation and should not be in any way mandated by the documentation. Different tools will use different splits and specification methods (as you have said, Python has its peculiar escaping rules, Java probably different ones, Perl uses PCREs, and so on). We should strive for the grammar to describe the query language unambiguously and in a standard way; we should not, however, try to accommodate idiosyncrasies of specific tools, IMHO. Ideally, a tool should transform the EBNF grammar into a form suitable for a parser generator (as I did for Grammatica), or a checker should be available to prove that the implementation grammar and the specification EBNF are equivalent (i.e. describe the same language).

With this in mind, I would be very much against attempts to short-cut the current EBNF syntax with different kinds of grammars, such as REs. That would end up in a non-standard mess.

The only worry I see is the Grammatica-tailored 'UnicodeHighChar = ? [^\p{ASCII}] ? ' definition. On the other hand, "[^\p{ASCII}]" is nothing other than an RE; I am not sure how standard it is, but it is definitely supported by Perl, viz.:

saulius@varanas OPTiMaDe-specification/ $ echo Sąžininga Žąsis | perl -CS -nle 'while( /([^\p{ASCII}])/g ) { print $1 }'
ą
ž
Ž
ą

If it is not supported, the RE string is so simple that it is next to trivial to convert it to another form supported by a tool (e.g. to hex character range notation, [^\x00-\xFF] or similar). If you have a POSIX RE specifying this character range, please forward me a link; I will be happy to switch.

Would this be a workable approach for you?

Regards, Saulius


rartino commented 5 years ago

Thanks for the quick reply.

The numbers are in the specification as separate strings.

Oh, sorry, I had actually missed Appendix 3; that is great. If you are OK with those REs in the specification, I guess you would be OK with including REs in that appendix also for String and Identifier? I can put them in both ERE and PCRE forms.

Let's experiment a bit with the grammar; please give me a couple of days (I'm very busy at the moment with my grant); it seems to me that your suggestion should work.

No immediate urgency, but of course it is good for the specification if we can sort this out when you have the time. I do not see how removing the forced 'filter=' keyword at the start can influence any formal property of the grammar, but of course the tests should be run.

Regarding the discussion about what is allowed and not in URL query strings and what needs to be quoted, I do not think that is so relevant to the topic of grammars and parsing the filter string. Indeed, I am very happy about the original idea of designing the filtering language so that it is as readable as possible even when URL-encoded. However, as I read the specification (and the only thing that makes sense to me), this URL-encoding is just a part of the delivery mechanism of the filter string, and handling this quoting is completely outside the definition of the language. Any implementation that receives an OPTIMaDe filter string encoded inside a URL query string MUST first URL-decode that string with no regard to the syntax of the filtering language, before the parser is invoked. (For the definition of URL-decoding see, e.g., https://en.wikipedia.org/wiki/Query_string#URL_encoding ; this is mostly regulated by RFC 3986 modifying RFCs 2396 and 1738, but note that the convention that '+' in the query string decodes as a space is apparently not an RFC standard, though it is so common that it must be handled.)
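A minimal Python sketch of that decode-first rule (the encoded filter value is a made-up example):

```python
from urllib.parse import unquote_plus

# The raw query value is URL-decoded in full, with no regard to the
# filter syntax; only then would the parser be invoked on the result.
raw_value = "elements+HAS+%22Al%22"  # hypothetical encoded filter
decoded = unquote_plus(raw_value)    # '+' -> space, %XX -> literal char
```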

Thus, in order to have EBNF that can be parsed and checked without manual intervention, we need all the above-mentioned rules for <Digits>, <Digit> and so on, so that real numbers are defined in terms of digits and other terminal characters, not in terms of REs.

With this in mind, I would be very much against attempts to short-cut the current EBNF syntax with different kinds of grammars, such as REs. That would end up in a non-standard mess.

But my point above was that with the present definition of the language this is not possible due to the Unicode strings. Whatever we do, we must invoke a "non-standard extension of EBNF" for that. The version in the specification right now is non-standard; it uses a 'grammatica'-syntax-extended EBNF which cannot be understood by other parsers.

Here I also made the point that if we allow defining UnicodeHighChar as non-standard EBNF using an RE, then we can just as well allow ourselves to just use the REs for all tokens that a normal implementation would implement using regular expressions and cut down the grammar there. However, I accept pushback on that --- I'm not strongly opposed to keeping the EBNF breakdown into single digits and characters. But there remains the question of how exactly to represent UnicodeHighChar.

If it is not supported, the RE string is so simple that it is next to  trivial to convert it to another form supported by a tool (e.g. to hex  character range notation, [^\x00-\xFF] or similar. If you have a POSIX  RE specifying this character range, please forward me a link, I will be  happy to switch.

As far as I know, one actually cannot do exactly what you want with POSIX EREs. I do not think the bracket range operator for arbitrary character codes is supported by any carefully established standard for REs (unless you consider "Perl REs" acceptable). Here is the best link I've found that summarizes what is allowed in POSIX EREs: https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions

So, in POSIX EREs you have to change how your tokens are defined, such that you can use the generic dot '.' to match Unicode characters, e.g., as in my RE for <String>: "([^\"]|\\.)*"

If one absolutely must separate out something like a UnicodeHighChar, perhaps it can be constructed out of roughly "not in any of the POSIX character classes", i.e., something like:

"[^[:alnum:][:punct:][:space:][:cntrl:]]"

But I'm not sure whether this could still match something below \xFF. It sure would have been convenient if the [:ascii:] character class were part of the standard...
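As a quick check of the dot-based approach (in Python's re rather than strict POSIX ERE), the <String> pattern above indeed accepts non-ASCII content without any dedicated UnicodeHighChar token:

```python
import re

# Python form of the proposed <String> ERE; the negated bracket class
# places no upper bound on character codes, so Unicode content matches.
STRING = re.compile(r'"([^\\"]|\\.)*"')
```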

sauliusg commented 5 years ago

On 2019-03-20 15:27, Rickard Armiento wrote:

is invoked. (For the definition of URL-decoding see, e.g., https://en.wikipedia.org/wiki/Query_string#URL_encoding ; this is mostly regulated by RFC 3986 modifying RFCs 2396 and 1738, but note that the convention that '+' in the query string decodes as a space is apparently not an RFC standard, though it is so common that it must be handled.)

Thanks, I'll have a look.

In any case, it seems that we can split the query string (QS) on '&' characters, then on the first '=' character in each 'name=value' part, and then use 'value' even if it contains further '=' characters.

If no objections can be found to such processing, let's change the specification to this method. I will then remove 'filter=' from the grammar of the query language.
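The proposed splitting rule is simple to state in code (a sketch; the function name and example parameters are illustrative):

```python
# Split the query string on '&', then each pair on the FIRST '=' only,
# so any further '=' characters stay inside the value untouched.
def parse_query_string(qs):
    pairs = {}
    for part in qs.split("&"):
        name, _, value = part.partition("=")
        pairs[name] = value
    return pairs
```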

Regards, Saulius


sauliusg commented 5 years ago

Hi,

On 2019-03-20 15:27, Rickard Armiento wrote:

But my point above was that with the present definition of the language this is not possible due to the Unicode strings. Whatever we do, we must invoke a "non-standard extension of EBNF" for that. The version in the specification right now is non-standard; it uses a 'grammatica'-syntax-extended EBNF which cannot be understood by other parsers.

Well, not quite so.

First, the RE used for high Unicode chars is not a specific part of Grammatica; it's a part of PCRE. PCREs are so widely supported that they can be regarded as a (budding) standard.

The reason why Grammatica is mentioned there at all is that Grammatica (Java?) supports the [^\p{ASCII}] notation but not the [^\x00-\xFF] notation at the moment. For all other languages I know and use, both expressions work. Probably for Python as well.

Thus, [^\p{ASCII}] seems to be the least common denominator supported by most tools. But, as said, I am happy with using any other RE to denote this character range, and I am prepared to do extra conversion to adapt this symbol to the tools I use.

But, since POSIX REs do not seem to contain satisfactory means to define character ranges (except for the pre-defined character classes, which are too narrow, I think), we will need to use some ad-hoc means. PCREs are supported and many people know them. But we could write something totally abstract like:

UnicodeHighChar = ? Unicode-character-outside-original-ASCII-range ? ;

and explain in human terms what this definition means. Then every tool will have to convert 'Unicode-character-outside-original-ASCII-range' into something appropriate for that tool. What we need is actually not an RE but a character set definition.

Here I also made the point that if we allow defining UnicodeHighChar as non-standard EBNF using an RE, then we can just as well allow ourselves to just use the REs for all tokens that a normal implementation would implement using regular expressions and cut down the grammar there.

Oh, this is quite a different thing!

The characters are terminals in EBNF, and that is why they are not specified in EBNF – this is something external to the grammar. Mathematically, an alphabet is just a finite set (of characters), and no internal structure or representation of these characters is relevant for describing a grammar.

This is good, because Unicode was not so widespread in 1996 when the standard was written :), but we can still easily include these characters in our grammar.

But the 'Identifier' or 'Number' tokens are something totally different – they are in fact defined in the grammar. It just happens that this sub-language is a regular language, and the part of the grammar that defines it defines a regular language – this is the reason why you can encode them in REs as well :). But this fact does not make them any less a part of the formal definition in EBNF.

However, I accept pushback on that --- I'm not strongly opposed to keeping the EBNF breakdown into single digits and characters. But there remains the question of how exactly to represent UnicodeHighChar.

So, let's:

a) keep the full grammar definition available in EBNF

b) provide regular sublanguage definitions (numbers, identifiers) also in other useful notations – ERE and PCRE, provided they are equivalent to the EBNF definition :);

c) decide how we announce Unicode characters outside the ASCII range. Note that this description is primarily intended for human documentation. However, automatic transformation into a form usable by tools is highly desirable.

I would say that the current grammar and UnicodeHighChar definition satisfy all 3 points a)–c); but if you think the current [^\p{ASCII}] notation for UnicodeHighChar is too obscure, let's think about other possibilities, as I suggested above.

Regards, Saulius


rartino commented 5 years ago

The characters are terminals in EBNF, and that is why they are not specified in EBNF – this is something external to the grammar. Mathematically, an alphabet is just a finite set (of characters), and no internal structure or representation of these characters is relevant for describing a grammar.

I now see your point and agree.

I would say that the current grammar and UnicodeHighChar definition satisfy all 3 points a)–c); but if you think the current [^\p{ASCII}] notation for UnicodeHighChar is too obscure, let's think about other possibilities, as I suggested above.

So, the only issue I have on this matter is that [^\p{ASCII}] isn't very standard. I guess it originates from PCRE? In particular, Python's standard 're' module does not accept that syntax. Worse, it doesn't give an error; it just parses the expression as NOT any of the chars in the bracket, so it is easy for someone to implement the wrong thing.

But I'm struggling to come up with something perfect. EITHER one ends up with an RE that differs between POSIX EREs and more common implementations because of the difference in whether \ is an escape char inside brackets, OR one ends up with special notations for character classes that also differ between RE implementations.

After looking around, I'm landing on [^\x00-\xFF] being the most universally supported common syntax, despite not being POSIX ERE. Does that syntax also work with Grammatica?
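For what it is worth, a quick check in Python's re (one of the implementations where [^\p{ASCII}] silently misbehaves) shows [^\x00-\xFF] doing what is intended:

```python
import re

# Matches any single character with code point above U+00FF.
HIGH = re.compile(r'[^\x00-\xff]')
```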

rartino commented 5 years ago

I ran into another issue in my implementation. To keep everything collected I appended it to the top post.

rartino commented 5 years ago

After having started on a MongoDB implementation, I'll just add a note that at least MongoDB versions < 3.6 do not seem to have a natural way of handling comparisons of type value=value, and even identifier=identifier is tricky (it requires an aggregate). But from 3.6 on this seems to be possible with {'$expr': {op: ['$field1', '$field2']}}.

But maybe this is a sign that in a query language meant to be very broadly implementable in different backends, one shouldn't expect other alternatives than identifier=value to be supported.
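For concreteness, the >= 3.6 form mentioned above can be generated like this (a sketch; the function name and field names are hypothetical):

```python
# identifier <op> identifier needs MongoDB's $expr (version >= 3.6):
# both operands are field references within the same document.
def fields_compare(field1, mongo_op, field2):
    return {"$expr": {mongo_op: [f"${field1}", f"${field2}"]}}
```

For example, fields_compare("nelements", "$eq", "nsites") builds the filter document comparing two fields of each structure entry.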

sauliusg commented 5 years ago

After looking around, I'm landing at [^\x00-\xFF] being the most universally supported common syntax, despite not being POSIX ERE. Does that syntax also work with grammatica?

I have right now checked the current OPTiMaDe specification with your suggested RE [^\x00-\xFF], and the tests pass! This means Grammatica (and Java) does support this notation in the end. Thus, you are right, this notation is universally supported, and we can go for it.

Above I have commented that the [^\x00-\xFF] range was not working in Grammatica; however, today (as of 2019-05-17) I can no longer reproduce it. The grammar with this range seems to work fine, and Unicode characters are recognised as such. Either it was my mistake (although I believe that I have genuinely tested it, once I have written that report); or I had older Java version. In any case, now the problems seems to not exist (any more?).