Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Support of POST methods for passing JSON query objects #267

Closed. fekad closed this issue 4 years ago.

fekad commented 4 years ago

If there is complex code on the client side that can build a query object, it doesn't make sense to serialize it into a string only to reparse it on the server side. It would be nice to have an option to pass a query as JSON as well. Of course, in this case:

rartino commented 4 years ago

Right, supporting an alternative JSON/dictionary-based format for filters seems very useful. I'm not sure about naming the alternative parameter 'query' (if that is what you mean?); perhaps something with filter, e.g., 'filter_json' or 'filter_dict'? I'm also not sure about the POST vs. GET distinction: why not allow this parameter with either method?

A related point is that during the last physical meeting I suggested that we should agree on a standard json/dictionary representation of the filter to use internally in our codes to simplify code exchange between different efforts. As far as I know, the internal format of optimade-python-tools is a dictionary-based one, but it is also tied to the lark parser (please correct me if this has changed). My python-optimade-candidate-reference-implementation uses a more general dictionary-based format as its intermediate format. If people think it could be useful, I can write up a format specification for it as a first draft suggestion for a standardized dictionary-based format.

ml-evs commented 4 years ago

I think this sounds useful too.

Just to comment on the optimade-python-tools implementation: our filter strings are passed directly to the backend-specific filter transformers. Most of these transformers use the same lark parser under the hood, which constructs a lark.Tree object, which is the only intermediate representation we have.

I can't think of any technical reason why we couldn't also support a more general intermediate with a bit of refactoring.

As an example, here's what a Lark tree looks like:

```
>>> print(parser.parse("band_gap = 1").pretty())
filter
  expression
    expression_clause
      expression_phrase
        comparison
          property_first_comparison
            property    band_gap
            value_op_rhs
              =
              value
                number  1
```

As you can see, the rules are named according to their names in our grammar definition, and the code for transforming the tree into the backend-specific query is very simple; see e.g. our mongo transformer (if you ignore the post-processing!).
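
For illustration, here is a minimal, hedged sketch of the same idea (a toy grammar and transformer written for this example, not the actual optimade-python-tools code or the real OPTIMADE grammar):

```python
from lark import Lark, Transformer

# Toy grammar: a single comparison such as "band_gap = 1".
GRAMMAR = r"""
    filter: property OPERATOR value
    property: NAME
    value: SIGNED_NUMBER
    OPERATOR: "=" | "<" | ">"
    %import common.CNAME -> NAME
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS
"""

class MongoTransformer(Transformer):
    """Walk the lark.Tree bottom-up and emit a MongoDB-style query dict."""
    OPS = {"=": "$eq", "<": "$lt", ">": "$gt"}

    def property(self, children):
        return str(children[0])

    def value(self, children):
        return float(children[0])

    def filter(self, children):
        prop, op, value = children
        return {prop: {self.OPS[op]: value}}

parser = Lark(GRAMMAR, start="filter")
print(MongoTransformer().transform(parser.parse("band_gap = 1")))
# -> {'band_gap': {'$eq': 1.0}}
```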

Perhaps you could post an example of representation used by your implementation @rartino?

CasperWA commented 4 years ago

The addition of POST methods, especially with a focus on a well-defined JSON query structure to send, would make it very easy to implement a GraphQL version of the OPTIMADE spec; it would almost effectively already be there. Supporting this parameter in GET requests makes sense to me as well. However, the POST method would need to be supported for use with GraphQL (as far as I understand).

CasperWA commented 4 years ago

Note: There is a similar issue #114

fekad commented 4 years ago

The dictionary/JSON representation is just another way to serialize the query. In the simplest case it can represent the abstract syntax tree (AST) determined by the grammar. These two representations are equivalent and, due to the specification of the grammar, there must be a one-to-one mapping between them.

The main motivation here is to follow the common practices used in JSON REST API specifications, which can be very practical for client codes. I completely agree that the current version is more human-readable, but the JSON is more suitable for client applications.

You can find some examples below (please note that this is just how I am thinking about it...):

```
single
{'PROPERTY': {'name': 'single'}}

not single
{'NOT': {'PROPERTY': {'name': 'single'}}}

property > 23
{'GT': {'property': {'PROPERTY': {'name': 'property'}}, 'value': 23}}

string = "some string"
{'EQ': {'property': {'PROPERTY': {'name': 'string'}}, 'value': 'some string'}}

aa AND NOT bb
{'AND': [{'PROPERTY': {'name': 'aa'}}, {'NOT': {'PROPERTY': {'name': 'bb'}}}]}
```
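
To illustrate the claimed one-to-one mapping, here is a hedged sketch (using the illustrative dictionary schema above, which is not anything standardized) of serialising such an object back into a filter string with a short recursive walk:

```python
import json

def to_filter(node):
    # Each node is a single-key dict: {operator: body}.
    (op, body), = node.items()
    if op == 'PROPERTY':
        return body['name']
    if op == 'NOT':
        return f"NOT {to_filter(body)}"
    if op == 'AND':
        return ' AND '.join(to_filter(child) for child in body)
    if op in ('GT', 'EQ'):
        symbol = {'GT': '>', 'EQ': '='}[op]
        # json.dumps quotes strings and leaves numbers bare.
        return f"{to_filter(body['property'])} {symbol} {json.dumps(body['value'])}"
    raise ValueError(f"unknown operator: {op}")

query = {'AND': [{'PROPERTY': {'name': 'aa'}},
                 {'NOT': {'PROPERTY': {'name': 'bb'}}}]}
print(to_filter(query))  # -> aa AND NOT bb
```

A real version would also have to handle operator precedence and insert parentheses where needed.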

Notes:

merkys commented 4 years ago

As far as I understand, the suggestion is that a provider still MUST support string filters, but MAY also support query objects. So the universal client will still have to be ready to stringify the query objects for providers that only support string filters, right?

fekad commented 4 years ago

That's a good question. I would say each representation has its own audience (human vs. machine). It is not a big issue right now because there are only a few clients, but that might change in the near future. So I would say both of them should be labelled as MAY, and over time we will see which representation is preferred.

In many cases, after parsing the current query string you store it as a tree-like object anyway (as in optimade-python-tools), so if we can agree on the format it is quite easy to support both representations.

I would also say that the dictionary-based representation would be most useful for newcomers (client and database developers) because it is much easier to interface with. Of course, you can say that we already have parsers for the query string and they can use those, but that is just another thing they need to learn, and not everybody wants to depend on others' code...

sauliusg commented 4 years ago

Most of these transformers use the same lark parser under the hood, which constructs a lark.Tree object, which is the only intermediate representation we have.

That's definitely false for us. We use Perl, and our representation is a Perl hash/object.

Please do not assume that "all the world is Python".

sauliusg commented 4 years ago

{'AND': [{'PROPERTY': {'name': 'aa'}}, {'NOT': {'PROPERTY': {'name': 'bb'}}}]}

Looking at this string and knowing nothing about the implementation you use, it is completely unclear to me what the different sorts of braces ('[' vs. '{') mean, and why the ':' is necessary at all. As an implementation-unaware user I will use parentheses '(' for the outer level and mix '[' and '{' freely if they are available.

If you want something that is extremely easy to parse and you do not care so much about readability, then such a notation was invented in the '60s, and it is the S-expression. Why would you want to reinvent the wheel?

sauliusg commented 4 years ago

A related point is that during the last physical meeting I suggested that we should agree on a standard json/dictionary representation of the filter to use internally in our codes to simplify code exchange between different efforts. As far as I know, the internal format of optimade-python-tools is a dictionary-based one,

So this is a new standard, isn't it?

You can happily have a standardised intermediate representation of a parse tree (many compiler toolkits do), but this is not a part of the filter language, and it should not be in OPTIMADE, IMHO. It's a different standard, and I suggest that it is developed separately.

Tying things to Python might seem to make life easier for Python coders, but makes life more difficult for everyone else.

but it is also tied to the lark parser

So implementation dependent, no?

mkhorton commented 4 years ago

You can happily have a standardised intermediate representation of a parse tree (many compiler toolkits do), but this is not a part of the filter language, and it should not be in OPTIMADE, IMHO. It's a different standard, and I suggest that it is developed separately.

I would agree with this -- I actually think the proposed syntax is quite reasonable, but the whole point of having a formal Lark grammar is to provide a single source of truth, and introducing a second format (even one that is 1-to-1 equivalent) is muddying the waters. It should be down to the individual implementations to resolve the request into a format they can more easily work with, and this is why e.g. optimade-python-tools is so valuable.

If you want something that is extremely easy to parse and you do not care so much about readability, then such a notation was invented in the '60s, and it is the S-expression. Why would you want to reinvent the wheel?

I'm also a fan of S-expressions, but given that this is a REST API and JSON is the web's native interchange format, I wouldn't frame it as reinventing the wheel. That is to say, it would not just benefit Python developers, but also web developers in general, which will be a major application for this API.

sauliusg commented 4 years ago

I would also say that the dictionary-based representation would be most useful for newcomers (client and database developers) because it is much easier to interface with. Of course, you can say that we already have parsers for the query string and they can use those, but that is just another thing they need to learn, and not everybody wants to depend on others' code...

This I do not get. You will need a parser for the serialised representation of a tree, won't you? And a separate standard to specify how operators, values, and properties map to it, right? So you will call a parser of this JSON thing and get a value (object, tree, hash aka dictionary, array, etc.) out of it, as specified by the interface of the parser you use. So how is this different from parsing the current filter string?

Whether the filter parser is complicated or not is another issue, but that is completely irrelevant here. You call either parser as a black box and get the specified result out of it. How it works, you do not care at the point of use.

When you implement the parser, there might be some languages that are easier to parse and some more difficult. So you may choose to implement your own parser (if you do not want to depend on other people's code) or to use someone else's parser (if you are confident it will remain stable and reliable, and you do not want to roll your own thing). I see this as the usual trade-off you make each time you pick a library, a system, or a language.

Now, if you introduce yet another filter representation, you will have two parsers to support instead of just one, two dependencies, two pieces of code in RAM, etc... What advantage does this give you?

rartino commented 4 years ago

it should not be in OPTIMADE, IMHO. It's a different standard, and I suggest that it is developed separately.

After seeing @sauliusg's response, I think I agree with this. Those of us who see a utility in this kind of representation can work out such a standard outside the OPTIMADE specification. On my side, the main motivation is that I want to standardize an intermediate AST to be used across codes, even across different programming languages, to make "translator" algorithm exchange easier.

The day we have actual implementations adding _exmpl_ast_filter=... as an alternative to filter=, we could start discussing whether it can be included in the standard.

Upthread I mentioned that my code is organized to simplify the raw parse tree into such an intermediate format before running the translators. The benefit has been that only very small changes are needed in the translators when the grammar is updated. That format looks like this:

```
('AND',
 ('=', ('Identifier', 'elements'), ('String', '"Ga,Ti"')),
 ('OR',
  ('=', ('Identifier', 'nelements'), ('Number', '3')),
  ('=', ('Identifier', 'nelements'), ('Number', '2'))))
```

(So, yes, more like an S-expression)
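
As a toy illustration (my own sketch, not rartino's actual translator code) of why this helps: a backend translator over this tuple format only needs to know the tuple shape, not the raw grammar. A Mongo-style translation could look like:

```python
def to_mongo(node):
    """Translate the tuple AST into a MongoDB-style query dict (sketch)."""
    op = node[0]
    if op in ('AND', 'OR'):
        return {f"${op.lower()}": [to_mongo(child) for child in node[1:]]}
    if op == '=':
        (_, name), (kind, raw) = node[1], node[2]
        value = int(raw) if kind == 'Number' else raw.strip('"')
        return {name: {'$eq': value}}
    raise ValueError(f"unhandled operator: {op}")

ast = ('AND',
       ('=', ('Identifier', 'elements'), ('String', '"Ga,Ti"')),
       ('OR',
        ('=', ('Identifier', 'nelements'), ('Number', '3')),
        ('=', ('Identifier', 'nelements'), ('Number', '2'))))
print(to_mongo(ast))
# -> {'$and': [{'elements': {'$eq': 'Ga,Ti'}},
#              {'$or': [{'nelements': {'$eq': 3}}, {'nelements': {'$eq': 2}}]}]}
```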

rartino commented 4 years ago

In case anyone is interested, I'll add what my AST parser outputs for the same examples as @fekad posted above:

name="simple"

('=', ('Identifier', 'name'), ('String', '"simple"'))

NOT name="simple"

('NOT', ('=', ('Identifier', 'name'), ('String', '"simple"')))

property>23

('>', ('Identifier', 'property'), ('Number', '23'))

string="some string"

('=', ('Identifier', 'string'), ('String', '"some string"'))

name="aa" AND NOT name="bb"

('AND',
 ('=', ('Identifier', 'name'), ('String', '"aa"')),
 ('NOT', ('=', ('Identifier', 'name'), ('String', '"bb"'))))

fekad commented 4 years ago

{'AND': [{'PROPERTY': {'name': 'aa'}}, {'NOT': {'PROPERTY': {'name': 'bb'}}}]}

Looking at this string and knowing nothing about the implementation you use, it is completely unclear to me what the different sorts of braces ('[' vs. '{') mean, and why the ':' is necessary at all. As an implementation-unaware user I will use parentheses '(' for the outer level and mix '[' and '{' freely if they are available.

You are right, it was not defined properly; I wrote them by hand just to illustrate the differences. The format that I wanted to use there is JSON. In the example, 'AND' takes a list whose elements are all in an 'and' relation with each other. It is just a common trick to avoid complexity by flattening a repetitive nested structure.

If you want something that is extremely easy to parse and you do not care so much about readability, then such a notation was invented in the '60s, and it is the S-expression. Why would you want to reinvent the wheel?

I completely agree, and I do not want to reinvent anything; I would just use JSON as the format. (I will write about this choice later.)

fekad commented 4 years ago

A related point is that during the last physical meeting I suggested that we should agree on a standard json/dictionary representation of the filter to use internally in our codes to simplify code exchange between different efforts. As far as I know, the internal format of optimade-python-tools is a dictionary-based one,

So this is a new standard, isn't it?

You can happily have a standardised intermediate representation of a parse tree (many compiler toolkits do), but this is not a part of the filter language, and it should not be in OPTIMADE, IMHO. It's a different standard, and I suggest that it is developed separately.

Tying things to Python might seem to make life easier for Python coders, but makes life more difficult for everyone else.

but it is also tied to the lark parser

So implementation dependent, no?

I can only comment on the technical part here. The definition/specification of the content of the JSON would be independent of the implementation. The lark parser (https://github.com/lark-parser/lark) is just the name of the Python package, which returns a tree-like object. So in that case, the implementation of a custom AST would be trivial. Of course, everybody can use any language (https://i.imgur.com/ZyeCO.jpg).

fekad commented 4 years ago

You can happily have a standardised intermediate representation of a parse tree (many compiler toolkits do), but this is not a part of the filter language, and it should not be in OPTIMADE, IMHO. It's a different standard, and I suggest that it is developed separately.

I would agree with this -- I actually think the proposed syntax is quite reasonable, but the whole point of having a formal Lark grammar is to provide a single source of truth, and introducing a second format (even one that is 1-to-1 equivalent) is muddying the waters. It should be down to the individual implementations to resolve the request into a format they can more easily work with, and this is why e.g. optimade-python-tools is so valuable.

If you want something that is extremely easy to parse and you do not care so much about readability, then such a notation was invented in the '60s, and it is the S-expression. Why would you want to reinvent the wheel?

I'm also a fan of S-expressions, but given that this is a REST API and JSON is the web's native interchange format, I wouldn't frame it as reinventing the wheel. That is to say, it would not just benefit Python developers, but also web developers in general, which will be a major application for this API.

Indeed it is muddying the waters, but I would say this issue is more about the interchange format than the internal representation. Although the grammar precisely defines how the string can be interpreted, it doesn't tell you anything about its meaning. Each "keyword" (AND, NOT, HAS, etc.) has its own definition separately in the specification. I know it is not that black and white; the only thing I wanted to say is that you can have a grammar+string or a JSON definition, and on top of this you still need to define the meaning of their content.

Although the S-expression is very compact and elegant, the JSON format naturally fits into the REST API / JSON Schema framework.

fekad commented 4 years ago

I would also say that the dictionary-based representation would be most useful for newcomers (client and database developers) because it is much easier to interface with. Of course, you can say that we already have parsers for the query string and they can use those, but that is just another thing they need to learn, and not everybody wants to depend on others' code...

This I do not get. You will need a parser for the serialised representation of a tree, won't you? And a separate standard to specify how operators, values, and properties map to it, right? So you will call a parser of this JSON thing and get a value (object, tree, hash aka dictionary, array, etc.) out of it, as specified by the interface of the parser you use. So how is this different from parsing the current filter string?

This is one of the main points here, I think. If you use JSON you do not need to implement any parser; you just need to interpret its content. Nobody wants to (or will) write a JSON parser from scratch, but right now everybody has to have a (usually quite complicated) filter parser. And it is not just about the parser. The client is forced to construct a difficult string (like list1:list2:... HAS ONLY val1:val2:...), which is far from trivial. The dictionary data structure allows you to incrementally build up an object that you can pass to the server. Although it is possible, it is quite tricky to do the same with (usually immutable) data structures like strings.

Most people who work on web-based projects are already used to working with dictionary data structures. I'm just saying the barrier for any newcomer is significantly lower if they can jump in and use one of the "well-known" formats rather than starting by building a parser for a grammar.
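
As a purely illustrative sketch of that incremental building (the endpoint URL and the filter_json parameter below are hypothetical; nothing like them is in the spec), using the dictionary schema from my earlier examples:

```python
import requests

# Build the query object piece by piece; no string surgery needed.
filter_obj = {'AND': []}
filter_obj['AND'].append(
    {'GT': {'property': {'PROPERTY': {'name': 'nelements'}}, 'value': 2}})
filter_obj['AND'].append(
    {'NOT': {'PROPERTY': {'name': 'band_gap'}}})  # added later, independently

# Hypothetical POST endpoint accepting a JSON filter object:
response = requests.post('https://example.org/optimade/v1/structures',
                         json={'filter_json': filter_obj})
```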

Whether the filter parser is complicated or not is another issue, but that is completely irrelevant here. You call either parser as a black box and get the specified result out of it. How it works, you do not care at the point of use.

I think it is very relevant to see how much effort you need to put in to develop a parser, versus just using one of the existing JSON parsers that are available for basically any programming language.

When you implement the parser, there might be some languages that are easier to parse and some more difficult. So you may choose to implement your own parser (if you do not want to depend on other people's code) or to use someone else's parser (if you are confident it will remain stable and reliable, and you do not want to roll your own thing). I see this as the usual trade-off you make each time you pick a library, a system, or a language.

Now, if you introduce yet another filter representation, you will have two parsers to support instead of just one, two dependencies, two pieces of code in RAM, etc... What advantage does this give you?

That's the point: in the case of JSON I don't need to implement a parser or a string generator on the client side. I think it is not realistic that somebody who wants to join would start by implementing any parsers.

fekad commented 4 years ago

Thanks for all the comments @sauliusg, @rartino, @mkhorton, @ml-evs, @CasperWA, @merkys! I will try to summarize my thoughts: I would say that it is only worth considering the introduction of a new representation if it is highly beneficial. If we would only use it as an internal representation, I would drop the issue immediately. If we want to use it as an interchange format over the internet, I'm convinced that JSON is the best and only realistic choice.

Don't get me wrong: I think the string is as important as the JSON format, but they have different purposes. As a human, it is easier to create a string, but for clients and their developers it is far more convenient to use JSON.

sauliusg commented 4 years ago

That's a good question. I would say each representation has its own audience (human vs. machine). It is not a big issue right now because there are only a few clients, but that might change in the near future. So I would say both of them should be labelled as MAY, and over time we will see which representation is preferred.

I'm afraid marking both the current infix filter notation and whatever alternative is rolled out as MAY defeats the whole purpose of having OPTIMADE... This would essentially mean that every client must implement both query formats, plus code to find out which one is supported! Sounds like an awful mess and complication out of nothing.

What is actually the problem we would try to solve introducing all this complexity?

Currently, filter support is a MUST feature (within the minimal supported filter semantics), and I can easily make queries which are expected to work with any OPTIMADE server. Why would we want to abandon this compatibility?

sauliusg commented 4 years ago

Although the grammar precisely defines how the string can be interpreted, it doesn't tell you anything about its meaning.

Well, in my view "how the string can be interpreted" == "meaning", so the above statement sounds to me like a contradiction.

The grammar, indeed, describes the syntax of the strings (i.e. what they look like), but not their meaning (semantics, i.e. how they are to be interpreted). For the semantics, we have the rest of the OPTIMADE spec, which hopefully does the job, even if in a less formal way.

Each "keyword" (AND NOT, HAS, etc ) has its own definition separately in the specification.

Yes.

I know it is not that black and white; the only thing I wanted to say is that you can have a grammar+string or a JSON definition, and on top of this you still need to define the meaning of their content.

Yes indeed. The JSON representation does exactly the same as the filter representation, just using a different syntax (grammar). So what's the point in duplicating?

sauliusg commented 4 years ago

If we want to use it as an interchange format over the internet, I'm convinced that JSON is the best and only realistic choice.

Some 20 years ago people would probably have said 'XML is the best and only realistic choice'. And before that, s/XML/ASN.1/g.

My point is that JSON is one of many available serialization formats. There is nothing inherent in JSON that would make it superior (or inferior, for that matter) to other formats. JSON is, no doubt, good for the purposes we use it for and fairly well designed, but so is XML. And so is CIF.

A well-designed protocol is carrier-format neutral, and REST style, as far as I understood from R. Fielding's texts, strives to be neutral and versatile in this respect. So should we.

sauliusg commented 4 years ago

The addition of POST methods, especially with a focus on a well-defined JSON query structure to send, would make it very easy to implement a GraphQL version of the OPTIMADE spec; it would almost effectively already be there. Supporting this parameter in GET requests makes sense to me as well. However, the POST method would need to be supported for use with GraphQL (as far as I understand).

I would think that just adding POST is not enough; you need a backend that handles GraphQL?

sauliusg commented 4 years ago

I think it is very relevant to see how much effort you need to put in to develop a parser, versus just using one of the existing JSON parsers that are available for basically any programming language.

I thought you did not want to depend on someone else's (JSON) parser? ;)

sauliusg commented 4 years ago

This is one of the main points here, I think. If you use JSON you do not need to implement any parser; you just need to interpret its content.

Well, what does 'you just need to interpret its content' mean? I would argue it is nothing other than parsing with (someone else's) parser:

```
use json_parser;
tree = json_parser.parse_json(filter);
# you get a tree representation of the filter
```

How is it different from:

```
use filter_parser;
tree = filter_parser.parse_filter(filter);
# you get a tree representation of the filter
```

And in both cases you need to interpret the tree semantics...

So, for the client programmer, what is the difference?

sauliusg commented 4 years ago

The client is forced to construct a difficult string (like list1:list2:... HAS ONLY val1:val2:...), which is far from trivial. The dictionary data structure allows you to incrementally build up an object that you can pass to the server.

I do not buy this argument. What prevents you from building a tree incrementally and then calling a method to serialise it? Viz.:

```
tree = new AST();
tree.add(list1).add(list2);
filter = tree.serialise_as_filter();
```

for JSON, you would presumably do:

```
tree = new AST();
tree.add(list1).add(list2);
filter = tree.serialise_as_json();
```

Big deal?

sauliusg commented 4 years ago

TL;DR – it's a long rant on filters, but if you are interested in my rationale behind the current design and my views on the other variants, then please read on... :)

I would say each representation has its own audience (human vs. machine). It is not a big issue right now because there are only a few clients, but that might change in the near future. So I would say both of them should be labelled as MAY, and over time we will see which representation is preferred.

Maybe I should clarify a bit the intent of the current filter spec. Back at the beginning of OPTIMADE, we were thinking about a classical REST interface that would be a) concise, to fit the query string; b) human-readable, to allow queries to be constructed without any special software; c) easy (for computer scientists ;) to parse. After quite a bit of deliberation we arrived at the current filter language. It has the following desired features:

a) it is specified in a standard EBNF notation, not tied to any parser or parser generator;

b) since EBNF is described in EBNF itself, a parser can be constructed for it; and then another parser for the EBNF-described grammar can be generated for different parser generators such as Yacc, Bison, Lark, Grammatica, ANTLR, etc... (for testing the spec, but probably not for production implementation);

c) it is fairly human-readable and hopefully familiar (it uses the usual mathematical infix notation);

d) it is concise and fits in a query string;

e) it is unambiguous in parsing and (hopefully) in semantics; thus you can know for sure what a>0 AND (b=1 OR c HAS "He") means;

The parse trees produced by various parsers are supposed to be a matter of implementation, and there can be many different trees representing the same string, especially since many different data structures can be used by the backend. The very explicit intention is not to tie the filter specification to any particular parser generator, implementation or internal representation. It should be equally easy for all backends to parse and use.

What you are essentially suggesting is that we now design a completely new serialisation to represent queries. You suggest using JSON as a vehicle for such serialisation.

I do see a point in re-using existing JSON parsers to parse a filter request; it lets you get away with using just one parser in your code instead of two, and JSON parsers are more widespread at the moment. But this is IMHO a quite weak benefit; memory is cheap, and we are using interpreted languages that are very inefficient in this respect anyway. Parsers for filters are simple and are either implemented or on the way. Reusing just a JSON parser will not solve your problem; you will have to standardize the JSON request format, and then (since you cannot make sure that it fits your internal representation) you will have to transform it into a representation your libraries use.

So while the benefits of JSON do exist, IMHO they are small, and if we adopt JSON we lose (c) and (d), which is IMHO a bigger problem. And having two equivalent specifications seems to me like overkill.

If we had opted for an explicit tree representation from the very beginning, we could have done that, but then the filter language would not be needed. Instead we would have specified the tree structure and picked a different grammar. And then I personally would have gone for S-expressions, as a simpler alternative, and not for JSON. In that case the filter grammar would not be necessary; instead we would have had to specify a fully parenthesised prefix form. Or a reverse Polish postfix form. Or any other tree representation you wish. But since we have already decided on the infix notation, why should we change it? I guess we will never arrive at a finished spec if we change basic principles each time a new implementation arrives, just for the convenience of that new implementation. The whole purpose of the spec and filter grammar is that it is implementation independent.

As for the argument that "people are familiar only/mostly with JSON", I do not buy it. It only applies to very beginners, and only at this particular point in history. Some years later people will only be familiar with YAML. Or with CSV. Or with whatever the "cool" fad of that time is. But there are many protocols/languages/formats out there, and I think we the OPTIMADE developers :) should pick the one that has the most benefits, not the one that happens to be the most familiar.

What should be present are reliable parser libraries for most programming systems in use: A(dd your favourite), C, C++, Java, JS, Perl, Python (in alphabetic order :)

Dumping a parse tree is tempting, and we have done this ourselves (Andrius has written, for instance, cif2json; see the example output 2200000.json ;), but this is not a standardized way to exchange data. For example, cif2json produces a JSON that closely matches our internal representation, but has nothing to do with OPTIMADE. I have noticed that internal parse trees that are convenient for one purpose are awkward for another, and thus need to be transformed before use; the transformation seems to be more work than the parsing (parsing theory is well established and automatic parser generators are known; not so for tree transformation and semantics, where a lot has to be done manually). Thus, offering an internal tree representation sounds to me like going to a lower level of representation (lower level in terms of computer science, i.e. less abstract) and therefore exposes the "guts" of the processing system. This is dangerous because the next time you need to change your system you are stuck.

I am not sure at the moment whether an AST will be an unambiguous enough match for the filter string. On one hand, since our grammar should be unambiguous, there must exist a parse tree that matches the production sequence used for deriving/analysing the filter string. On the other hand, an AST removes some syntactic elements (e.g. parentheses), and this may give different resulting trees. For instance, is a AND b represented by (AND a b), (Expression (AND a b)), or (Expression (Operator AND) (Variable a) (Variable b))? Are these forms equivalent? When the defining grammar is changed by reorganising productions (but producing the same filter language), is your AST representation supposed to change or not? Can you assume associativity of AND and produce a tree (AND a b c) from a AND b AND c? Internally, I can transform (AND a (AND b c)) into (AND a b c) since I know AND is associative (in my implementation), and I know my backend generator will understand (AND a b c); but what about yours? Will it choke on a multiple-argument AND? Do we want to start specifying such things? Will this prevent other efficient implementations that would be otherwise perfectly correct?
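
To make the associativity question concrete, here is a toy flattening pass (my own illustration, in the tuple AST shape from rartino's examples, not from any OPTIMADE implementation) that turns nested binary ANDs into a single n-ary AND; a backend that only understands binary AND would need the inverse transformation:

```python
def flatten_and(node):
    # Recursively absorb nested ('AND', ...) children into one n-ary AND.
    if isinstance(node, tuple) and node and node[0] == 'AND':
        flat = ['AND']
        for child in node[1:]:
            child = flatten_and(child)
            if isinstance(child, tuple) and child and child[0] == 'AND':
                flat.extend(child[1:])
            else:
                flat.append(child)
        return tuple(flat)
    return node

nested = ('AND', ('Identifier', 'a'),
          ('AND', ('Identifier', 'b'), ('Identifier', 'c')))
print(flatten_and(nested))
# -> ('AND', ('Identifier', 'a'), ('Identifier', 'b'), ('Identifier', 'c'))
```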

fekad commented 4 years ago

Although the grammar precisely defines how the string can be interpreted, it doesn't tell you anything about its meaning.

Well, in my view "how the string can be interpreted" == "meaning", so the above statement sounds to me like a contradiction.

The grammar, indeed, describes the syntax of the strings (i.e. what they look like), but not their meaning (semantics, i.e. how they are to be interpreted). For the semantics, we have the rest of the OPTIMADE spec, which hopefully does the job, even if in a less formal way.

Each "keyword" (AND NOT, HAS, etc ) has its own definition separately in the specification.

Yes.

I know it is not that black and white; the only thing I wanted to say is that you can have a grammar+string or a JSON definition, and on top of this you still need to define the meaning of their content.

Yes indeed. The JSON representation does exactly the same as the filter representation, just using a different syntax (grammar). So what's the point in duplicating?

I will try to comment on the technical questions one by one, but I will summarize my thoughts about the philosophical part in a single comment.

I think we completely agree that the parsed strings and parsed JSONs contain the same level of information.

fekad commented 4 years ago

The addition of POST methods, especially with a focus on a well-defined JSON query structure to send, would make it very easy to implement a GraphQL version of the OPTIMADE spec; it would almost effectively already be there. Supporting this parameter in GET requests makes sense to me as well. However, the POST method would need to be supported for use with GraphQL (as far as I understand).

I would think that just adding POST is not enough; you need a backend that handles GraphQL?

I didn't comment on this before because it is a completely different issue. GraphQL is a possible alternative to the REST API. GraphQL's query language is not even a valid JSON object (the response is JSON, but the query format isn't). There is a separate issue about it, #48, and I think it would make sense to continue the GraphQL-related discussions there.

fekad commented 4 years ago

I think it is very relevant to see how much effort you need to put in to develop a parser, versus just using one of the existing JSON parsers that are available for basically any programming language.

I thought you did not want to depend on someone else's (JSON) parser? ;)

I think there is a huge difference between depending on code which is developed at a global scale (as for JSON) and OPTIMADE-specific grammar parser codes which are developed by very talented and clever, but only a handful of, people.

Personally, I have no problem using JSON parsers which are used by millions and are very well maintained (and in many cases, there is a high chance that one is natively supported by your choice of programming language).

On the other hand, depending on an OPTIMADE-specific grammar parser code is a different situation. You either have to wait for feature releases and bug fixes, etc., or you have to develop your own.

fekad commented 4 years ago

This is one of the main points here, I think. If you use JSON you do not need to implement any parser; you just need to interpret its content.

Well, what does 'you just need to interpret its content' mean? I would argue it is nothing other than parsing with (someone else's) parser:

```
use json_parser;
tree = json_parser.parse_json(filter);
# you get a tree representation of the filter
```

How is it different from:

```
use filter_parser;
tree = filter_parser.parse_filter(filter);
# you get a tree representation of the filter
```

And in both cases you need to interpret the tree semantics...

So, for the client programmer, what is the difference?

The only difference is the time and effort that you need to spend to get exactly the same result.

fekad commented 4 years ago

The client is forced to construct a difficult string (like list1:list2:... HAS ONLY val1:val2:...), which is far from trivial. The dictionary data structure allows you to incrementally build up an object that you can pass to the server.

I do not buy this argument. What prevents you from building a tree incrementally and then calling a method to serialise it? Viz.:

```
tree = new AST();
tree.add(list1).add(list2);
filter = tree.serialise_as_filter();
```

for JSON, you would presumably do:

```
tree = new AST();
tree.add(list1).add(list2);
filter = tree.serialise_as_json();
```

Big deal?

In many programming languages, there is native support for dictionary-like objects, so you do not even need to implement your own AST object. Also, in many cases you do not need to implement your own serialiser either, because it is already available for you. Personally I think this is important, but I don't want to convince anybody here; I'm just saying that there is usually a difference of a few factors between the two solutions in terms of time and effort.

fekad commented 4 years ago

TL;DR – it's a long rant on filters, but if you are interested in my rationale behind the current design and my views on the other variants, then please read on... :)

[...full comment quoted above...]

I can see @sauliusg's points, and I almost completely agree with the design choices of the filter language.

The main purpose of this issue is to suggest that it would make sense to allow the usage of the JSON format for queries, especially since we are already using JSON for basically everything else (OpenAPI, JSON Schema for the responses, representing the results).

Unfortunately, I cannot see any chance of reaching a conclusion, because this is about two formats which have different purposes, pros, and cons.

The only suggestion that I can think of is that we could create something like official/unofficial "developer notes" where we can list or agree on some best practices and experimental features like this one. If any of these ideas turns out to be useful, then it can be transferred into the specification. I know PRs roughly have the same purpose, but I'm afraid in some cases the conversation goes toward philosophical questions and any possibility of agreeing on the technical/practical questions is completely blocked.

My opinion is that all of the issues mentioned above (the precise meaning of the content of the JSON object, ambiguity, precedence) can be resolved easily. There are many working examples out there (MongoDB, Elasticsearch, etc.) from which we can learn.

The proposed content related to this issue would be something like this:

* The providers can accept JSON queries only by the POST method
  * It is not realistic that anybody wants to build a JSON query by hand
  * If we allow the usage of the POST method, its format has to be specified and implemented as JSON anyway, so adding an extra attribute (like filter_json - but its name can be decided later :) ) wouldn't be an issue
* The specification of the meaning of the JSON structure.

sauliusg commented 4 years ago

The proposed content related to this issue would be something like this:

* The providers can accept JSON queries only by the POST method

  * It is not realistic that anybody wants to build a JSON query by hand

That's not true (I routinely have to do this for testing), but hardly relevant.

  * If we allow the usage of the POST method, its format has to be specified and implemented as JSON anyway, so adding an extra attribute (like filter_json - but its name can be decided later :) ) wouldn't be an issue

* The specification of the meaning of the JSON structure.

I see good technically valid points in your suggestion to have queries posted in JSON by POST.

What bothers me is the following:

  • the mechanism essentially duplicates the existing filter mechanism in the specification; such duplication is always bad in most respects; the gain is some (small) convenience for programmers (which I would even doubt if we had it on our backend; rather the opposite, if we have to implement yet another query mechanism);
  • the development of the spec is not a small amount of work, and it is not, as you say, the case that the "issues mentioned above can be resolved easily" – it took 4 years to agree on the current spec, and I do not see how JSON queries will be any easier.

So, if you were to develop your own API and settle on JSON POST as the only mechanism for submitting queries, I would say it's fine. But adding such a mechanism on top of an already complex specification like OPTIMADE seems to me of dubious value – rather a drawback.

We should try to keep things conceptually simple and not overload them with duplicate features. I would personally start thinking about what we can remove from the OPTIMADE spec, not what we can add. :)

sauliusg commented 4 years ago

GraphQL is a possible alternative to the REST API. GraphQL's query language is not even a valid JSON object

Just to make sure that we are not having misunderstandings here:

sauliusg commented 4 years ago

There are many working examples out there (MongoDB, Elasticsearch, etc.) from which we can learn.

Maybe we should learn from SOAP as well, not to be biased?

sauliusg commented 4 years ago

In many programming languages, there is native support for dictionary-like objects, so you do not even need to implement your own AST object.

Of course you don't. 'new AST' was an abstract example – replace it with any object that does the job on your system.

The key feature here is 'serialise_as_filter'. You have to write it once. Yet it is such a simple tree traversal that I never assumed it would be such a big problem. I would say it takes less time to write the traversal code than all the posts we have written in this thread... And then you write it once and forget it. After the 1.0 release, the grammar is not supposed to change (and even if it changes, it will most probably change the tree, not the tree traversal code!).

fekad commented 4 years ago

The proposed content related to this issue would be something like this:

* The providers can accept JSON queries only by the POST method

  * It is not realistic that anybody wants to build a JSON query by hand

That's not true (I routinely have to do this for testing), but hardly relevant.

The only thing that I wanted to say here is that usually humans will use the string format with the GET method, and clients the JSON format with the POST method.

  * If we allow the usage of the POST method, its format has to be specified and implemented as JSON anyway, so adding an extra attribute (like filter_json - but its name can be decided later :) ) wouldn't be an issue

* The specification of the meaning of the JSON structure.

I see good technically valid points in your suggestion to have queries posted in JSON by POST.

What bothers me is the following:

  • the mechanism essentially duplicates the existing filter mechanism in the specification; such duplication is always bad in most respects; the gain is some (small) convenience for programmers (which I would even doubt if we had it on our backend; rather the opposite, if we have to implement yet another query mechanism);
  • the development of the spec is not a small amount of work, and it is not, as you say, the case that the "issues mentioned above can be resolved easily" – it took 4 years to agree on the current spec, and I do not see how JSON queries will be any easier.

So, if you were to develop your own API and settle on JSON POST as the only mechanism for submitting queries, I would say it's fine. But adding such a mechanism on top of an already complex specification like OPTIMADE seems to me of dubious value – rather a drawback.

We should try to keep things conceptually simple and not overload them with duplicate features. I would personally start thinking about what we can remove from the OPTIMADE spec, not what we can add. :)

Sorry, I never wanted to question the work that was done on the query language; actually, I am trying to say the opposite. All I'm saying is that if we agree on the structure of the JSON, we can map the current logic one-to-one, so we can reuse everything. The only thing I meant to say is that this one-to-one mapping is an easy task.

Nevertheless, I do not want to push this issue any further (there was no interest in it during the workshop either), so if you have any other comments please post them, but feel free to close it anytime; I'm completely happy with that too.

fekad commented 4 years ago

This issue will be closed because currently there is no need for another format to represent a query. A JSON format would cause more difficulty/complexity than benefit.