arthurhsu / rdb

Draft for W3C Relational Database Proposal
https://arthurhsu.github.io/rdb/

JSON based query format. #7

Open ghost opened 8 years ago

ghost commented 8 years ago

The current spec proposes both a builder pattern and a JSON format for creating schema definitions. Do you think it would be possible to create a similar JSON format for queries as well? The builder pattern would then be a more convenient and safer way to build this JSON description.

This would make it easier to build developer tools that construct and inspect queries, and would also make it easier to use RDB in cases where user-generated queries are desired.

Furthermore it would open up RDB to act as a foundation for other relational query languages. For example, I think there would be huge interest in the ClojureScript community to build a Datalog query engine on top of RDB.

Other databases that use this approach include ElasticSearch and Datomic.

Thoughts :)?

arthurhsu commented 8 years ago

That's an interesting idea. Something like this?

var insertStmt = { into: 'Table1', values: '?1' };

var query = db.prepare(insertStmt);
query.bind(['1', '2', '3']);
query.commit();

arthurhsu commented 8 years ago

Thinking more about it, the search condition will be problematic.

var selectStmt = {
  projectionList: ['tablea.name', 'tableb.name'],
  from: ['tablea', 'tableb'],
  where: ['tablea.id == abc']
};

This kind of thing is a big red X for me. We don't want to evaluate expressions from JSON, which is a big security concern.

ghost commented 8 years ago

Yeah similar to this :). As for security concerns, do you mean the 'tablea.id == abc' statement? I agree that there should be no string evaluation because of injection concerns.

ElasticSearch solves this by using a more complex JSON syntax, similar to this:

var selectStmt = {
  select: [{table: 'tablea', column: 'name'}, {table: 'tableb', column: 'name'}],
  from: ['tablea', 'tableb'],
  where: [
    {eq: [{table: 'tablea', column: 'id'}, {constant: 'abc'}]},
    {leq: [{table: 'tableb', column: 'some_int'}, {variable: 1}]}
  ]
};

Where {variable:1} would be bound at .bind() time.

While this makes the syntax more verbose, I don't think it's that much of a problem because users will not necessarily write it by hand but rather generate it through the builder pattern or tools like SQL parsers.

I think the biggest downside might be the performance impact of "parsing" the JSON when dynamically creating queries, which should be alleviated by .prepare()ing them, as you did in the example above. :)
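To make the prepare/bind flow concrete, here is a minimal hypothetical sketch (db.prepare(), query.bind() and query.exec() are assumed names carried over from the examples above, not confirmed spec API):

// Prepare the JSON query once, then bind and run it many times.
var ageStmt = {
  select: [{table: 'tableb', column: 'some_int'}],
  from: ['tableb'],
  where: [{leq: [{table: 'tableb', column: 'some_int'}, {variable: 1}]}]
};

var query = db.prepare(ageStmt);  // the "parsing"/validation of the JSON happens here
query.bind([42]);                 // {variable: 1} is resolved to 42
query.exec().then(function(rows) {
  console.log(rows);
});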

freshp86 commented 8 years ago

Allowing JSON based schema definition and queries is a bit concerning to me.

It basically creates a "meta-programming" way of doing things with JSON, instead of sticking with one canonical way (programmatic creation of queries using JS and the builder pattern). A good analogy is Ant configuration XML files vs. Gulp JS-based files. I find the latter much more readable and maintainable.

Once we go down the path of adding a meta-programming way of doing things, the question of why not just use SQL syntax directly comes up.

ghost commented 8 years ago

@freshp86

A good analogy is Ant configuration XML files vs. Gulp JS-based files. I find the latter much more readable and maintainable. Once we go down the path of adding a meta-programming way of doing things, the question of why not just use SQL syntax directly comes up.

Well, the JSON would serve as an intermediary representation, and would thus rarely be edited by users directly. It does, however, open up the possibility of different relational "frontends", e.g. the builder pattern (the canonical way of the spec), SQL strings (left to the reader as an exercise), nonrecursive Datalog with negation (left to the ClojureScript community ;) ).

As for why not have SQL strings as this intermediary: a data-based format is much more maintainable, more easily manipulated programmatically, and safer than strings.

I know plenty of people who shudder at builder patterns and love SQL strings, and vice versa. Having a relational system that is flexible enough to give all these people satisfaction seems like a win to me :).

The main reason why I favour the JSON intermediary over the builder pattern at the lowest level is that most developer tools around RDB will have to define such a data format anyway to do query manipulations, and having many incompatible ad-hoc versions for this seems like a missed opportunity. :)
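As a concrete (and purely hypothetical) illustration of the tooling point: a developer tool can rewrite a query expressed as plain data without any string parsing, something that is much harder to do safely with SQL strings. The query shape follows the ad-hoc examples earlier in this thread, and restrictToTenant is an invented helper, not spec API:

// A tool adds an extra predicate to an existing JSON query object.
function restrictToTenant(queryObj, tenantId) {
  var copy = JSON.parse(JSON.stringify(queryObj));  // simple deep copy
  copy.where = (copy.where || []).concat([
    {eq: [{table: 'tablea', column: 'tenant_id'}, {constant: tenantId}]}
  ]);
  return copy;
}

var scoped = restrictToTenant(selectStmt, 'acme');  // selectStmt as defined above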

arthurhsu commented 8 years ago

I think one benefit to call out here is the ability to transpile things across the board. The same JSON can easily be consumed by JavaScript, Java, Swift, C++, C#, or whatever programming language you name. As long as the behaviors are consistent, this is actually good for modern development that needs to ship cross-platform apps or share the same schema between backend and frontend.
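A rough sketch of what that sharing could look like (the schema shape here is invented for illustration, not taken from the spec): the definition is plain data, so the exact same JSON can be stringified on one side and parsed by any language's JSON library on the other.

// One schema definition, usable from any platform that can parse JSON.
var personSchema = {
  name: 'Person',
  columns: [
    {name: 'id', type: 'integer', primaryKey: true},
    {name: 'name', type: 'string'}
  ]
};

var wire = JSON.stringify(personSchema);  // ship to a Java/Swift/C++/C# backend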

freshp86 commented 8 years ago

I acknowledge some benefits of a JSON syntax, like transpilation, but I can't convince myself that a Web platform JS API needs to be modified drastically to accommodate other languages/platforms/environments.

It seems that the transpilation concern belongs to a higher layer outside of the RDB spec. Someone can write as many transpilers as needed to go from JSON to RDB, java.sql, plain SQL, Swift, C++, what-have-you, and they are free to use any single-source-of-truth canonical syntax they prefer instead of JSON (YAML?), IF they ever have such a need.

Making RDB's canonical syntax be the JSON seems unnecessary, since it is solving a problem that is outside of the spec's scope IMO. The downsides are more concrete than the non-concrete hypothetical benefits (assuming that people will actually share JSON files across platforms/languages/environments). Putting query/schema logic in a JSON "config file" follows the "code is data" mantra which I have not found appealing.

It also reminds me of The Configuration Complexity Clock article which IMO illustrates nicely how config files degenerate over time to their own domain specific language (DSL in the article).

arthurhsu commented 8 years ago

I think one thing to do here is to remove the existing JSON schema design. If JSON is desired, then it should be supported in a more consistent way than what's designed now (i.e. via the prepare() statement).
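Purely as a hypothetical sketch of what "a more consistent way via prepare()" could mean (all names invented for illustration), both schema definitions and queries would flow through the same entry point:

// One entry point for JSON, whether it describes schema or a query.
var createTableStmt = {
  createTable: 'Table1',
  columns: [{name: 'id', type: 'integer'}, {name: 'name', type: 'string'}]
};
var insertStmt = { into: 'Table1', values: '?1' };

db.prepare(createTableStmt).commit();

var insert = db.prepare(insertStmt);
insert.bind([{id: 1, name: 'abc'}]);
insert.commit();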

ghost commented 8 years ago

I acknowledge some benefits of a JSON syntax, like transpilation, but I can't convince myself that a Web platform JS API needs to be modified drastically to accommodate other languages/platforms/environments.

Being able to reuse queries across platforms is just a side benefit; like I said, I'm primarily concerned about developer tooling and other relational JS libraries.

It seems that the transpilation concern belongs to a higher layer outside of the RDB spec.

The builder pattern has to create some sort of tree internally for the optimiser to run on, right? All the JSON syntax does is lift this format higher up to the user. In a sense, a tree seems more fundamental to me than a builder pattern.

The downsides are more concrete than the non-concrete hypothetical benefits

Could you elaborate on the downsides you foresee?

(assuming that people will actually share JSON files across platforms/languages/environments).

I'm not arguing for JSON files; I see the JSON as being inlined into applications or generated by JS libraries.

Putting query/schema logic in a JSON "config file" follows the "code is data" mantra which I have not found appealing.

How am I supposed to argue against that ^^.

It also reminds me of The Configuration Complexity Clock article which IMO illustrates nicely how config files degenerate over time to their own domain specific language (DSL in the article).

To me it feels like the builder pattern is actually further down the clock, because it creates a DSL on top of a data structure that will be used internally anyway in order to perform query optimisation. I of course might be wrong in my assumption that such a tree will exist internally.

freshp86 commented 8 years ago

I apologize up-front for the long response. There is a summary in the last paragraph.

Being able to reuse queries across platforms is just a side benefit,

This side benefit comes at a very high maintenance cost. I'll explain this further below.

like I said, I'm primarily concerned about developer tooling and other relational JS libraries.

Regardless of how the proposed JSON is constructed (by humans or by a developer tool), from the viewpoint of the RDB spec it will be an external-facing API, with the caveat that this API does not come in the form of JavaScript; instead it comes in a specific DSL form that the JSON must conform to (like the example in a previous comment).

If JSON were to be accepted by the spec's API, it would look something like the following (names are chosen just for illustration purposes):

var jsonQueryObj = .....; // Retrieve a pre-made JSON object from a file or whatever.
var myQuery = db.createQueryFromJSON(jsonQueryObj);

The entire query related API is now stuffed in the jsonQueryObj. I find this very unfortunate, for the following reasons.

  1. There are now 2 ways of declaring a query, the builder pattern and the pre-cooked JSON file. Any future updates to the query API now have to update both mechanisms. Examples of such future updates are adding a new aggregator function like geomean, or deprecating an existing function.
  2. Yes, there is a tree representation under the covers, but firstly, the JSON example shown previously in this thread has nothing to do with the tree representation of a query; it is just an alternative syntax to the builder pattern.

The readability of the JSON-based vs the JS-based syntax is vastly different. Consider the following query:

var emp = db.getTable('employee');
db.select(emp.name.as('Full Name'), emp.age, emp.salary, emp.hireDate).
    from(emp).
    where(emp.age.lt(30)).and(emp.hireDate.lt(new Date(2000, 3, 4))).
    exec();

The above query is as self-documenting as possible, resembles familiar SQL syntax while avoiding any parsing, and uses standardized JS objects like Date; a win-win situation overall. I don't know exactly what the equivalent JSON query would look like, but based on previous examples in this thread, it looks like some completely new and verbose syntax that even devs familiar with SQL need to spend time to learn. BTW, how would the JSON indicate that the predicate is a Date? I know it is possible, but it will only make the JSON more verbose.
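For concreteness, a purely hypothetical JSON rendering of the same query, following the ad-hoc shape sketched earlier in this thread (not any agreed format), might look like the following; the Date has to be encoded explicitly, e.g. as a tagged ISO string, which only adds to the verbosity:

// Hypothetical JSON equivalent of the builder-pattern query above.
var selectStmt = {
  select: [
    {table: 'employee', column: 'name', as: 'Full Name'},
    {table: 'employee', column: 'age'},
    {table: 'employee', column: 'salary'},
    {table: 'employee', column: 'hireDate'}
  ],
  from: ['employee'],
  where: [
    {lt: [{table: 'employee', column: 'age'}, {constant: 30}]},
    {lt: [{table: 'employee', column: 'hireDate'},
          {constant: {type: 'date', value: '2000-04-04'}}]}
  ]
};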

Regarding the argument about exposing the underlying tree representation as an API, as a person who has implemented a query engine in JS from scratch (see Lovefield), this would make forward progress almost impossible. The underlying data structures must be private such that non-breaking changes in the underlying query engine are possible. Also, I don't see how it would be useful in any way other than debugging, which is already possible by calling explain() on a query.

To me it feels like the builder pattern is actually further down the clock, because it creates a DSL on top of a Data-structure that will be used internally anyways

A JS API is not a DSL, or to put it another way, this argument can be made about all JS APIs. Is the following a DSL, or just a builder pattern for arrays?

var myArray = [1, 2, 3, 5, null, 6].
    filter(function(item) { return item !== null; }).
    map(function(item) { return item * item; });

Just to summarize this long post: being concerned about developer tools seems premature, and I don't think it should be part of the 1st iteration of the RDB spec. Expanding the API in such a huge way (supporting a specifically formatted JSON or another syntax like YAML) at this point will only decrease the chances of RDB being adopted.

The main objective is to solve a problem that has no adequate solution so far. I am pretty confident that if the RDB spec picks up momentum, the JS community will come up with a multitude of ideas on how to improve it, and supporting tools will start to surface. At that point, it will be much clearer whether such tools could benefit from pushing some new APIs/functionality into the RDB spec itself.

ghost commented 8 years ago

Thank you for this well-reasoned post.

While I don't necessarily agree with your DSL definition (to me a builder pattern is always a DSL), I don't think that arguing about semantics there is going to bring us very far ^^.

I also wouldn't be too concerned about a data-driven API pinning down the implementation too much, because one can always add an intermediary transformation step. That said, all the query optimisers I wrote to compile Datalog to SQL used a nanopass compiler style with many intermediary languages, so I might be too biased toward a particular implementation.

I agree RDB adoption is more important. So let's see how that turns out first :)