use tokens before parsing

jaheba commented 10 years ago

I have digged a bit deeper into the code and have noted some issues, which are in my opinion are due to the fact, that the input string is not tokenised before. I have started to write one, which handles the input quite well. I hope I find time this or next week to port it into the sql2ast function.

Things which cause problems:

Select is not valid, only SELECT or select
ONand AS always needs to be written in capital letters
SELECT .. x = "some string"; .... "other string" results into ignoring the ; as a seperator between the queries

I think there are more issues with the current way the parser works, that are those I just found.

dsferruzza commented 10 years ago

Hi!

You are right: sql2ast implementation is really bad. At the time I wrote it, I didn't know much about parsers. I read some theory when I started to think about parsing conditions, so this part of the project is more reliable.

I was thinking to reimplement sql2ast from scratch. ast2sql also needs some cleanup, but it is easier.

As I don't have so much free time this month, I wasn't going to start this work before March. But if you need it sooner, we can try to work together! What do you think?

jaheba commented 10 years ago

The project is perfectly fine for the prototype I'm currently working on. At least for the moment. I actually don't know that much about parser either, but I think we can get the parser easily to a state, where it is more robust. Most of the logic is there, just not that reliable.

But in my opinion the question is, what should be the target we aim for. For example at the moment the following is true sql2ast('SELECT * FROM Table') == sql2ast('FROM Table Select *'). And I'm not sure if this is a bug or a feature. In the prototype I'm working on, we use partial SQL statements like WHERE x=3 and merge them with an existing sql-ast. If we would enforce too much in the parser, we could loose stuff like this.

So I'm not in a hurry at all and can hack in stuff if I need too, albeit I think, having a nice, handy and robust parser would be really nice. I am also quite busy this month, but march looks good for me (at least at the moment).

FYI, my current state of the lexer (just a quick implementation, but it works quite nice): https://gist.github.com/jaheba/8957181

dsferruzza commented 10 years ago

I agree: it works well for a prototype. I'm also starting to see the limits, so it definitely needs to be rewritten.

The feature sql2ast('SELECT * FROM Table') == sql2ast('FROM Table Select *') is a side effect because of the current stupid implementation of the parser. I'm not trying to make it parse that kind of invalid SQL, but I need it to be able to parse SQL while someone is typing.

Regarding to my needs, I don't care if the parser discards some parts of a SQL query it doesn't understand if it can parse the rest. Also, I only need not too complex CRUD queries support (I don't care about CREATE TABLE, for instance). Do you have some extra or more specific needs?

I read your lexer; it gives me a lot of ideas/questions, so it's a good start! For example: we should use higher level functions like Array.map, Array.filter, Array.reduce, ... when possible. That's something I did a little with the current parser, but not enough. These functions are nice because they have no side effect (they are massively used in functional programming).

I'm not an expert at writing parsers, but I read the first 3 lessons of this course: https://github.com/DmitrySoshnikov/Essentials-of-interpretation/ It is really interesting, so I recommend it!

Are you familiar with automated tests or Test Driven Development? Here I'm using QUnit to test if my parser gives the expected result. I'm using it through Grunt (with JSHint to check my JS syntax). I'm asking this because this workflow really helped the current parser to be more reliable. So I think it is something to keep.

Final question and I'll stop for now to let you answer: what do you think about the structure of the AST generated by the parser? (I think this is something crucial.)

jaheba commented 10 years ago

I'm not trying to make it parse that kind of invalid SQL, but I need it to be able to parse SQL while someone is typing.

Same here. But I think an option to have the query validated, would still be good. We want a live preview of the query, so filtering some invalid queries would help us, not sending them all to the database server.

My lexer is an approach to capture all tokens of sql, and I'm not sure, how viable it is in the end. However I tested some large sql queries I had and it did pretty ok.

Besides I am a fan of higher level functions, too; but I think one should not use them on self purpose.

I read some of the lessons and recognised the style you used for the conditions parser. I don't agree completely how the code is organised, but the rest was indeed interesting.

I am familiar with TDD, but not in Javascript (coming mainly from python). So I have a look into it.

The AST generated looks good for me and I like the JSON style.

dsferruzza commented 10 years ago

Today, I started to re-think the AST. The AST from the current v1 is not so bad, but it doesn't have a fixed structure. I mean the parser doesn't guarantee that a given field will exist at the root of the AST, you have to check.

So I did a quick example here: https://gist.github.com/dsferruzza/9529264 Main changes:

there is always a type field at the root level of the AST (which can be select, insert, ...)
according to its value, you can predict what fields will be at the root level of the AST
the AST contains raw expressions (full or partial) which can be useful when you don't need too much details about the query

What do you think of that? I will soon create a v2 branch to start working on it!

jaheba commented 10 years ago

It is over two month ago, since I last posted, but at least I made some progress.

There are basically two approaches to write a parser, either top-down, or bottom-up. Top-down is the "natural" approach, where you read tokens and then decide which rule to apply. That means, that for each token you know what has to be done next. Top-down parser can be written by hand, and the most common strategy is to use a LL(1) recursive decent parser. The problem you have, when using those is that you have to eliminate left-recursion from your grammar. The other big family of parser are bottom-up parser which are mostly implemented through LR(k) parser. The k stands for the number of lookaheads. These parser try to reduce a set of terminals to non-terminals, and then reduce them further. They are more powerful than LL-parser, however they use really huge tables, which are automatically generated and thus you can't really write one by hand.

The parser I have written (some sort of LL(1) parser) is far from being finished, but can parse some sql and supports stuff like subselects or nested brackets distinguishing between logical and mathematical operators. If you like I can share the current state.

I like the structure of your ast, however it is difficult to keep/generate the expression nodes and I don't really the advantage, since you can apply the to_s function to the specific node, to get the same result. I currently use a very simple structure, where each node only has a type-attribute and the rest is up to the specific node. But I'm quite receptive to change it for a more elaborated one.

dsferruzza commented 10 years ago

I'm happy to read from you!

About writing the parser, I am trying to choose between 3 solutions:

write a top-down parser by hand
use PEG.js (http://pegjs.majda.cz/)
use a parser combinator (like https://github.com/jayferd/parsimmon/)

So the next step for me is try to make a proof of concept with a parser combinator. It is an interesting solution because I wouldn't have to write everything by hand but it would still remain 100% pure JS. I'll make a Gist if I can make something cool and you are interested to see it.

About keeping expressions in nodes, the point was to let the user choose if he wants a fully parsed output, or just some pieces of strings. I can be useful if you build a UI on top of the parser and want to display some parts of the query as they are. But I agree, this is not very useful and can be complicated to implement. So I think I'll forget it for now, and reconsider it when we have something working.

Do not hesitate to share your work, I'll have a look!

dsferruzza commented 10 years ago

I pushed a new branch: https://github.com/dsferruzza/simpleSqlParser/tree/v2 It uses a parser combinator: https://github.com/jayferd/parsimmon/ There are still lots of problems to solve, but I hope to catch up v1's features.

dsferruzza commented 10 years ago

v2 is going well! I think I will release it soon! Did you have a look, @jaheba ? https://github.com/dsferruzza/simpleSqlParser/tree/v2

jaheba commented 10 years ago

What do I have to do to get require.js and Parsimon running?

dsferruzza commented 10 years ago

I don't really know require.js, so I can't tell for now... simpleSqlParser v2 just need an object called Parsimmon to be in the global scope, or it will require it. If you have specific needs, maybe we can find a more general solution :)

Also, maybe I can distribute a variant with an embedded version of Parsimmon. As it is the only dependency, it could make sense...

jaheba commented 10 years ago

I have finally found some time to test it out a bit and it looks good so far :) I'm not too much into parser generators but I like the approach.

However I think it is important to specify which dialect of sql you are supporting. Some DBMS support SCHEMAS additionally to TABLES and COLUMNS. HANA uses double quotes instead of backticks for identifier-strings. And I believe there are tons of stuff like that for every dialect. What a beautiful world we live in.

Do you plan to support expressions as ast-nodes?

dsferruzza commented 10 years ago

At first I was skeptical about parser combinators, but after trying (and having written a top-down parser by hand in the past) I really like the ease it offers to write parsers and think about them.

Parser generators offer similar mechanisms to build parsers, but I prefer (at least for now) parser combinators as they don't require a specific syntax (everything is valid JS) nor a special build step. But I guess it's a trade-off about performance/maintainability/ease of use...

About specifying which dialect of SQL is supported: I agree that it is important. But I don't know if I will have the time/motivation to do that... For now, the policy is:

insist on the fact that this is not an exhaustive SQL parser
use tests to give users some guarantees on what is supposed to be supported
grow according to my needs
let users that need more features open issues/PR

But I'm really open to discussion if you need more, have suggestions, or want to help!

Do you plan to support expressions as ast-nodes?

I'm not sure to understand the question... If you are talking about expression fields that are produced in the AST, they are here to:

offer a more global detail level (depending on what you want to do with the AST, it might be overkill and annoying to have a too much detailed AST)
make it more easy to re-create the SQL query from the AST (see ast2sql())

In some AST nodes, I only implemented the first level of parsing. That's why there is only an expression node and nothing else.

dsferruzza / simpleSqlParser

use tokens before parsing #6