Support for Sql Server dialect

cam-m commented 11 months ago

Hi @Lotes,

Thanks for your part in creating this project - I'm considering including it in a project I work on - the current feature set is already very compelling.

I'm keen to add sql-server dialect support and thought it best to raise it here first in case contributing such a feature to this project is possible/preferable. I've been able to easily support sql-server's data types, but it seems syntax specific features (e.g square bracket for quoting keywords/spaces in table/column names) will require a separate / extended parser.

I've not tried the langium lib itself yet but have some experience building editor plugins using antlr4. Any pointers would be much appreciated.

Lotes commented 11 months ago

Hi @cam-m !

Sorry for answering so late. Nice that you want to use and extend this project.

As stated in the core README we want to use a super-set approach, having all grammars in one grammar: The single dialect will be configured by selecting a subset of validators that will disallow parts of other dialects.

I can point to the Langium docs, the Langium repo, the ANTLR dialects for SQL. Sometimes it can help to have a certain standard, like SQL99. On top, we have a community here.

For the square brackets we can look together if we really need an own grammar or parser. What is the exact meaning of square brackets in SQL Server?

What are actually the parts you want to implement? Do you have any expectations? What should be possible. As the project grows, we can think about approaches how to arrange the code (by dialect or feature or ...).

Best wishes,

Markus

cam-m commented 11 months ago

Hi @Lotes, not at all, that was a very quick response in my book :) Glad to hear you're open to my adding some support.

Thanks for the info, I'll digest it all and get back to you if I have questions.

Square Brackets

In t-sql (ms sql server) 'square brackets' are tokens used to signify 'quoted identifiers'. Identifiers need to be 'quoted' when they are also keywords or they contain special characters.

MySql supports this using back ticks (which you've implemented in this project), Postgress uses dbl quotes I think. In Sql Server there is also a statement 'QUOTED IDENTIFIERS [ON|OFF]' that enables/disables 'double quote' syntax. The 'square brackets' syntax is always enabled.

I think adding language support would mean re-defining Identifier from:

Identifier returns string:
    ID | TICK_STRING
;

to

Identifier returns string:
    ID | TICK_STRING | QUOTED_ID | SQ_BRACKET_ID
;

I might also need to look into the completions service - the sql user's I've worked with typically use quoted identifer strings for all identifiers (at least in in sql server). Stands to reason they would want completed identifers quoted by default too. Thoughts?

In terms of features I'm interested in implementing, I work alongside people who use sql 'in anger' daily for complex reporting and data mining so I plan to monitor them for requirements like this. Would you want these kind of feature requests added as work items in this project as they come up, or prefer to discuss case by case? Happy to discuss :)

Thanks again, Cam

Lotes commented 11 months ago

Quoted identifiers, just add those redefinition. If there is a collision with strings, we might have to consider to redefine string to be only single quotes, let‘s see…

There is a completion provider you can overwrite. Look at the comments of the default implementation. I never made a custom one, but this is one pointer I can give.

You can add feature requests as you like. Discussions before implementation is a good idea.

cam-m commented 9 months ago

Hi @Lotes,

I've been sidetracked for a bit, and getting a chance to look at this again. You're right there is a collision with string literals due to both SQL Server and MySQL allowing double quotes for string literals via options.

Handling double quoted strings as quoted identifiers only would offer the widest dialect support, however this conflicts with the current definition. So as you anticipated I think the parser should only support single quote strings - thoughts?

Summary of review:

MySQL

Literals

Default: Single or Double Quotes allowed. ANSI_QUOTES ON: Singles Quotes only

Quoted Identifiers:

Default: Backticks ANSI_QUOTES ON: Backticks or Double Quotes

_ANSIQUOTES is OFF by default

SQL Server

Literals

Default: Single quotes only. QUOTED_IDENTIFIERS OFF: Single or Double Quotes allowed

Quoted Identifiers:

Default: Brackets or Double Quotes QUOTED_IDENTIFIERS OFF: Brackets only

_QUOTEDIDENTIFIERS is ON by default.

SQLite

This one is messy: "Application developers are encouraged to compile using -DSQLITE_DQS=0 in order to disable the double-quoted string literal misfeature by default"

Literals

Both single and double quotes (warning emitted if double quoted string is parsed as an literal)

Quoted Identifiers

Double quotes, but only if it matched an identifier, otherwise handled as a literal.

Postgres, DB2, Oracle

Literals

Single quotes only

Quoted Identifiers:

Double Quotes only

Lotes commented 9 months ago

Thanks for your detailed research. I think for the moment we can change the grammar to support only single quoted strings. I am not sure about these double quotes, since they are either literals or identifiers. I will ask our Langium guys, what a solution could be applied here.

Do you want to try making a PR? Or shall I make it? :-)

cam-m commented 9 months ago

@Lotes No worries, I'll send a pull request

Lotes commented 9 months ago

@cam-m I asked my team for the double-quoted strings literals vs identifiers. One strategy to add "identifier support" for double quotes strings is to handle them as string literals and override the LSP services for actions like Go to definition looking at occurrences of the double quoted string literal. But let's touch that only when it becomes relevant again. It is an effort to do this.

I would prefer to keep things simple and be pragmatic, choosing only one option - like the one that works for you at the moment.

cam-m commented 9 months ago

@Lotes no worries, as long as you're ok about breaking MySQL support - since the default setting interprets "" as a literal and I imagine many implementations will use the defaults. I'll go ahead on this basis.

Thoughts on a long term solution:

I think its generally preferable that ast nodes have a sensible type after parsing. Identifiers are integral to many aspects of language tooling (highlighting, scoping, completion, refactoring) so it might be problematic, or inefficient at least, for services to have to re-interpret string literals as identifiers.

I've solved a similar language problem before (using antlr4 though). This language had modes that determined whether a code line required a leading '#' and the mode could be set externally (i.e. an application config file) or within the code (#SET HashOff | HashOn).

My solution was to use rule guards and reference a custom boolean property on the parser. This property could be set either via a custom parser constructor, or via actions within a rule. The language could then start in either mode based on the constructor, and then dynamically toggle the relevant parser rules on/off according to the mode within a single source file when HashOff | HashOn was parsed.

I don't know langium well enough yet to know whether these features are supported - can custom properties be added to the parser and set via constructor/object creation (being js I assume yes) and can rule guards references parser properties (?)

Lotes commented 9 months ago

@cam-m Langium has no properties or macros, which would make this also possible but in a ugly manner. Langium has conditional parameters on rule alternatives. But with no input opportunity from outside the grammar AFAIK. I wrote it as idea to my team, and I might also do in the Langium community later. Your PR looks good so far.

Lotes commented 8 months ago

Merged & published

TypeFox / langium-sql