Closed by cam-m 8 months ago
Hi @cam-m !
Sorry for answering so late. Nice that you want to use and extend this project.
As stated in the core README, we want to use a superset approach that combines all dialects in one grammar. A single dialect is then configured by selecting a subset of validators that disallow the parts belonging to other dialects.
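To make the superset idea concrete, here is a minimal sketch in plain TypeScript. It does not use the Langium API; the names `Validator`, `validatorsByDialect` and `validate` are hypothetical and only illustrate how one shared AST shape could be narrowed per dialect by picking validators.

```typescript
// Hypothetical sketch: one shared node shape, dialects differ only in validators.
type Dialect = "mysql" | "sqlserver";
type IdentifierQuote = "none" | "backtick" | "bracket";

interface IdentifierNode {
  text: string;
  quote: IdentifierQuote;
}

// A validator returns an error message, or undefined if the node is fine.
type Validator = (node: IdentifierNode) => string | undefined;

const noBackticks: Validator = (node) =>
  node.quote === "backtick" ? "Backtick identifiers are not allowed in this dialect." : undefined;

const noBrackets: Validator = (node) =>
  node.quote === "bracket" ? "Bracket identifiers are not allowed in this dialect." : undefined;

// Each dialect selects the validators that forbid the other dialects' syntax.
const validatorsByDialect: Record<Dialect, Validator[]> = {
  mysql: [noBrackets],
  sqlserver: [noBackticks],
};

function validate(dialect: Dialect, nodes: IdentifierNode[]): string[] {
  const problems: string[] = [];
  for (const node of nodes) {
    for (const check of validatorsByDialect[dialect]) {
      const message = check(node);
      if (message !== undefined) {
        problems.push(message);
      }
    }
  }
  return problems;
}
```

With this shape, the parser accepts both `[order]` and `` `order` `` everywhere, and only the selected dialect's validators decide which one is reported as an error.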
As pointers, I can recommend the Langium docs, the Langium repo, and the ANTLR grammars for the SQL dialects. Sometimes it helps to work from a specific standard, like SQL99. On top of that, we have a community here.
For the square brackets, we can look together at whether we really need a separate grammar or parser. What is the exact meaning of square brackets in SQL Server?
Which parts do you actually want to implement? Do you have any expectations about what should be possible? As the project grows, we can think about how to arrange the code (by dialect, by feature, or ...).
Best wishes,
Markus
Hi @Lotes, not at all, that was a very quick response in my book :) Glad to hear you're open to my adding some support.
Thanks for the info, I'll digest it all and get back to you if I have questions.
In T-SQL (MS SQL Server), square brackets are tokens used to signify 'quoted identifiers'. Identifiers need to be quoted when they are also keywords or when they contain special characters.
MySQL supports this using backticks (which you've implemented in this project); PostgreSQL uses double quotes, I think. In SQL Server there is also a statement, SET QUOTED_IDENTIFIER ON | OFF, that enables or disables the double-quote syntax. The square-bracket syntax is always enabled.
I think adding language support would mean re-defining Identifier from:
Identifier returns string:
ID | TICK_STRING
;
to
Identifier returns string:
ID | TICK_STRING | QUOTED_ID | SQ_BRACKET_ID
;
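As a rough illustration of what the two new terminals would need to match, here is a sketch in plain TypeScript. The constant names mirror the terminals above, but the patterns are assumptions rather than the project's actual definitions, and treating a doubled closing character (`]]`, `""`) as an escape is also an assumption of this sketch.

```typescript
// Hypothetical patterns for the three quoted-identifier forms.
// Assumption: the closing character is escaped by doubling it.
const TICK_STRING = /^`(?:[^`]|``)*`$/;
const QUOTED_ID = /^"(?:[^"]|"")*"$/;
const SQ_BRACKET_ID = /^\[(?:[^\]]|\]\])*\]$/;

// Strips the quoting and undoes the doubled-character escapes,
// so all forms yield the same plain identifier name.
function unquoteIdentifier(text: string): string {
  if (TICK_STRING.test(text)) {
    return text.slice(1, -1).replace(/``/g, "`");
  }
  if (QUOTED_ID.test(text)) {
    return text.slice(1, -1).replace(/""/g, '"');
  }
  if (SQ_BRACKET_ID.test(text)) {
    return text.slice(1, -1).replace(/\]\]/g, "]");
  }
  // Unquoted identifiers are returned unchanged.
  return text;
}
```

Normalising the name at one place like this would let scoping and linking compare `[Order]`, `"Order"` and `` `Order` `` against the same declaration.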
I might also need to look into the completion service - the SQL users I've worked with typically use quoted identifier strings for all identifiers (at least in SQL Server). It stands to reason they would want completed identifiers quoted by default too. Thoughts?
In terms of features I'm interested in implementing, I work alongside people who use SQL 'in anger' daily for complex reporting and data mining, so I plan to monitor them for requirements like this. Would you want these kinds of feature requests added as work items in this project as they come up, or would you prefer to discuss them case by case? Happy to discuss :)
Thanks again, Cam
For quoted identifiers, just add that redefinition. If there is a collision with strings, we might have to consider redefining strings to allow only single quotes. Let's see…
There is a completion provider you can override. Look at the comments of the default implementation. I have never made a custom one, but this is one pointer I can give.
You can add feature requests as you like. Discussing them before implementation is a good idea.
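Independently of how the provider is overridden, the quoting step itself could look roughly like the sketch below. It is plain TypeScript with hypothetical names (`QuoteStyle`, `quoteIdentifier`, `toCompletionLabels`); it does not call any Langium API, and the doubled-character escaping is an assumption.

```typescript
// Hypothetical helper: turns plain names into quoted completion labels.
type QuoteStyle = "backtick" | "double" | "bracket";

function quoteIdentifier(name: string, style: QuoteStyle): string {
  if (style === "backtick") {
    return "`" + name.replace(/`/g, "``") + "`";
  }
  if (style === "double") {
    return '"' + name.replace(/"/g, '""') + '"';
  }
  // "bracket": only the closing bracket needs escaping (assumed: by doubling).
  return "[" + name.replace(/\]/g, "]]") + "]";
}

// A custom completion provider could map candidate names through this
// before offering them to the user.
function toCompletionLabels(names: string[], style: QuoteStyle): string[] {
  return names.map((name) => quoteIdentifier(name, style));
}
```

For SQL Server users who quote everything, the style would simply default to `"bracket"`.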
Hi @Lotes,
I've been sidetracked for a bit, and am only now getting a chance to look at this again. You're right, there is a collision with string literals, because both SQL Server and MySQL allow double quotes for string literals via options.
Handling double quoted strings as quoted identifiers only would offer the widest dialect support, however this conflicts with the current definition. So as you anticipated I think the parser should only support single quote strings - thoughts?
Summary of review:
String literals. Default: single or double quotes allowed. ANSI_QUOTES ON: single quotes only.
Identifiers. Default: backticks. ANSI_QUOTES ON: backticks or double quotes.
ANSI_QUOTES (MySQL) is OFF by default.

String literals. Default: single quotes only. QUOTED_IDENTIFIER OFF: single or double quotes allowed.
Identifiers. Default: brackets or double quotes. QUOTED_IDENTIFIER OFF: brackets only.
QUOTED_IDENTIFIER (SQL Server) is ON by default.

This one (SQLite) is messy: "Application developers are encouraged to compile using -DSQLITE_DQS=0 in order to disable the double-quoted string literal misfeature by default"
String literals: both single and double quotes (a warning is emitted if a double-quoted string is parsed as a literal).
Identifiers: double quotes, but only if the text matches an identifier; otherwise it is handled as a literal.

String literals: single quotes only.
Identifiers: double quotes only.
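The ANSI_QUOTES and QUOTED_IDENTIFIER rows of this summary can be condensed into a small decision function. This is only a sketch in plain TypeScript; the option shapes and the name `classifyDoubleQuoted` are hypothetical and exist purely to show why a single fixed token definition cannot satisfy both defaults.

```typescript
// Hypothetical model of the two settings that change the meaning of "...".
type TokenKind = "string" | "identifier";

interface MySqlOptions {
  dialect: "mysql";
  ansiQuotes: boolean; // OFF (false) by default
}

interface SqlServerOptions {
  dialect: "sqlserver";
  quotedIdentifier: boolean; // ON (true) by default
}

type DialectOptions = MySqlOptions | SqlServerOptions;

// Decides how a double-quoted token is interpreted under the given settings.
function classifyDoubleQuoted(options: DialectOptions): TokenKind {
  if (options.dialect === "mysql") {
    return options.ansiQuotes ? "identifier" : "string";
  }
  return options.quotedIdentifier ? "identifier" : "string";
}
```

With the defaults, MySQL yields `"string"` and SQL Server yields `"identifier"` for the same input, which is exactly the collision described above.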
Thanks for your detailed research. I think for the moment we can change the grammar to support only single-quoted strings. I am not sure about these double quotes, since they can be either literals or identifiers. I will ask our Langium guys what solution could be applied here.
Do you want to try making a PR? Or shall I make it? :-)
@Lotes No worries, I'll send a pull request
@cam-m I asked my team about double-quoted string literals vs. identifiers. One strategy for adding "identifier support" to double-quoted strings is to handle them as string literals and override the LSP services for actions like Go to definition, looking at occurrences of the double-quoted string literal.
But let's touch that only when it becomes relevant again. It takes some effort to do this.
I would prefer to keep things simple and be pragmatic, choosing only one option - like the one that works for you at the moment.
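For reference, the "treat it as a literal, resolve it later" strategy could be sketched as follows. This is plain TypeScript with hypothetical names (`NamedElement`, `resolveDoubleQuoted`); it does not reflect how Langium's LSP services are actually overridden, and the case-insensitive comparison is an assumption.

```typescript
// Hypothetical sketch: a double-quoted literal is looked up among known
// declarations only when an LSP action (e.g. Go to definition) asks for it.
interface NamedElement {
  name: string;
  line: number;
}

function resolveDoubleQuoted(literal: string, declarations: NamedElement[]): NamedElement | undefined {
  const quoted = literal.length >= 2 && literal[0] === '"' && literal[literal.length - 1] === '"';
  if (!quoted) {
    return undefined;
  }
  // Strip the quotes and undo doubled-quote escapes.
  const name = literal.slice(1, -1).replace(/""/g, '"');
  // Assumption: names are compared case-insensitively.
  for (const declaration of declarations) {
    if (declaration.name.toLowerCase() === name.toLowerCase()) {
      return declaration;
    }
  }
  // No match: the token stays an ordinary string literal.
  return undefined;
}
```

The cost mentioned above shows up here: every service that cares about identifiers would need a lookup like this instead of reading the node type.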
@Lotes no worries, as long as you're ok about breaking MySQL support - since the default setting interprets "" as a literal and I imagine many implementations will use the defaults. I'll go ahead on this basis.
Thoughts on a long term solution:
I think it's generally preferable that AST nodes have a sensible type after parsing. Identifiers are integral to many aspects of language tooling (highlighting, scoping, completion, refactoring), so it might be problematic, or inefficient at least, for services to have to re-interpret string literals as identifiers.
I've solved a similar language problem before (using antlr4 though). This language had modes that determined whether a code line required a leading '#' and the mode could be set externally (i.e. an application config file) or within the code (#SET HashOff | HashOn).
My solution was to use rule guards and reference a custom boolean property on the parser. This property could be set either via a custom parser constructor, or via actions within a rule. The language could then start in either mode based on the constructor, and then dynamically toggle the relevant parser rules on/off according to the mode within a single source file when HashOff | HashOn was parsed.
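A stripped-down version of that idea, written as a hand-rolled line parser in TypeScript rather than ANTLR4, might look like the sketch below. The class `LineParser` and its behaviour are illustrative assumptions: the real language and its rule guards are not shown in this thread, and the sketch assumes the `#SET` directive is recognised in both modes.

```typescript
// Hypothetical sketch of a mode flag that guards which rule applies.
class LineParser {
  private hashRequired: boolean;

  // The initial mode comes from outside (e.g. an application config file).
  constructor(hashRequired: boolean) {
    this.hashRequired = hashRequired;
  }

  parse(source: string): string[] {
    const statements: string[] = [];
    for (const raw of source.split("\n")) {
      const line = raw.trim();
      if (line === "") {
        continue;
      }
      // "Actions": the directive toggles the flag while parsing.
      if (line === "#SET HashOff") {
        this.hashRequired = false;
        continue;
      }
      if (line === "#SET HashOn") {
        this.hashRequired = true;
        continue;
      }
      // "Rule guard": the flag decides which alternative is allowed.
      if (this.hashRequired) {
        if (line[0] !== "#") {
          throw new Error("Expected leading '#': " + line);
        }
        statements.push(line.slice(1).trim());
      } else {
        statements.push(line);
      }
    }
    return statements;
  }
}
```

The analogy for SQL would be a flag such as "double quotes are identifiers" that starts from the dialect defaults and flips when a statement changing that setting is parsed.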
I don't know Langium well enough yet to know whether these features are supported - can custom properties be added to the parser and set via constructor/object creation (being JS, I assume yes), and can rule guards reference parser properties?
@cam-m Langium has no properties or macros; those would make this possible too, though in an ugly manner. Langium does have conditional parameters on rule alternatives, but AFAIK there is no way to feed them from outside the grammar. I passed this on to my team as an idea, and I might also raise it in the Langium community later. Your PR looks good so far.
Merged & published
Hi @Lotes,
Thanks for your part in creating this project - I'm considering including it in a project I work on - the current feature set is already very compelling.
I'm keen to add SQL Server dialect support and thought it best to raise it here first, in case contributing such a feature to this project is possible/preferable. I've been able to easily support SQL Server's data types, but it seems syntax-specific features (e.g. square brackets for quoting keywords/spaces in table/column names) will require a separate / extended parser.
I've not tried the Langium lib itself yet, but I have some experience building editor plugins using ANTLR4. Any pointers would be much appreciated.