databrickslabs / remorph

Cross-compiler and Data Reconciler into Databricks Lakehouse

[FEATURE]: General processor for keywords #454

Closed jimidle closed 4 months ago

jimidle commented 4 months ago

Category of feature request

Transpile

Problem statement

The ANTLR grammars are full of keywords that are explicitly specified everywhere they can be valid. This results in huge grammars and DFAs, and makes the IR construction way more complicated than it needs to be.

We need a general lookup mechanism for keywords, layered as Common -> TSQL, Common -> Snowflake, etc. We could also later adorn the table with human-readable forms of the keywords and text to use in error messages, but the initial requirement is to validate things like options. The function builder code is a more complex but similar process.

We currently have many rules that look like this:


xmlIndexOption
    : PAD_INDEX EQ onOff
    | FILLFACTOR EQ INT
    | SORT_IN_TEMPDB EQ onOff
    | IGNORE_DUP_KEY EQ onOff
    | DROP_EXISTING EQ onOff
    | ONLINE EQ (ON (LPAREN lowPriorityLockWait RPAREN)? | OFF)
    | ALLOW_ROW_LOCKS EQ onOff
    | ALLOW_PAGE_LOCKS EQ onOff
    | MAXDOP EQ maxDegreeOfParallelism = INT
    | XML_COMPRESSION EQ onOff
    ;

which means that the AST builders have to jump through hoops to find out what option is what, or label each alt (which generates extra contexts and memory copies for every alt).

override def visitXmlIndexOption(ctx...) ... {
  if (ctx.PAD_INDEX() != null) {}
  if (ctx.FILLFACTOR() != null) {}
  ...
}

Proposed Solution

At its simplest, we have a table something like this:

Text,         Type,       Value
"PAD_INDEX",  XMLOPTION
"ON",         BOOLEAN,    true
"VALUES",     KEYWORD
...

Though it could also hold its IR definition or anything else that proves useful.

There would be a base/common set of keywords, then a TSQL set that inherits from common, a Snowflake set that inherits from common, and so on.
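A minimal sketch of what such a layered table could look like in Scala. Every name here (`KeywordTables`, `KeywordType`, `Keyword`, `KeywordTable`) is hypothetical and not taken from the remorph code base, and the `OFF` row is an assumed companion to the `ON` row in the table above. A dialect table simply falls back to its parent when it has no entry of its own.

```scala
import java.util.Locale

// Hypothetical sketch: none of these names come from remorph itself.
object KeywordTables {
  sealed trait KeywordType
  case object XmlOption extends KeywordType
  case object BooleanLiteral extends KeywordType
  case object PlainKeyword extends KeywordType

  // One table row: the keyword text, its type, and an optional value.
  final case class Keyword(text: String, kind: KeywordType, value: Option[Any] = None)

  // A keyword table that defers to its parent for anything it does not define.
  class KeywordTable(entries: Seq[Keyword], parent: Option[KeywordTable] = None) {
    private val byText: Map[String, Keyword] =
      entries.map(k => k.text.toUpperCase(Locale.ROOT) -> k).toMap

    // Case-insensitive lookup: local entries first, then the parent chain.
    def find(text: String): Option[Keyword] =
      byText.get(text.toUpperCase(Locale.ROOT)).orElse(parent.flatMap(_.find(text)))
  }

  // Base/common keywords shared by all dialects.
  val common = new KeywordTable(Seq(
    Keyword("ON", BooleanLiteral, Some(true)),
    Keyword("OFF", BooleanLiteral, Some(false)), // assumed counterpart of ON
    Keyword("VALUES", PlainKeyword)
  ))

  // Dialect tables inherit from common and add their own entries.
  val tsql = new KeywordTable(Seq(Keyword("PAD_INDEX", XmlOption)), Some(common))
  val snowflake = new KeywordTable(Seq.empty, Some(common))
}
```

With this shape, `KeywordTables.tsql.find("pad_index")` resolves locally, `KeywordTables.tsql.find("ON")` resolves through the common table, and `KeywordTables.snowflake.find("PAD_INDEX")` returns `None` because that keyword is only registered for TSQL.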

This would at least reduce option sets to:

 xmlIndexOption
   : ID EQ expression
   ;

But we can probably merge all options into:

optionSet: option (COMMA option)* ;
option: id EQ expression;

Though note that there are some irregularly constructed options that we can still specify as patterns, but which are not always as straightforward as X = Y.

Our Scala code then has logic roughly like this:

xmlOption = optionBuilder.find(ctx.id().getText)
if (xmlOption.type() == XMLOPTION) {
 value = buildOptionValue(ctx.expression.accept(this))
} else {
 ir.UnresolvedOption()
}
...
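To make the lookup-then-build flow concrete without depending on ANTLR, here is a self-contained sketch that treats a parsed option as a plain `(name, raw value)` pair. The names `OptionResolution`, `OptionKind`, `BoolOpt`, `IntOpt`, and `resolve` are illustrative assumptions, and `UnresolvedOption` here is a local stand-in rather than the real `ir.UnresolvedOption`. The kinds assigned to each option follow the `xmlIndexOption` rule quoted earlier.

```scala
import java.util.Locale
import scala.util.Try

// Hypothetical sketch: stand-in types, not the remorph IR.
object OptionResolution {
  sealed trait OptionKind
  case object BooleanOption extends OptionKind
  case object IntOption extends OptionKind

  sealed trait OptionIR
  final case class BoolOpt(name: String, value: Boolean) extends OptionIR
  final case class IntOpt(name: String, value: Int) extends OptionIR
  final case class UnresolvedOption(name: String, raw: String) extends OptionIR

  // The lookup table: which options exist and what kind of value each expects.
  private val known: Map[String, OptionKind] = Map(
    "PAD_INDEX" -> BooleanOption,
    "FILLFACTOR" -> IntOption,
    "MAXDOP" -> IntOption
  )

  // Look the option name up, then build a typed node if the value fits its kind.
  // Anything unknown or ill-typed falls through to UnresolvedOption.
  def resolve(name: String, raw: String): OptionIR = {
    val key = name.toUpperCase(Locale.ROOT)
    (known.get(key), raw.toUpperCase(Locale.ROOT)) match {
      case (Some(BooleanOption), "ON")  => BoolOpt(key, value = true)
      case (Some(BooleanOption), "OFF") => BoolOpt(key, value = false)
      case (Some(IntOption), v) =>
        Try(v.toInt).toOption.map(IntOpt(key, _)).getOrElse(UnresolvedOption(name, raw))
      case _ => UnresolvedOption(name, raw)
    }
  }
}
```

The visitor no longer needs one branch per keyword token: adding a new option means adding a row to the table, and validation failures surface as unresolved nodes instead of parse errors.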

Additional Context

No response

jimidle commented 4 months ago

Keywords for options are now handled with a parsing rule that allows the visitor to look for options of certain kinds, such as boolean, string, and so on.