sundarshankar89 opened 3 months ago
Alternatives for the representation:

- a `comment` field added to `TreeNode`
- `CommentNode` and `Comment` case classes

"structuring comments" should {
"option a: be part of LogicalPlan node as a field" in {
// This is the _most straightforward_ way to structure comments
ir.Project(namedTable("t"), Seq(ir.Id("c1")), comment = Some("This is a comment")) generates
"""/* This is a comment */
|SELECT c1 FROM t""".stripMargin
}
case class CommentNode(child: ir.LogicalPlan, comment: String) extends ir.UnaryNode {
override def output: Seq[Attribute] = child.output
}
case class ExpressionComment(child: ir.Expression, comment: String) extends ir.UnaryExpression
"option b: introduce a separate node" in {
// This structure doesn't change the framework's behavior
ir.CommentNode(
ir.Project(
namedTable("t"), Seq(ir.Id("c1"))),
"This is a comment") generates
"""/* This is a comment */
|SELECT c1 FROM t""".stripMargin
}
"comments on expressions" should {
/**
-- this is a top-level line
SELECT
c1 -- this is a comment
FROM t
*/
"option a: adding a comment to a column" in {
ir.Project(namedTable("t"), Seq(
ir.Id("c1", comment = Some("this is a comment")),
), comment = Some("This is a comment"))
}
"option b: introducing a separate node for the comment" in {
ir.CommentNode(
ir.Project(
namedTable("t"), Seq(
ExpressionComment(
ir.Id("c1"),
"this is a comment")
)),
"This is a comment")
}
}
}
Python corner cases (note that a trailing comment after a line-continuation backslash, as in the desired output below, is not even valid Python, so placement needs special handling):
// SQL:
// -- this is a top-level line
// SELECT
// c1 -- this is a comment
// FROM t
// WHERE foo /* HINT: x > z */ > bar;
// ;
//
// ---->
// Python:
//
// import pyspark.sql.functions as F
// # this is a top-level line
// spark.table("t") \
// .select(
// F.col("c1").alias("c1") # this is a comment
// ) \ # this is a trailing comment
// .filter(
// F.col("foo") > F.col("bar") # HINT: x > z
// ) # this is a trailing comment
and here are example IR optimizer rules for transforming `WHERE foo /* HINT: x > z */ > bar;`:
override def apply(expr: ir.Expression): ir.Expression = expr transformUp {
// option A:
case ir.And(left, right, Some(text)) => py.Comment(methodOf(left, "and", Seq(right)), text)
// and we'll probably have to change the invariants for ~200 nodes
// option B:
case ir.And(Comment(left, text), right) => py.Comment(methodOf(left, "and", Seq(right)), text)
case ir.And(left, Comment(right, text)) => py.Comment(methodOf(left, "and", Seq(right)), text)
// option C:
case x @ ir.And(left, right) if x.hasComment => py.Comment(methodOf(left, "and", Seq(right)), x.comment.get)
// and we'll probably have to change the invariants for ~200 nodes
}
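To make option B concrete, here is a minimal sketch (assuming the `ExpressionComment` wrapper from above; the `Commented` extractor is hypothetical) of how the left/right case explosion could be contained:

```scala
// Hypothetical extractor: always matches, unwrapping an optional comment.
object Commented {
  def unapply(expr: ir.Expression): Option[(ir.Expression, Option[String])] = expr match {
    case ExpressionComment(child, text) => Some((child, Some(text)))
    case other => Some((other, None))
  }
}

// A single rule then covers both commented operands:
// case ir.And(Commented(left, lc), Commented(right, rc)) => ...
```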
flowchart TD
antlr --> lib
lib --> commonlex.g4
lib --> grammars
grammars --> snowflake
snowflake --> SnowflakeLexer.g4
snowflake --> SnowflakeParser.g4
snowflake --> basesnowflake.g4
SnowflakeLexer.g4 --> commonlex.g4
SnowflakeLexer.g4 --> basesnowflake.g4
grammars --> tsql
tsql --> TSqlLexer.g4
tsql --> TSqlParser.g4
TSqlLexer.g4 --> commonlex.g4
TSqlLexer.g4 --> basetsql.g4
tsql --> basetsql.g4
Following a number of tentative PRs and discussions, I am proposing to discuss a lightweight spec here (as opposed to a full solution architecture spec, which would be overkill for the discussion we need). We have a number of technical constraints: we use `catalyst` as our IR, and catalyst has no data structure for attaching parsing information.

Per my various experiments (that work), the following topics need to be discussed. I'll get into details in separate comments.

1 - Origin is not fit for purpose
2 - A parsing location needs to be attached during building of the IR nodes
3 - Our simple parsing pipeline does not allow a second pass
4 - Attaching comments to nodes can be done in various ways
1 - Origin is not fit for purpose
The definition of Origin is as follows:
case class Origin(
  line: Option[Int] = None,
  startPosition: Option[Int] = None,
  startIndex: Option[Int] = None,
  stopIndex: Option[Int] = None,
  sqlText: Option[String] = None,
  objectType: Option[String] = None,
  objectName: Option[String] = None)
This structure is not accurate enough for our needs. Moreover, the `origin` field in `TreeNode` is initialized via a singleton - no further comment - which makes it very flaky. Due to backwards compatibility concerns, we can't refactor `Origin`.
Instead, I propose to introduce the following:
case class ParsedLocation(line: Int, column: Int, tokenIndex: Int)

object ParsedLocation {
  val empty: ParsedLocation = ParsedLocation(-1, -1, -1)
}

case class ParsedLocationRange(start: ParsedLocation, end: ParsedLocation) {
  def isAfterLine(line: Int): Boolean = start.line > line
}

object ParsedLocationRange {
  val empty: ParsedLocationRange = ParsedLocationRange(ParsedLocation.empty, ParsedLocation.empty)
}
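For reference, a sketch of how such a range could be captured from ANTLR while building a node (`locationOf` is a hypothetical helper, not part of the proposal):

```scala
import org.antlr.v4.runtime.{ParserRuleContext, Token}

object Locations {
  private def loc(t: Token): ParsedLocation =
    ParsedLocation(t.getLine, t.getCharPositionInLine, t.getTokenIndex)

  // Derives the range covered by a parser rule; falls back to the start token
  // when ANTLR has not set a stop token (e.g. on error paths).
  def locationOf(ctx: ParserRuleContext): ParsedLocationRange =
    if (ctx.start == null) ParsedLocationRange.empty
    else ParsedLocationRange(loc(ctx.start), loc(Option(ctx.stop).getOrElse(ctx.start)))
}
```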
and a `locationRange` field in the `TreeNode` class.
StartLine is the human interpretation of how many \n there are; StartPosition is the position within that line; StartIndex is the offset from the very beginning of the text; the End versions are the same.

I think Origin is fit for purpose, information-wise, for resolving which IR node is deemed closest to a collection of comments. Though we can extract comments from the token stream, I think we could just build a collection of them directly in the common lexer.
We can talk about it next week.
2 - A parsing location needs to be attached during building of the IR nodes
The question arises as to how we attach such a location to the node. There are various options:

1. Attach it to a `val` field at construction time. This is the perfect time during parsing. It significantly affects existing catalyst code because it requires introducing an optional `parseLocation` parameter to the constructor(s) of any class that inherits from `TreeNode`. Using it does not require more code than would be required to populate a singleton in a deterministic way. It does not affect performance. Since the field is defaulted, it does not guarantee that it's properly populated. This was prototyped (using Origin) in #1189.

2. Attach it to a `val` field via a `withParsedLocation(newLocation)` transform that returns a copy of the `TreeNode` with the new `parseLocation`. It does not significantly affect existing catalyst code. This is elegant but requires more code than 1, and will affect performance (copying is not free). It does not guarantee that the `parseLocation` field is properly populated.

3. Make `parseLocation` a `var` field, and attach it via a `withParsedLocation(newLocation)` method that returns the same `TreeNode` instance with the updated `parseLocation`. This requires more code than 1, and the same amount as 2. It goes against the FP style of catalyst, but is more performant than 2. It does not guarantee that the `parseLocation` field is properly populated.

4. Automate 3 using helpers that ensure that the `parseLocation` field is properly populated. This is prototyped in #1203.

I prefer 1 because it augments catalyst with a crucial capability. The lack of guarantees can be mitigated by a simple test that traverses the tree and checks that each node has a non-default `parseLocation` (a minimal sketch follows).
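A minimal sketch of option 1 and the mitigating test (simplified: the real catalyst `TreeNode` is F-bounded, and this is not the #1189 code):

```scala
// Option 1: the location is a defaulted constructor parameter, set exactly
// once, at construction time during parsing.
abstract class TreeNode(val locationRange: ParsedLocationRange = ParsedLocationRange.empty) {
  def children: Seq[TreeNode]
}

object LocationInvariants {
  // Test helper: traverses the tree and fails on any node left with the
  // default location, mitigating the "no guarantee" drawback of option 1.
  def assertLocated(node: TreeNode): Unit = {
    assert(node.locationRange != ParsedLocationRange.empty, s"missing location on: $node")
    node.children.foreach(assertLocated)
  }
}
```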
Maybe something like this is more practical to implement:

- Gather a collection of all the comments directly within lexcommon.g4; they can then be skipped as tokens.
- We have a CommentManager or CommentProcessor that is given the collection to manage.
- When any IR node is created, we use the closest ctx or token to create a map of IR nodes to Origins.
- If the optimizer creates new nodes and replaces others, then it needs to update the node map (but we are going to have to do that anyway, I think): `nodeMapper.replace(a, b)`, `nodeMapper.delete(a, b)`, ...
To process comments:
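Here is a hedged sketch of what such a processor could look like (all names hypothetical; this is not prototyped code):

```scala
// One comment as collected by the common lexer before being skipped.
case class CollectedComment(text: String, line: Int, startIndex: Int)

class CommentProcessor(comments: Seq[CollectedComment]) {
  // Map of IR nodes to the source range they were built from.
  private var nodeLocations = Map.empty[ir.LogicalPlan, ParsedLocationRange]

  def register(node: ir.LogicalPlan, range: ParsedLocationRange): Unit =
    nodeLocations += (node -> range)

  // Called by optimizer rules that replace nodes, to keep the map consistent.
  def replace(oldNode: ir.LogicalPlan, newNode: ir.LogicalPlan): Unit =
    nodeLocations.get(oldNode).foreach { range =>
      nodeLocations = nodeLocations - oldNode + (newNode -> range)
    }

  // Comments appearing on lines strictly before the node's first line.
  def leadingCommentsFor(node: ir.LogicalPlan): Seq[CollectedComment] =
    nodeLocations.get(node).toSeq.flatMap { range =>
      comments.filter(_.line < range.start.line)
    }
}
```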
Of course, this is not very Catalyst-like. But we can adorn the Ir with Origin as per Valentin's idea as well, though I think at some point we need to resolve which comment is for which node anyway. I think the above is very simple at least.
3 - Our simple parsing pipeline does not allow a second pass
Context: grammars focus on meaningful content. Whitespace and comments are parsed but sent to hidden channels, thus skipped during matching of grammar rules. Without this, since comments could be literally anywhere in the source code, a rule like the following:
CREATE (OR REPLACE)? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) (IF NOT EXISTS)? dotIdentifier CLONE dotIdentifier
would have to be written as follows:
COMMENT? CREATE COMMENT? (OR COMMENT? REPLACE COMMENT?)? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) COMMENT? (IF COMMENT? NOT COMMENT? EXISTS COMMENT?)? dotIdentifier COMMENT? CLONE COMMENT? dotIdentifier COMMENT?
Using ANTLR, you can only have one input stream, and that stream can only listen to 1 channel. Thus, you cannot simultaneously read from the `main` and the `comments` channel. It needs to be done in 2 passes.

Our current parsing pipeline (in `PlanParser`) assumes that all can be done in 1 ANTLR pass; it chains the input stream, the lexer, the token stream, the parser and the plan generator in a way that makes it impossible to access the token stream when building the plan (or once it's built):
def parse(input: Parsing): Transformation[ParserRuleContext] = {
  val inputString = CharStreams.fromString(input.source)
  val lexer = createLexer(inputString)
  val tokenStream = new CommonTokenStream(lexer)
  val parser = createParser(tokenStream)
  addErrorStrategy(parser)
  val errListener = new ProductionErrorCollector(input.source, input.filename)
  parser.removeErrorListeners()
  parser.addErrorListener(errListener)
  update { case _ =>
    input
  }.flatMap { _ =>
    val tree = createTree(parser)
    if (errListener.errorCount > 0) {
      lift(PartialResult(tree, ParsingErrors(errListener.errors)))
    } else {
      lift(OkResult(tree))
    }
  }
}
Beyond the fact that `createTree` knows nothing about the `tokenStream` (which must be addressed), the above assumes 2 things:

- ANTLR will always be used for parsing (which contradicts our Multiplexer strategy)
- there will never be a need for a custom pipeline (such as requiring a `TokenStreamRewriter`)

PRs #1200 and #1201 provide a simpler yet more flexible approach: a concrete `PlanParser` is expected to produce a `LogicalPlan` from a `Parsing` input, regardless of the underlying parsing technology being used. This somewhat duplicates the pipeline logic if parsers use ANTLR, but that's a very small cost compared to the flexibility being provided.

We need to opine on this.
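For concreteness, the shape this approach gives (a sketch of the idea, not the actual code from #1200/#1201):

```scala
// A concrete PlanParser only promises a LogicalPlan; how it gets there is
// private, so an ANTLR-based implementation is free to run a second pass over
// the token stream, and a non-ANTLR implementation can do something else entirely.
trait PlanParser {
  def parse(input: Parsing): Transformation[ir.LogicalPlan]
}

class SnowflakePlanParser extends PlanParser {
  override def parse(input: Parsing): Transformation[ir.LogicalPlan] = {
    // Owns its lexer, token stream, parser and plan generator; can keep the
    // token stream around to scan the comments channel after building the plan.
    ???
  }
}
```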
> 3 - Our simple parsing pipeline does not allow a second pass
>
> Context: grammars focus on meaningful content. Whitespace and comments are parsed but sent to hidden channels, thus skipped during matching of grammar rules. Without this, since comments could be literally anywhere in the source code, a rule like the following:
>
> CREATE (OR REPLACE)? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) (IF NOT EXISTS)? dotIdentifier CLONE dotIdentifier
>
> would have to be written as follows:
>
> COMMENT? CREATE COMMENT? (OR COMMENT? REPLACE COMMENT?)? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) COMMENT? (IF COMMENT? NOT COMMENT? EXISTS COMMENT?)? dotIdentifier COMMENT? CLONE COMMENT? dotIdentifier COMMENT?
But I am saying we pick them up in the common lexer then skip them. They won't even be in the token stream.
> Using ANTLR, you can only have one input stream, and that stream can only listen to 1 channel. Thus, you cannot simultaneously read from the `main` and the `comments` channel. It needs to be done in 2 passes.
But now we do not need two passes. And even if we did, we can modify the generic plan parser to take care of that in a similar way to what generate does.
> Our current parsing pipeline (in `PlanParser`) assumes that all can be done in 1 ANTLR pass, it chains the input stream, the lexer, the token stream, the parser and the plan generator in a way that makes it impossible to access the token stream when building the plan (or once it's built):

def parse(input: Parsing, commentCollector: CommentProcessor /* more likely add to Parsing() */): Transformation[ParserRuleContext] = {
  val inputString = CharStreams.fromString(input.source)
  val lexer = createLexer(inputString)
  val tokenStream = new CommonTokenStream(lexer)
  val parser = createParser(tokenStream)
  addErrorStrategy(parser)
  val errListener = new ProductionErrorCollector(input.source, input.filename)
  parser.removeErrorListeners()
  parser.addErrorListener(errListener)
  update { case _ =>
    input
  }.flatMap { _ =>
    val tree = createTree(parser)
    // Here we load the comment collection created by the lexer and put it somewhere
    commentCollector.load(lexer.comments)
    if (errListener.errorCount > 0) {
      lift(PartialResult(tree, ParsingErrors(errListener.errors)))
    } else {
      lift(OkResult(tree))
    }
  }
}
> Beyond the fact that `createTree` knows nothing about the `tokenStream` (which must be addressed), the above assumes 2 things:
We don't/should not need the token stream at that point. IR creation should not rely on the token stream, but use the tokens it needs from the ParserRuleContext. However, the Result can carry information from previous phases should we need to do that, and we can find the token stream should we need it.
> - ANTLR will always be used for parsing (which contradicts our Multiplexer strategy)
No - that is not where the replacement converter will go; that is far too low level. There will be some sort of program-level interface defined (say yml that says 'here is the input location', 'here is the output location'), and a different converter would probably be an entirely different process/engine.
> - there will never be a need for a custom pipeline (such as requiring a `TokenStreamRewriter`)
We can insert more phases in PlanParser.
> PRs #1200 and #1201 provide a simpler yet more flexible approach: a concrete `PlanParser` is expected to produce a `LogicalPlan` from a `Parsing` input, regardless of the underlying parsing technology being used. This somewhat duplicates the pipeline logic if parsers use ANTLR, but that's a very small cost compared to the flexibility being provided.
I do not agree with this. We are not solving a problem that we have right now and at the moment I cannot see why we would need something like a token stream rewriter.
We need to opine on this.
> We are not solving a problem that we have right now
We do have a problem to solve. The fact that the provided solution also solves a future problem is a side-effect, but things can't stay as is.
4 - Attaching comments to nodes can be done in various ways
Once comments are discovered, and the node to which they should be attached is located, we need to technically attach them to that node. They need to be attached before (for example, line comments that appear before a statement) or after (for example, line comments that appear after some code on the same line).
There are at least 3 ways comments could be attached to nodes:

A - using `preComments` and `postComments` fields. This is straightforward, and experimented with in PR #1203. It could be generalized by moving these fields to `LogicalPlan` or even `TreeNode`.

B - adding comments as `children`. This is tricky for 2 reasons: `LeafNode`s are expected to NOT have `children`, and children are below, not before or after, so we can't just add them as children.

C - wrapping nodes in `Comment` nodes. The `Comment` node would contain the commented node. This seems handy at first glance because it does not require changing the IR model. But it pushes complexity to actual nodes and transformers, which now need to be aware of comments when using or manipulating the tree. For example, let's consider the following existing code:
case class Project(input: LogicalPlan, columns: Seq[Expression]) extends UnaryNode {
  override def child: LogicalPlan = input
  // TODO: add resolver for Star
  override def output: Seq[Attribute] = expressions.map {
    case a: Attribute => a
    case Alias(child, Id(name, _)) => AttributeReference(name, child.dataType)
    case Id(name, _) => AttributeReference(name, UnresolvedType)
    case Column(_, Id(name, _)) => AttributeReference(name, UnresolvedType)
    case expr: Expression => throw new UnsupportedOperationException(s"cannot convert to attribute: $expr")
  }
}
Using `Comment` nodes, the above code would need to change significantly, since every `Expression` could now be a `Comment` (that wraps an `Expression`)... This will require massive changes, along the lines sketched below.
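A sketch of the kind of change option C forces (reusing the `ExpressionComment` wrapper from an earlier comment; `stripComment` is hypothetical):

```scala
case class Project(input: LogicalPlan, columns: Seq[Expression]) extends UnaryNode {
  override def child: LogicalPlan = input

  // Option C forces comment-awareness into every pattern match.
  private def stripComment(e: Expression): Expression = e match {
    case ExpressionComment(inner, _) => stripComment(inner)
    case other => other
  }

  override def output: Seq[Attribute] = expressions.map(stripComment).map {
    case a: Attribute => a
    case Alias(child, Id(name, _)) => AttributeReference(name, child.dataType)
    case Id(name, _) => AttributeReference(name, UnresolvedType)
    case Column(_, Id(name, _)) => AttributeReference(name, UnresolvedType)
    case expr: Expression => throw new UnsupportedOperationException(s"cannot convert to attribute: $expr")
  }
}
```

And the same treatment would be needed in every transformer that pattern-matches on expressions or plans.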
Whatever we decide, nodes are supposedly immutable, so this is not a trivial task. We can either accept that it's OK to have mutable nodes during parsing, or accept that we need to run a full transformation on the entire tree, which affects performance.
Overall we are extremely constrained by the software architecture decision to use Catalyst for our IR, and by the informal goal of contributing back to it. Catalyst is designed for optimizing execution plans on distributed computing platforms. That is its sole purpose. It addresses a problem we don't have: we only do transpilation on a desktop. To help achieve its original purpose, it enforces the FP programming model, which is great for running distributed software on a platform but does not fit well with ANTLR, which is deeply OOP. Moreover, we've forked the source code, but we're only allowed to evolve it marginally, because we have the ambition to contribute back. This impedance mismatch makes it very difficult to perform parsing tasks, at least using ANTLR, and, purely to satisfy the above, forces us either to produce massive and convoluted code (see #1203) or to avoid using standard ANTLR mechanisms (see https://github.com/databrickslabs/remorph/issues/869#issuecomment-2479371073).
We could at a minimum consider alternative options:
Is there an existing issue for this?
Category of feature request
Transpile
Problem statement
Currently the ANTLR parser moves all comments onto the hidden channel. We need to map them to concrete IR and introduce them at the appropriate places during code generation.
Proposed Solution
Resolving comments back to output in a sensible manner is not a trivial task and first we must establish some principles in the source language and target language.
Goal - Generate comments in the translated code and preserve their context wherever possible.
In our case the source and target languages can be considered equivalent, as the comments have common syntax, so there are no complications caused by the need to resolve comment structure - for instance, if the source allows multi-line in-syntax comments (`/* xxx */`) but the target only allows line comments (`-- xxx`). Additionally, as we will pretty print the resulting output, we can ignore trying to preserve the number of spaces and so on that precede or follow a particular token. So long as the correct comment style is placed in the correct place, the formatter will make it all look good.
Comment style
Using the Databricks documentation for some examples, and adding our own, we can create a sample input that challenges our system such that if the test input is translated correctly, then all input should be translated correctly. The starting point for such a test is in the Additional context section. Note that this is a starting point and has not been thought through exhaustively.
If we study the test we can establish some principles around comment association, knowing that, as we are not actually reading the comments, we may guess the associativity incorrectly, but by using these principles, we should get the intended output anyway.
The principles are therefore:
These principles will need to be tested in practice - these kinds of things always have strange edge-cases, where the source author thought they were being mighty clever.
Procedure
We have talked about tokens, but in fact our IR does not preserve tokens, so we need to somehow match our IR nodes to comments. We could, when building the IR, map each node we produce with a lookup table. However, as all nodes derive from TreeNode (or should), we can in fact adorn the nodes with the Origin information, although Origin may need to be extended to include a stop line and offset - at first glance it may or may not already have enough information to create a span. This means that any IR node we produce must be adorned with the origin information for a token or start/stop token pair, which is slog work but easily achieved. (NB: don't couple the TreeNodes with ANTLR.)
Comments are stored out-of-band by the lexers, on the HIDDEN channel, which is channel 1. Hence they are not consumed by the parser, but are easily scanned by anything else with a reference to the token stream, by resetting it after parsing and then asking for all the tokens on the HIDDEN channel. While traversing this HIDDEN channel (or maybe a custom one for comments), we can build a comment map for each input line (note there may be more than one comment on a line); a sketch of that scan follows.
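Assuming ANTLR's standard `BufferedTokenStream` API (the map shape is illustrative), that scan could look like:

```scala
import org.antlr.v4.runtime.{CommonTokenStream, Token}

object CommentScanner {
  // Builds a line -> comments map from the HIDDEN channel after parsing.
  def commentsByLine(tokens: CommonTokenStream): Map[Int, Seq[Token]] = {
    tokens.fill() // ensure the whole input has been tokenized
    val all = (0 until tokens.size).map(i => tokens.get(i))
    all.filter(_.getChannel == Token.HIDDEN_CHANNEL).groupBy(_.getLine)
  }
}
```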
The approaches here are then either to walk the tree with the comment map and assign the comments as preceding comments to the nearest node - the codegen then just produces any comments before it produces its own text - or to leave the map in place and consult it during generation (see below). For the first approach we would have something like:
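A hedged guess at what that could look like (all names hypothetical):

```scala
object CommentedCodegen {
  // Comments assigned to a node by the tree walk.
  case class AssignedComments(leading: Seq[String] = Seq.empty, trailing: Seq[String] = Seq.empty)

  // Codegen just prepends/appends; generateSql stands in for the real generator.
  def emit(node: LogicalPlan,
           assigned: Map[LogicalPlan, AssignedComments],
           generateSql: LogicalPlan => String): String = {
    val cs = assigned.getOrElse(node, AssignedComments())
    cs.leading.map(_ + "\n").mkString + generateSql(node) + cs.trailing.map(" " + _).mkString
  }
}
```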
If we use the map and do not adorn the nodes, then at code generation every production needs something like:
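Again a hedged guess at the elided snippet (all names hypothetical):

```scala
trait GeneratorContext {
  // Returns and removes the comments preceding this node, so that each
  // comment is used exactly once across producers.
  def takeCommentsBefore(node: LogicalPlan): Seq[String]
  def generate(node: LogicalPlan): String
  def generate(e: Expression): String
}

object ProjectProducer {
  // Every production first drains the comments that precede its node.
  def produce(p: Project, ctx: GeneratorContext): String = {
    val leading = ctx.takeCommentsBefore(p).map(_ + "\n").mkString
    leading + s"SELECT ${p.columns.map(c => ctx.generate(c)).mkString(", ")} FROM ${ctx.generate(p.input)}"
  }
}
```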
Assuming we used the generator context to store the comment map. Though of course the generate method itself could also do this, rather than each producer.
We should be careful to use any comment only once. So in the above example, maybe leading comments are left to the first element of the batch. The pattern should fall out from there. Again, note that this test is about the approach and may not cover all cases.
Adorning the TreeNodes seems the better approach.
Additional Context
Starting point for comment preservation test
Note that most dialects allow nested comments; our lexers may need to be adjusted.
Non-functional requirements: changes to the `com.databricks.labs.remorph.intermediate` package should be avoided if possible, and if we do need to make a change to any structures in this package, we have to provide different alternatives.