[FEATURE]: IR to Listen and Generate Code Comments

Is there an existing issue for this?

[X] I have searched the existing issues

Category of feature request

Transpile

Problem statement

Currently the ANTLR Parser moves all comments into Hidden Channel. Need to map them to concrete IR and introduce them at appropriate place during code generation.

Proposed Solution

Resolving comments back to output in a sensible manner is not a trivial task and first we must establish some principles in the source language and target language.

Goal - Generate comments in the translated code and preserve their context wherever possible.

In our case the source and target languages can be considered equivalent as the comments have common syntax, so there are no complications caused by the need to resolve comment structure; for instance if the source allows multi-line in-syntax comments (/* xxx */), but the target only allows line comments (-- xxx).

Additionally, as we will pretty print the resulting output, we can ignore trying to preserve the number of spaces and so on that precede or follow a particular token. So long as the correct comment style is placed in the correct place, then the formatter will make it all look good.

Comment style

Using the Databricks documentation for some examples, and adding our own, we can create a sample input that challenges our system such that if the test input is translated correctly, then all input should be translated correctly. The starting point for such a test is in the Additional context section. Note that this is a starting point and has not been thought through exhaustively.

If we study the test we can establish some principles around comment association, knowing that as we are not actually reading the comments, we make guess the associativity incorrectly, but by using these principles, we should get the intended output anyway.

The principles are therefore:

If a comment follows some token on the same line, then it likely belongs to that token (though we know that people's intent might be the opposite).
A comment that follows some token on the next line is likely associated with the following token, even if that token is EOF, therefore -
Any comment preceding a token but not on the same line, belongs to that token, even if it is EOF, except
Multiline comments ('/ /') may be intended to address the previous token but we will choose to associate them with the following token.
Single line comments should always be rendered with a trailing new line
Multi-line comments should be rendered with a trailing new line if they do not end on the same line as the following token with which they are associated.
Ambiguity is always solved with the nearest following token (we could modify this, if say the comment starts on the same line as the previous token and does not end on the same line as the following token.

These principles will need to be tested in practice - these kinds of things always have strange edge-cases, where the source author thought they were being mighty clever.

Procedure

We have talked about tokens, but in fact our IR does not preserve tokens, so we need to somehow match our Ir nodes to comments. We could, when building the Ir, map each node we produce with a lookup table. However, as all nodes derive from TreeNode (or should), we can in fact adorn the nodes with the Origin information. Although, the Origin may need to be extended to include stop line and offset - at first glance it may or may not already have enough information to create a span. However, this means that any Ir node we produce must be adorned with the origin information for a token or start stop token, which is slog work but easily achieved. (NB: Don't couple the TreeNodes with ANTLR).

Comments are stored out-of-band by the lexers, on the HIDDEN channel, which is channel 1. Hence they are not consumed by the parser, but are easily scanned by anything else with a reference to the token stream by resetting it after parsing, then asking for all the tokens on the HIDDEN channel. While traversing this HIDDEN channel (or maybe a custom one for comments), we can build a comment map for each input line (note there may be more than one comment on a line.

The approaches here are then to either walk the tree with the comment map and assign the comments as preceding comments to the nearest node. The codegen then just produces any comments before it produces it's own text. We would have:

generate(ctx, node) : String = 
 s"${ctx.generateComments(node)}${whatever}"

If we use the map and not adorn the nodes then at code generation, every production needs something like:

private def batch(ctx: GeneratorContext, b: Batch): String =
s"${ctx.comments.findLeading(b)}${generate(ctx, query)}${ctx.comments.findTrailing(b)}"

Assuming we used the generator context to store the comment map. Though of course the generate method itself could also do this, rather then each producer.

We should be careful to use any comment only once. So in the above example, maybe leading comments are left to the first element of batch. The pattern should fall out from there. Again, note that this test is about the approach and my not cover all cases.

Adorning the TreeNodes seems the better approach.

Additional Context

Starting point for comment preservation test

/* A comment, which associates with the next token. */
-- This comment also belongs with the next token
SELECT venues FROM LUFC.matches;

 SELECT /* This refers to the COUNT()
                    and will be rendered with a newline because it does not end
                    one the same line as the count token. */
          COUNT(players) in LUFC.squad;

SELECT 1 ; -- This will associate with the '1' and stay on the same line as that is the case in the source

-- Note that commas on the next line is a pattern I have seen a lot (associates with SELECT
SELECT
     Name  // associates with name, Origin is ir.Column for Name
    , Age -- associates with Age
    FROM /* associates with next table */ LUFC.players ;

-- Closing comments associate with a EOF, so probably ir.Batch 
/* Also with /* EOF */
  * etc
  */

Note that most dialects allow nested comments, our lexers may need to be adjusted.

Non-functional requirements:

changes to com.databricks.labs.remorph.intermediate package should be avoided if possible and iff we need to make a change to any structures in this package, we have to provide different alternatives.

Alternatives for the representation:

A: we add comment to TreeNode
B: we add new CommentNode and Comment case classes

"structuring comments" should {
      "option a: be part of LogicalPlan node as a field" in {
        // This is the _most straightforward_ way to structure comments
        ir.Project(namedTable("t"), Seq(ir.Id("c1")), comment = Some("This is a comment")) generates
          """/* This is a comment */
            |SELECT c1 FROM t""".stripMargin
      }

      case class CommentNode(child: ir.LogicalPlan, comment: String) extends ir.UnaryNode {
        override def output: Seq[Attribute] = child.output
      }

      case class ExpressionComment(child: ir.Expression, comment: String) extends ir.UnaryExpression

      "option b: introduce a separate node" in {
        // This structure doesn't change the framework's behavior
        ir.CommentNode(
          ir.Project(
            namedTable("t"), Seq(ir.Id("c1"))),
            "This is a comment") generates
            """/* This is a comment */
               |SELECT c1 FROM t""".stripMargin
        )
      }

      "comments on expressions" should {
        /**
         -- this is a top-level line
         SELECT
          c1 -- this is a comment
         FROM t
         */
        "option a: adding a comment to a column" in {
          ir.Project(namedTable("t"), Seq(
            ir.Id("c1", comment = Some("this is a comment")),
          ), comment = Some("This is a comment"))
        }

        "option b: introducing a separate node for the comment" in {
          ir.CommentNode(
            ir.Project(
              namedTable("t"), Seq(
                ExpressionComment(
                  ir.Id("c1"),
                  "this is a comment")
              )),
              "This is a comment")
        }
      }
    }

python corner cases:

// SQL:
//         -- this is a top-level line
//         SELECT
//           c1 -- this is a comment
//         FROM t
//         WHERE foo /* HINT: x > z */ > bar;
//         ;
//
//         ---->
// Python:
//
//         import pyspark.sql.functions as F
//         # this is a top-level line
//         spark.table("t") \
//          .select(
//            F.col("c1").alias("c1") # this is a comment
//          ) \ # this is a trailing comment
//          .filter(
//            F.col("foo") > F.col("bar") # HINT: x > z
//          ) # this is a trailing comment

and in example IR optimizer rules for tranforming WHERE foo /* HINT: x > z */ > bar;"

override def apply(expr: ir.Expression): ir.Expression = expr transformUp {
    // option A:
    case ir.And(left, right, comment = Some(text)) => py.Comment(methodOf(left, "and", Seq(right)), text)
   // and we'll probably have to change the invariants for ~200 nodes

   // option B:
    case ir.And(Comment(left, text), right) => py.Comment(methodOf(left, "and", Seq(right)), text)
    case ir.And(left, Comment(right, text)) => py.Comment(methodOf(left, "and", Seq(right)), text)

    // option C:
    case x @ ir.And(left, right) if x.hasComment => py.Comment(methodOf(left, "and", Seq(right)), text)
   // and we'll probably have to change the invariants for ~200 nodes

flowchart TD
    antlr --> lib
    lib --> commonlex.g4

    lib --> grammars

    grammars --> snowflake
    snowflake --> SnowflakeLexer.g4
    snowflake --> SnowflakeParser.g4
    snowflake --> basesnowflake.g4
    SnowflakeLexer.g4 --> commonlex.g4
    SnowflakeLexer.g4 --> basesnowflake.g4

    grammars --> tsql
    tsql --> TSqlLexer.g4
    tsql --> TSqlParser.g4
    TSqlLexer.g4 --> commonlex.g4
    TSqlLexer.g4 --> basetsql.g4
    tsql --> basetsql.g4

Following a number of tentative PRs and discussions, I am proposing to discuss a lightweight spec here (as opposed to a full solution architecture spec which is overkilling for the discussion we need). We have a number of technical constraints:

using ANTLR, we can't simultaneously read comments and sql without rewriting the grammars and making them a complete mess. Attaching comments to nodes can only be done as a second pass, or a transformation. This requires attaching parsing information to the IR nodes
we're using catalyst as our IR, and catalyst has no data structure for attaching parsing information
changing catalyst significantly makes it impossible in the future to merge our changes with the original codebase which we have forked in order to shied ourselves from its instability

Per my various experiments (that work) the following topics need to be discussed. I'll get into details in separate comments.

1 - Origin is not fit for purpose 2 - A parsing location needs to be attached during building of the IR nodes 3 - Our simple parsing pipeline does not allow a second pass 4 - Attaching comments to nodes can be done in various ways

1 - Origin is not fit for purpose

The definition of Origin is as follows:

case class Origin(
    line: Option[Int] = None,
    startPosition: Option[Int] = None,
    startIndex: Option[Int] = None,
    stopIndex: Option[Int] = None,
    sqlText: Option[String] = None,
    objectType: Option[String] = None,
    objectName: Option[String] = None)

This structure is not accurate enough:

fields are optional, which requires a lot of avoidable code when using the structure
not sure what startPosition means vs startIndex
it's missing a stopLine, a startTokenIndex and a stopTokenIndex without which
objectType and objectName are meaningless in our context

Moreover, the origin field in TreeNode is initialized via a singleton - no further comment - which makes it very flaky.

Due to backwards compatibility concerns, we can't refactor Origin.

Instead, I propose to introduce the following:

case class ParsedLocation(line: Int, column: Int, tokenIndex: Int)

object ParsedLocation {
  val empty: ParsedLocation = ParsedLocation(-1, -1, -1)
}

case class ParsedLocationRange(start: ParsedLocation, end: ParsedLocation) {
  def isAfterLine(line: Int): Boolean = start.line > line
}

object ParsedLocationRange {
  val empty: ParsedLocationRange = ParsedLocationRange(ParsedLocation.empty, ParsedLocation.empty)
}

and a locationRange field in the TreeNode class

StartLine is the human interpretation of how many \n there are StartPosition is the position in that line StartIndex is the offset from the very beginning of the text End versions are the same.

I think Origin is fit for purpose information wise for resolving which IR node is deemed closest to a collection of comments. Though we can extract comments from the token stream, I think we could just build a collection of them directly in the common lexer.

We can talk about it next week.

2 - A parsing location needs to be attached during building of the IR nodes

The question arises as to how we attach such location to the node. There are various options:

attach it to a val field at construction time. This is the perfect time during parsing. It significantly affects existing catalyst code because it requires introducing an optional parseLocation parameter to the constructor(s) of any class that inherits from TreeNode. Using it does not require more code than would be required to populate a singleton in a deterministic way. It does not affect performance. Since the field is defaulted, it does not guarantee that it's properly populated. This was prototyped (using Origin) in #1189
attach it to a val field via a withParsedLocation(newLocation) transform that returns a copy of the TreeNode with the new parseLocation. It does not significantly affect existing catalyst code. This is elegant but requires more code than 1. , and will affect performance (copying is not free). It does not guarantee that the parseLocation field is properly populated.
make parseLocation a var field, attach it via a withParsedLocation(newLocation) method that returns the same TreeNode instance with the updated parseLocation. This requires more code than 1, same amount of code as 2. It goes against the FP style of catalyst, but is more performant than 2. It does not guarantee that the parseLocation field is properly populated.
automate 3 using helpers that ensure that the parseLocation field is properly populated. This is prototyped in #1203

I prefer 1 because it augments catalyst with a crucial capability. The lack of guarantees can be mitigated by a simple test that travers the tree and checks that each node has a non-default parseLocation

Maybe something like this is more practical to implement:

Gather a collection of all the comments directly within the lexcommon.g4, then they can be skipped as tokens We have a CommentManager or CommentProcessor that is given the collection to manage When any IR node is created, we use the closest ctx or token to create a map of Ir nodes to Origins. If the optimizer creates new nodes and replaces others, then it needs to update the node map (but then we are going to have to do that anyway I think). nodeMapper.replace( a, b) nodemapperDelete(a, b) `...

To process comments:

sort both collections via location (index, or maybe line and col) or store them that way in the first place
before or after code gen for a node, we ask the comment manager if the node we are processing is the closest one to any comment
Because the rules are well defined, each comment will only be closest to a single tree node
Generate the comment(s) we are given back
Generate the code for the node
Ensure we do not forget about trailing comments where no node will follow it.

Of course, this not very Catalyst like. But

We do not need to change Origin
We do not need to change TreeNode Ir (unless we id it by say a UUID)

We can adorn the Ir with Origin as per Valentin's idea as well, though I think as at some point we need to resolve which comment is for which node anyway. I think the above is very simple at least.

3 - Our simple parsing pipeline does not allow a second pass

Context: grammars focus on meaningful content. Whitespace and comments are parsed but sent to hidden channels, thus skipped during matching of grammar rules. Without this, since comments could be literally anywhere in the source code, a rule like the following: CREATE (OR REPLACE)? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) (IF NOT EXISTS)? dotIdentifier CLONE dotIdentifier would have to be written as follows: COMMENT? CREATE COMMENT? (OR COMMENT? REPLACE COMMENT? )? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) COMMENT? (IF COMMENT? NOT COMMENT? EXISTSCOMMENT? )? dotIdentifier COMMENT? CLONE COMMENT? dotIdentifier COMMENT?

Using ANTLR, you can only have one input stream, and that stream can only listen to 1 channel. Thus, you cannot simultaneously read from the main and the comments channel. It needs to be done in 2 passes.

Our current parsing pipeline (in PlanParser) assumes that all can be done in 1 ANTLR pass, it chains the input stream, the lexer, the token stream, the parser and the plan generator in a way that makes it impossible to access the token stream when building the plan (or once it's built):

def parse(input: Parsing): Transformation[ParserRuleContext] = {
    val inputString = CharStreams.fromString(input.source)
    val lexer = createLexer(inputString)
    val tokenStream = new CommonTokenStream(lexer)
    val parser = createParser(tokenStream)
    addErrorStrategy(parser)
    val errListener = new ProductionErrorCollector(input.source, input.filename)
    parser.removeErrorListeners()
    parser.addErrorListener(errListener)

    update { case _ =>
      input
    }.flatMap { _ =>
      val tree = createTree(parser)
      if (errListener.errorCount > 0) {
        lift(PartialResult(tree, ParsingErrors(errListener.errors)))
      } else {
        lift(OkResult(tree))
      }
    }
  }

Beyond the fact that createTree knows nothing about the tokenStream (which must be addressed), the above assumes 2 things:

ANTLR will always be used for parsing (which contradicts our Multiplexer strategy)
there will never be a need for a custom pipeline (such as requiring a TokenStreamRewriter)

PRs #1200 and #1201 provide a simpler yet more flexible approach: a concrete PlanParser is expected to produce a LogicalPlan from a Parsing input, regardless of the underlying parsing technology being used. This somewhat duplicates the pipeline logic if parsers use ANTLR, but that's a very small cost compared to the flexibility being provided.

We need to opine on this.

3 - Our simple parsing pipeline does not allow a second pass

Context: grammars focus on meaningful content. Whitespace and comments are parsed but sent to hidden channels, thus skipped during matching of grammar rules. Without this, since comments could be literally anywhere in the source code, a rule like the following: CREATE (OR REPLACE)? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) (IF NOT EXISTS)? dotIdentifier CLONE dotIdentifier would have to be written as follows: COMMENT? CREATE COMMENT? (OR COMMENT? REPLACE COMMENT? )? (STAGE | FILE FORMAT | SEQUENCE | STREAM | TASK) COMMENT? (IF COMMENT? NOT COMMENT? EXISTSCOMMENT? )? dotIdentifier COMMENT? CLONE COMMENT? dotIdentifier COMMENT?

But I am saying we pick them up in the common lexer then skip them. They won't even be in the token stream.

Using ANTLR, you can only have one input stream, and that stream can only listen to 1 channel. Thus, you cannot simultaneously read from the main and the comments channel. It needs to be done in 2 passes.

But now we do not need two passes. But even if we did, we can modify the generic plan parser to take care of that in a simliar way that generate does.

Our current parsing pipeline (in PlanParser) assumes that all can be done in 1 ANTLR pass, it chains the input stream, the lexer, the token stream, the parser and the plan generator in a way that makes it impossible to access the token stream when building the plan (or once it's built):
def parse(input: Parsing, commentCollector: CommentProcessor /* more likely add to Parsing() */): Transformation[ParserRuleContext] = {
    val inputString = CharStreams.fromString(input.source)
    val lexer = createLexer(inputString)
    val tokenStream = new CommonTokenStream(lexer)
    val parser = createParser(tokenStream)
    addErrorStrategy(parser)
    val errListener = new ProductionErrorCollector(input.source, input.filename)
    parser.removeErrorListeners()
    parser.addErrorListener(errListener)

    update { case _ =>
      input
    }.flatMap { _ =>
      val tree = createTree(parser)

       // Here we create the comment map created by the lexer and put it somewhere
       commentMap.load(lexer.comments)

  if (errListener.errorCount > 0) {
    lift(PartialResult(tree, ParsingErrors(errListener.errors)))
  } else {
    lift(OkResult(tree))
  }
}

}



Beyond the fact that `createTree` knows nothing about the `tokenStream` (which must be addressed), the above assumes 2 things:

We don't/should not need the token stream at that point. Ir creation should not rely on the token stream, but use the tokens it needs in the ParserContext. However, the Result can carry information from previous phases should we ned to do that and we can find the token stream should we need it.

ANTLR will always be used for parsing (which contradicts our Multiplexer strategy)

No - that is not where the replacement converter will go, that is far too low level. There will be some sort of program level interface defined (say yml that says 'her is the input location' 'here is the output location') and a different converter would probably be an entirely different process/engine.

there will never be a need for a custom pipeline (such as requiring a TokenStreamRewriter)

We can insert more phases in PlanParser./

PRs #1200 and #1201 provide a simpler yet more flexible approach: a concrete PlanParser is expected to produce a LogicalPlan from a Parsing input, regardless of the underlying parsing technology being used. This somewhat duplicates the pipeline logic if parsers use ANTLR, but that's a very small cost compared to the flexibility being provided.

I do not agree with this. We are not solving a problem that we have right now and at the moment I cannot see why we would need something like a token stream rewriter.

We need to opine on this.

We are not solving a problem that we have right now

We do have a problem to solve. The fact that the provided solution also solves a future problem is a side-effect but things can't stay as is.

4 - Attaching comments to nodes can be done in various ways

Once comments are discovered, and the node to which they should be attached is located, we need to technically attach them to that node. They need to be attached before (for example line comments that appear before a statement) or after (for example line comments that appear after some code on the same line)

There are at least 3 ways comments could be attached to nodes:

A - using preComments and postComments fields. This is straightforward, and experimented in PR #1203. It could be generalized by moving these fields to LogicalPlan or even TreeNode

B - adding comments as children. This is tricky for 2 reasons:

nodes inheriting from LeafNode are expected to NOT have children.
children are below, not before or after, so we can't just add them as children

C - wrapping nodes in Comment nodes. The Comment node would contain the commented node. This seems handy at first glance because it does not require changing the IR model. But it pushes complexity to actual nodes and transformers, that now need to be aware of comments when using or manipulating the tree. For example, let's consider the following existing code:
```
case class Project(input: LogicalPlan, columns: Seq[Expression]) extends UnaryNode {
override def child: LogicalPlan = input
// TODO: add resolver for Star
override def output: Seq[Attribute] = expressions.map {
case a: Attribute => a
case Alias(child, Id(name, _)) => AttributeReference(name, child.dataType)
case Id(name, _) => AttributeReference(name, UnresolvedType)
case Column(_, Id(name, _)) => AttributeReference(name, UnresolvedType)
case expr: Expression => throw new UnsupportedOperationException(s"cannot convert to attribute: $expr")
}
} 
```
Using Comment nodes, the above code would need to change significantly since every Expression could now be a Comment (that wraps an Expression)... This will require massive changes

Whatever we decide, nodes are supposedly immutable, so this is not a trivial task. We can either accept that it's ok to have mutable nodes during parsing, or accept that we need to run a full transformation on the entire tree, which affects performance

Overall we are extremely constrained by the software architecture decision to use Catalyst for our IR, and the informal goal of contributing back to it. Catalyst is designed for optimizing execution plans on distributed computing platforms. That is its sole purpose. It addresses a problem we don't have: we only do transpilation on a desktop. To help achieve its original purpose, it enforces the FP programming model which is great for running distributed software on a platform, but does not fit well with ANTLR which is deeply OOP. Moreover, we've forked the source code, but we're only allowed to evolve it marginally, because we have the ambition to contribute back. This impedance mismatch makes it very difficult to perform parsing tasks, at least using ANTLR, and only in order to satisfy the above, forces us to produce massive and convoluted code (see #1203), or avoid using standard ANTLR mechanisms (see https://github.com/databrickslabs/remorph/issues/869#issuecomment-2479371073).

We could at minimal consider alternative options:

have parsers build a concrete representation that closely follows that of the grammar and is designed for parsing using ANTLR. That representation would then be converted to a Catalyst LogicalPlan, and we would still benefit from Catalyst's transformation capabilities.
drop the goal of contributing back to catalyst, or even incorporating its evolutions. This would drop the requirement of not implementing significant changes to our copy of Catalyst, and relax our strict adherence to the FP model, at least during parsing.
... other proposals are welcome

databrickslabs / remorph