semantic: Discrepancy in parsing of header comments

r0mainK commented 4 years ago

Basically, if we have:

// comment 1
// comment 2
package foo
// comment 3
// comment 4
import "bar"

Then:

comment 1, comment 2, comment 3 and comment 4 go in the Comments attribute of the root node
comment 1 and comment 2 go in the Doc attribute of the root node
comment 3 and comment 4 go in the Doc attribute of the GenDecl node in the root node's Decl attribute

However, if we have:

// comment 1

// comment 2
package foo
// comment 3

// comment 4
import "bar"

Then:

comment 1, comment 2, comment 3 and comment 4 still go in the Comments attribute of the root node
only comment 2 goes in the Doc attribute of the root node
only comment 4 goes in the Doc attribute of the GenDecl node in the root node's Decl attribute

Basically, any comments placed before a newline before the package/import keywords get removed from the Doc attributes of the respective nodes.

dennwc commented 4 years ago

This is how Go identifies which comments correspond to which node: if the comment is immediately before the line, it's attached to a node.

In terms of Babelfish, this may be considered a bug, but not for the Native mode, since it's by definition "whatever the native parser produces".

However, I think the same is true for Semantic, so we should place comments to corresponding places at lest in that mode.

r0mainK commented 4 years ago

Okay, updating the title to sematinc: ... then.

kuba-- commented 4 years ago

It's somehow similar to how CDT handles comments in cpp-driver. I think we've already discussed it, and we decided that it's hard to define which comment belongs to what statement moreover what does it mean that comments belongs to some nested node in tree. As a consensus we agreed to put comments like flat list on the same, top level under root. In other words it will be no nested comments assigned to some nodes. WDYT?

dennwc commented 4 years ago

@kuba-- Correct, but I was thinking about CDT issue in terms of Native mode. Then we can add a processing stage for Semantic to place those comments (as separate nodes) inside the tree based on positions.

Same can be done for Go here. But we may need to pull some comments out from existing nodes first.

kuba-- commented 4 years ago

But I'm not sure if it's really needed. Comments are comments: text + position. I'm not sure if we really need to complicate it. So far (based on use cases), I don't see the reason.

Personally, I would keep them as a flat, separate branch from root node. I would even argue, that it may simplify analysis (avoid some comments noise). And if we really need some comments analysis, maybe we can write a tool/lib, which let match nodes to comments based on positions.

dennwc commented 4 years ago

Yeah, I agree about simplicity of traversal. This is one of the reasons why I advocate for graphs: comments may point to nodes, not nodes to comments. This way they won't appear in AST during the traversal, but are still easily accessible.

But for now we have to work with trees. Having a flat list of comments with positions works for advanced users that can map them back to nodes, but not for a simple use case. For example, if you open play.bblf.sh and paste the code with comments, you will expect them to appear near relevant nodes (but not in Native, as the issue points out).

kuba-- commented 4 years ago

I would ask opposite question - do people really care where they see comments in tree. I know what we have in bblfsh playground, so far. My question is what could be wrong if we change it? And honestly, I have a feeling that people who are focused mainly on comments they will have an easier task to parse it. Maybe we should ask first, what can be use case, instead of trying to guess. I really believe that having comments as independent branch (from source code) and let them match nodes (later based on positions) can be easier to implement and more readable comparing to what we have (where we gather info, first and then propagate it)

r0mainK commented 4 years ago

From a user's point of view I would say this. Currently in some drivers, this one included, we have both implemented:

a flat list at the root node of the UAST, here the Comments attribute
nodes containing comments near the relevant nodes

Now, I would tend to say you ought to keep only the first and actually generalize this, in semantic mode, to all drivers. That is because as you said it's easy to map the comments using positional information if need be, comments do not need to be augmented with structural information imo, and I tend to think that for a lot of use cases one would simply extract all comments. That being said, I use Babelfish often, so I might be biased, and I can see how having comments at their correct position might help.

What I think is that, unless you want to follow a data-sparse approach in which case picking one or the other should be done, then there is no problem with having both. But in that case, I think it would be good to say so in the doc, possibly port the flat representation to drivers where it does not exist, and finally fill the gaps where they exist, e.g. here where some comments appear only in the flat representation - because these kind of discrepancies will definitely confuse users.

creachadair commented 4 years ago

Mapping comments to syntactic constructs is very tricky to get right, and relies on conventions that differ across languages. There is no single rule that gives the "right answer" across languages, because the way people write comments feeds back on how the tools interpret them—meaning you write comments in different layouts if you expect different interpretation.

If we want to pick a universal rule, it's going to have to be pretty simplistic: Having a flat list of comments for the whole file, for example, should always work—even if we have to synthesize it for languages that don't provide it natively. The corner cases about what counts as "adjacent" are very tricky, and are not treated the same way between tools like Doxygen, Javadoc, gofmt, clang-format, and so on.

I agree we should try to provide users with a reasonable experience, but I don't foresee being able to pick a simple rule that accomplishes that. I did an extensive survey of comment layout rules at my last job, and if I still have the notes from it I will try to post them somewhere so you can get a sense of how annoyingly inconsistent people are about their expectations. 😀

bblfsh / go-driver

semantic: Discrepancy in parsing of header comments #56