Calculate leading/trailing comments

nzakas commented 7 years ago

Problem

Right now, leading and trailing comments are calculated in Espree and passed to ESLint. ESLint depends on these comments from things like configuration and determining if someone has opted-out of a rule (such as providing a comment in an empty block). That works well when using Espree, but if you're using a different parser, it forces that parser to reimplement the same comment attachment strategy as Espree, which is a lot of work and not very easy. Case in point, we're working on typescript-eslint-parser and are a bit stuck on getting the same leading/trailing comment information from Espree.

Solution

If we didn't require parsers to provide leading/trailing comments, and instead calculated them inside of ESLint, we dramatically decrease the complexity of creating ESLint-compatible parsers. We could just require parsers to pass in the comments array, and we could figure out where the comments belong on-demand when a rule calls context.getComments(). This is probably a bit tricky, as we'd want to ensure we get the same results that Espree gives us right now, but ultimately, I think this would help the parser ecosystem tremendously and, as a side benefit, normalize ESLint behavior across parsers.

ilyavolodin commented 7 years ago

Question, does Espree currently makes a second pass to attach comments? Moving this into ESLint would require second pass. If Espree does it in a single pass, moving it into ESLint core will slow things down quite a bit.

nzakas commented 7 years ago

Espree does the attaching as it parses, so there's no second pass. This calculation does add extra time to the parser -- I'll get some hard numbers when I'm able.

What I'm suggesting did not require a full traversal so it wouldn't add a second pass.

I don't believe what I'm suggesting will slow things down a lot because we will likely not be calculating comments for every node, only for the ones specifically requested by rules. In theory, this could be more efficient than what we are doing now due to speeding up Espree parsing.

I don't think we should evaluate this idea based on performance, though. It should be evaluated on the basis of being the right choice for the ecosystem. If it's the right choice, we can deal with implementation issues as they occur.

ilyavolodin commented 7 years ago

We shouldn't evaluate idea purely on performance, but it's one of the variables that goes into the list of pros and cons. If performance isn't going to suffer greatly, I don't think it should be blocking. Out of curiosity, which parsers out there do not support attaching comments? Esprima can do it, right? So can Acorn and Babylon. Are we talking about future hypothetical parsers?

mysticatea commented 7 years ago

I think, simply we can publish the package to attach comments as postprocessing to npm? The developers of parser plugins can use it if necessary.

nzakas commented 7 years ago

@ilyavolodin my point is that we should never say, "I think this will be slow, so let's not do it." We should say, "this is the right thing to do, so let's see if we can make it work." Your opening position was that this would be slow without mentioning if you thought the idea had merit or not. We shouldn't operate like that.

I think you'rea bit confused about comment attachment. I'm talking about adding leading/trailing comment arrays to nodes. Esprima does that, but in a way that's different from Espree such that not all of our tests pass. Acorn does not do that at all. Babylon does comment attachment in a different way. And as I mentioned in the original description, we are running into trouble trying to replicate Espree comment attachment in typescript-eslint-parser.

@mysticatea I don't think we can do that. Espree's logic happens during parsing, and what I'm talking about is only calculating comments dynamically when requested.

ilyavolodin commented 7 years ago

@nzakas Sorry, I think you misunderstood me. I haven't worked with comment attachments before, those messages above were honest questions, not sarcastic way to say we don't need this. I was trying to gather information about how this stuff currently work, to understand how proposed modification might affect it.

kaicataldo commented 7 years ago

@ilyavolodin Acorn provides a hook (check out the onComment option in https://github.com/ternjs/acorn#main-parser) that Espree uses to collect and attach comments as it parses, which allows us to process the comments in a single pass.

I love the idea of moving this into ESLint core! The fix I worked on for a comment attachment bug in Espree ended up having to be manually copied over to Babylon, and took a while to get merged (it wasn't ported over by me, but I was watching the progress on it). If we could provide this it would definitely help out our users.

As an aside, this is something I'd be happy to look at and work on after we make some more progress on the JSCS compatibility issues.

nzakas commented 7 years ago

@ilyavolodin I didn't think you were being sarcastic, I was just trying to get focus in on whether or not this seems like a good idea absent implementation concerns.

@kaicataldo cool! Let's see if others like this idea, too. If so, I'll share some of the implementation thoughts I have.

JamesHenry commented 7 years ago

Well... you know I love this idea 😄

nzakas commented 7 years ago

If there are no objections, I'd like to mark this as accepted because it's a big blocker for the TypeScript work.

nzakas commented 7 years ago

TSC Summary: Currently, Espree attaches comment information to nodes (via leadingComments and trailingComments) and then ESLint uses that information. This expectation means that every parser that wants to work with ESLint must implement the same algorithm for comment attachment. This proposal is to move the comment attachment calculation into ESLint core, so that parsers just have to pass in the top-level comments array and ESLint will figure out where it goes. At the moment, this is a blocker for the TypeScript parser.

This is also potentially a breaking change since it would remove the leadingComments and trailingComments properties from nodes. While we explicitly say to use sourceCode.getComments() instead, I have come across instances in the wild of people relying on leadingComments and trailingComments. So we may need to flag this as a breaking change.

TSC Question: Is it okay to move forward with implementation?

ilyavolodin commented 7 years ago

Accepted per 8/18 TSC meeting.

JamesHenry commented 7 years ago

Excellent news! I feel somewhat responsible for this, so if I can be of any help let me know 😄

platinumazure commented 7 years ago

@JamesHenry Are you volunteering to implement the whole thing? I think you are-- thank you very much! :stuck_out_tongue_winking_eye:

nzakas commented 7 years ago

@kaicataldo @JamesHenry do either of you want to tackle this?

JamesHenry commented 7 years ago

I will take a stab at it and perhaps submit my work to @kaicataldo for discussion?

I believe it is quite a difficult task. Without an existing spec for how comments should be attributed, I believe we just have to find a way reverse-engineer existing test assertions into some well-defined logic. I attempted something similar to this already in the typescript parser but was forced to give up at the time.

Hopefully, now that the task is a bit better defined we should be able to get it over the line.

kaicataldo commented 7 years ago

@nzakas @JamesHenry Sounds like an interesting challenge! Happy to help in any way I can (whether it's working on it myself or collaborating/discussing with @JamesHenry). I've worked on comment attachment in Espree - so have some working knowledge in that area - but I'm not an expert by any means :)

nzakas commented 7 years ago

The expected behavior is in the Espree tests (https://github.com/eslint/espree/tree/master/tests/fixtures/attach-comments). Here's what I'd suggest as a path forward:

Convert the Espree tests into tests for the SourceCode#getComments() method such that they pass right now.
Turn of the attachComment option when ESLint calls parse().
Attack each test failure.

In my mind, the algorithm looks something like this:

For leading comments: Call getTokenOrCommentBefore() in a loop, adding each comment to an array. When a token is retrieved instead of a comment, you're done.
For trailing comments; Call getTokenOrCommentAfter() in a loop, adding each comment to an array. When a token is retrieved instead of a comment, you're done.
Store this information in a WeakMap so multiple calls to getComments() don't have to do the search again.

I think that should get you most of the way there.

JamesHenry commented 7 years ago

Thanks @nzakas, let's see how it goes :)

kaicataldo commented 7 years ago

@JamesHenry Are you planning on working on this? Sounds like you are, but I have time this weekend to take a look at this so wanted to confirm :)

JamesHenry commented 7 years ago

Well if you don't mind then I would love to take you up on that offer. I have about 7 other things I could be doing to keep the typescript-eslint-parser project motoring, and working on the comment attachment would naturally come at the expense of (some of) those.

If you don't get chance to finish or want a second pair of eyes on something please let me know, and thanks a lot for offering!

ariya commented 7 years ago

Can't escodegen.attachComments be used here?

kaicataldo commented 7 years ago

@ariya Had a little time to look into this this weekend. I tried using estraverse.attachComments - since it looks like that's what escodegen.attachComments calls - and got different results from the current implementation. Can definitely look into it more though!

Starting in on @nzakas's suggested path forward, I have the method itself working - however, have a question about a specific case. Right now, we attach the comment in the following example as a trailing comment to the ObjectExpression node:

function a() {
    var b = {
        // comment
    };
    return b;
}

This behavior seems wrong to me, though. I think the comment shouldn't be attached, since it's not leading or trailing the node, and is instead contained by it. On the flip side, I would expect the comment to be attached to the Property node inside the ObjectExpression node as a leading comment in the following:

function a() {
    var b = {
        // comment
        c: true
    };
    return b;
}

We don't attach comments that are inside nodes when the node isn't empty - not sure why we it makes sense to do it when it is. This could also be a bug in comment attachment that we just haven't encountered. Thoughts on this?

nzakas commented 7 years ago

@ariya We can't use escodegen.attachComments because the attachment is different. That's part of why we added comment attachment into Espree directly (also, doing it in Espree as much faster).

@kaicataldo that's an interesting case. I agree, it seems wrong to attach that comment as a trailing comment to the object expression, but on the other hand, where would we attach the comment otherwise? So it's a bug, but I think a useful one because without it, we would miss the comment in our traversal altogether (without some magic).

Do you have a suggestion for how we could solve this problem instead?

kaicataldo commented 7 years ago

We actually do the opposite if a parsed file only contains a comment and attach it as a leading comment to the program node (as opposed to a trailing comment in the example in my comment above). Is there any value to adding a new kind of comment key for cases where a node only contains a comment (what about a comments key? We could then just leave an empty file's program level key as comments as well)? At the very least, I think we should be consistent between this case and the empty file case and attach both as a leading or trailing comment.

Thanks for discussing!

nzakas commented 7 years ago

The empty-file-except-comment case is a special case, so I'm not sure we can look to that as a good example of what to do here.

Maybe it's worth having the concept of "child comments" when there are no child nodes except for comments? That would mean getComments() would return an object like:

{
    leading: [],
    children: [],
    trailing: []
}

kaicataldo commented 7 years ago

I'll have to figure out how to check the inside of nodes for comments, but I like that idea! Just to confirm, this is only when there is a comment but no child nodes to attach to, correct?

JamesHenry commented 7 years ago

Awesome progress, guys! Thanks for working on this @kaicataldo

nzakas commented 7 years ago

@kaicataldo that is correct. What I would check:

BlockStatement.body is empty (zero-length array)
ClassBody.body is empty (zero-length array)
ObjectExpression.properties is empty (zero-length array)

In those situations, then you should search into the syntax to see if there's a comment.

kaicataldo commented 7 years ago

Edit: I think I found an alternate solution. It's very much a WIP (need to refactor, port over the Espree tests, and actually turn comment attachment off when parsing with Espree), but for anyone interested in/willing to give feedback on the algorithm itself: https://github.com/eslint/eslint/tree/commentattachment

~~Thanks! I have another case I wanted to discuss, after working on this a bit more tonight.~~

When using the strategy we've discussed above (calling getTokenOrCommentBefore() and getTokenOrCommentAfter() in a loop and stopping when comments are no longer found), the following examples produce the following outputs:

var zzz /*aaa*/ = 777;

~~This example attaches /*aaa*/ to the Identifier zzz as a trailing comment.~~

var zzz = /*aaa*/ 777;

This example attaches /*aaa*/ to the Literal 777 as a leading comment. Both of these behaviors make sense to me, however, it doesn't match the current behavior of Espree, as shown in the following example:

var zzz /*aaa*/ = 777;

Espree currently attaches the comment /*aaa*/ as both a trailing comment to the Identifier zzz and as a leading comment to the Literal 777, which means that we have to allow for continuing past certain punctuation tokens while calling getTokenOrCommentBefore() and getTokenOrCommentAfter() in a loop.

~~I initially tried to just allow continuing over punctuation and checking the location of tokens, but that doesn't work with the following scenario:~~

function a() {
    var b = {
        // comment
    };
    return b;
}

~~In this case, the comment would be attached to the Identifier b, when it should only be a child comment of the object expression.~~

I got the current (and some additional) tests passing with what is essentially a rudimentary (and currently incomplete) operator whitelist, but was wondering if anyone else had ideas for a better way. Thanks for all your input!

nzakas commented 7 years ago

@kaicataldo can you provide a link to the changes or open a WIP PR? That makes it a lot easier for others to know where to look.

kaicataldo commented 7 years ago

Oh yeah, of course - will hopefully be able to pick this up and finish it in the next few days again - https://github.com/eslint/eslint/commit/97af7d512653b231db602403cc1d590f48de87b2

nzakas commented 7 years ago

Left some comments - overall looks good, just a few questions.

kaicataldo commented 7 years ago

I was able to look at this a little more tonight - feeling unsure of what the correct behavior is for comment attachment. The case I outlined above is a good example (copied and pasted here):

When using the strategy we've discussed above (calling getTokenOrCommentBefore() and getTokenOrCommentAfter() in a loop and stopping when comments are no longer found), the following examples produce the following outputs:

var zzz /*aaa*/ = 777;

This example attaches /*aaa*/ to the Identifier zzz as a trailing comment.

var zzz = /*aaa*/ 777;

This example attaches /*aaa*/ to the Literal 777 as a leading comment. Both of these behaviors make sense to me, however, it doesn't match the current behavior of Espree, as shown in the following example:

var zzz /*aaa*/ = 777;

Espree currently attaches the comment /*aaa*/ as both a trailing comment to the Identifier zzz and as a leading comment to the Literal 777, which means that we have to allow for continuing past certain punctuator tokens while calling getTokenOrCommentBefore() and getTokenOrCommentAfter() in a loop.

Do we only want to attach comments up until a non-comment token is found, both before and after the node? Or do we want to skip over certain punctuator tokens (such as the =, ===, and || in a VariableDeclaration, BinaryExpression and LogicalExpression, respectively)? Also should comments only be attached to the top-level node in a given location? i.e. In the following example, should a trailing comment be attached to the VariableDeclaration node but not be attached to the Literal or VariableDeclarator nodes, despite the comment being the next token after both children nodes:

//foo
var zzz /*aaa*/ = 777
//bar

I know it was mentioned here that the expected behavior is in the Espree tests, but the current behavior still seems buggy to me and I'm having difficulty parsing what the intended behavior is from potential bugs in how we currently attach comments (example: removing the semicolon at the end of this line leads to //foo being attached as a leading comment to the Literal 777, but not to its parent VariableDeclarator).

Retrieving comments by iterating through tokens seems easier to reason about and I think yields a more accurate result than how Espree currently does it (collecting comments as it parses and then attaching them as it finishes each node in the tree). This also appears to be how JSCS handles calculating the location of comments, if I understand this comment correctly. I guess what I'm ultimately trying to get at is a discussion of whether the current behavior is what we want to replicate as closely as possible (which I think would be difficult, given what I've outlined above), or if thinking about the location of comments via token list versus attached to nodes at the parser level frees us from some of the inconsistent/odd behavior outlined above and allows us to simplify things. Since this is already a breaking change, it seems like an opportunity to make things better! I really like the the algorithm @nzakas suggested here with the addition of the concept of children comments.

@nzakas Curious what you think of this - it sounds like your intent was to keep the same behavior we currently have. Hopefully this doesn't take us too far away from the original intent of the issue. Also happy to add this as a discussion point for the next TSC meeting, if you'd prefer a discussion in that setting.

Sorry for the wall of text!

kaicataldo commented 7 years ago

@eslint/eslint-team Any thoughts on my comment above?

nzakas commented 7 years ago

@kaicataldo my biggest concern about going too far from what Espree does is how that will affect custom rules. I like the new behavior you describe, as I think it makes more sense, but I'd like to see what sort of trouble that would get us into with existing custom rules. Maybe check against the react and import plugins to verify?

BTW, when posting a wall of text, its helpful to call out specific questions in bold and/or provide a tl;dr so it's easier to scan. I know some days I see a wall of text and just don't bother because I don't have the energy to focus and try to pick out the pieces that need a response.

Consequently, if I missed a question or two, feel free to let me know. :)

kaicataldo commented 7 years ago

Yeah, sorry about that - fair points about the wall of text, and appreciate you reading through it all. I'll do some exploration into this and report back!

kaicataldo commented 7 years ago

Having worked on this some more, I'm feeling more and more like we should try to get the same result from attaching comments in ESLint as we did in Espree. Even though my suggestion above might make more sense from an expected behavior standpoint, I don't think the benefits outweigh the risk of breaking plugins in the ecosystem.

I'm going to try and get it as close as I can to the current location of attached comments in Espree - might be a little tricky, because the way in which it's done in Espree is a totally different strategy (collecting comments and attaching them during parse vs iterating over token lists post-parse).

platinumazure commented 7 years ago

This is a bit out there, but... Could we support multiple comment attachment "strategies" and let users basically pick one?

On Oct 15, 2016 11:08 PM, "Kai Cataldo" notifications@github.com wrote:

Having worked on this some more, I'm feeling more and more like we should try to get the same result from attaching comments in ESLint as we did in Espree. Even though my suggestion above might make more sense, I don't think benefits outweigh the risk of breaking plugins in the ecosystem. I'm going to try and get it as close as I can to the current location of attached comments in Espree - might be a little tricky, because the way in which it's done in Espree is a totally different strategy.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/eslint/eslint/issues/6724#issuecomment-254026277, or mute the thread https://github.com/notifications/unsubscribe-auth/AARWeqErqDOybGrPAXrFa-aVE6p2cE3Aks5q0aNKgaJpZM4JRHUD .

mikesherov commented 7 years ago

One of the things I think JSCS did better than ESLint was how it treated comments. If we're going to be moving comments into ESLint, and we're at least partially doing that to take the responsibility away from parsers, now seems like a good time to at least deprecate trailingComments and leadingComments?

Getting them out of the AST also means we can accept any ESTree AST as long as it also provides an array of commments.

One way to do this deprecation is to only lazily generate leading and trailing comments by turning those properties into getter functions. We could provide a deprecation warning when those accessors get hit. Getters are typically slow, so I think speed measurements would be important here.

kaicataldo commented 7 years ago

@mikesherov Just to be clear, the proposal in this issue is turn off comment attachment and to move all this logic into sourceCode.getComments() which, when given a node, returns the comments for that node. The general strategy I've been looking into is outlined in this comment. As you can see from this giant wall of text, there are some challenges trying to recreate the exact same result we currently get with Espree, and I was hoping to simplify it.

It looks like the corresponding method in JSCS would only return comments that are either the current token or follow the current token, which, as you mentioned, is very different from the current strategy of attaching comments in Espree.

Could you speak more to what you think the benefits of doing it the way JSCS does are?

mikesherov commented 7 years ago

@kaicataldo, I think JSCS's way always acknowledged the lossy nature of the AST, and never attempted to attach comments to nodes.

You're starting to hit those limitations in doing your work, like you pointed out with var a /**/ = zzz;, but it's not about skipping punctuators, it's about the fact that individual AST nodes often contain more than one piece of syntax, and so the paradigm of leading, trailing, and even children is intellectually muddy. Consider:

/*1*/ class /*2*/ a /*3*/ extends /*4*/ b /*5*/ { /**/ } /**/ ; /**/

Conceptually, which of those comments should "leading" the classBody?
If you go the route of child comments, isn't it a bit inconsistent for that child comment to longer be a child once a class property is created to attach it as "leading" to?

Thoughts?

nzakas commented 7 years ago

@platinumazure no, users will have no idea how this will affect ESLint. This isn't something we should push to the users.

@mikesherov part of our complexity is that we have always traversed into comments just like we do other nodes, so you can say LineComment and BlockComment as visitors just like FunctionDeclaration. If we get rid of the concept of comments belonging to nodes completely, then that pattern breaks. That's the main impetus for keeping some concept of leading/trailing vs. moving to something more like was JSCS does (which I have no philosophical concerns about).

So the larger question is if we go the way of JSCS comment handling, how do we deal with the visitor pattern we've always supported for comments in rules? Is this a big enough improvement that we remove this capability and accept that this will cause pain for people with custom rules? Or do we find some way to still support this pattern?

mikesherov commented 7 years ago

@nzakas, thanks for the background info. I'm going to speak to @kaicataldo in person about this tomorrow... we work together :-)!

I'd like to understand some of the rules that use comment visitors so I can understand how tying them back to nodes is useful. My naive take is that we'd still be able to "visit" comments, but we'd then use ranges to figure out what syntax they relate to instead of traversal.

nzakas commented 7 years ago

@mikesherov ah nice, totally forgot you guys worked together. :)

There are a bunch of rules in core that use this functionality, if you search for "LineComment", you should find them. I did a cursory look, and it does look like many of those rules would be just as good using Program.comments and iterating over those instead, so maybe it's not that a big of a deal. Of course, we don't really know what people are doing in the wild.

kaicataldo commented 7 years ago

I'm continuing to slowly but surely work on this. After talking this over with @mikesherov, I'm not sure the concept of children comments really holds up. I think it's best if we start by trying to mimic the current behavior (while eliminating some bugs I've found in the process)/follow @nzakas's initial recommendation, and once we get that all working, we can evaluate and see if we want to make changes on top of that.

Here's a branch with the comment getting happening in ESLint rather than Espree with all the tests for util/source-code.js passing.

nzakas commented 7 years ago

@kaicataldo thanks for the update. Did you and @mikesherov talk about how JSCS handled comments and if it could make sense for us to just not traverse into comments and rely on getCommentBefore() in rules instead?

mikesherov commented 7 years ago

@nzakas, we talked over the fact that the current leading and trailing behavior is currently pretty buggy, and that pursuing the simpler strategy might lead to breaks in userland that rely on the buggy behavior.

That's why @kaicataldo is breaking it into two parts. First, we're doing what the original intent of this issue is: get rid of leading/trailing Comments attach to nodes.

Once that's been clarified a bit, we intend to use bigQuery to sample how userland uses these properties/functions, and will survey popular plugins so we have a bit of data to guide us on next steps that may be more drastic.

kaicataldo commented 7 years ago

Progress! I have a complete PR (mimics the Espree attachment strategy, all tests passing) open here: https://github.com/eslint/eslint/pull/7516

not-an-aardvark commented 7 years ago

Fixed by https://github.com/eslint/eslint/pull/7516

eslint / eslint

Calculate leading/trailing comments #6724