atom / language-javascript

JavaScript language package for Atom
Other
194 stars 235 forks source link

Implement new indentation logic based on tree-sitter #594

Open chfritz opened 6 years ago

chfritz commented 6 years ago

Enhancement

Description

As discussed in various places, including https://github.com/atom/atom/pull/10384#issuecomment-174019550, there is an opportunity to implement a new indentation logic that heavily exploits the output from the new tree-sitter parser. This makes it possible to address a number of issues users have been seeing with indentation, including in other languages, e.g., https://github.com/atom/atom/issues/6655.

Status

I've started developing this in my sane-indentation package, and version 0.9.3 already use it. It works very well so far and without having gone through the above linked list of issues in detail, it seems to fix a good number of them already. I'm showing an example of its current indentation below.

Proposed Approach

I plan to complete the work for this new logic in the package for JavaScript. Once that is done, i.e., it fixes all known indentation issues, so long as they are not mutually exclusive (for instance, because they were just a matter of opinion), then I propose we discuss porting that code into this package (language-javascript). If and when that is completed, i.e., we all agree that this is the right approach, then it should be relatively simple to port this to other languages as well.

Example

foo({
    sd,
    sdf
  },
  4
);

foo( 2, {
    sd,
    sdf
  },
  4
);

foo( 2,
  {
    sd,
    sdf
  });

foo(2,
  4);

var x = [
  3,
  4
];

if (true) {
  foo();
  bar();
} else {
  foo();
  bar();
}

if (true)
  foo();
else
  bar();

const jsx = (
  <div
    title='start'
    >
    good
    <a>
      link
    </a>
    <i>
      sdfg
    </i>
    <div>
      sdf
    </div>
  </div>
);

const two = (
  <div>
    <b>
      test
    </b>
    <b>
      test
    </b>
  </div>
);

const x = {
  g: {
    a: 1,
    b: 2
  },
  h: {
    c: 3
  }
}

/* multi-line expressions */
req
  .shouldBeOne();
too.
  more.
  shouldBeOneToo;

const a =
  long_expression;

b =
  long;

b =
  3 + 5;

/**
  Comments
*/

while (mycondition) {
  sdfsdfg();
}
chfritz commented 6 years ago

Feel free to assign this to me.

lee-dohm commented 6 years ago

@maxbrunsfeld Do you want to weigh in on this?

maxbrunsfeld commented 6 years ago

@chfritz This sounds really great; I'm glad that you're working on this! I like the way that you've already structured your package to separate the per-language configuration from the core logic. I always envisioned some system like this.

I propose we discuss porting that code into this package (language-javascript)

I think that once we're confident, we should try to generalize the system to one or two other fairly different languages, like Python, HTML, or Ruby. Then, once that is working fairly well, we should port the core logic to Atom core, and the configuration to the language packages.

In atom core, we could look for an optional indent-config section in the grammar. If it's there, we could use your new logic, and if it's not, we could fall back to the old logic.

chfritz commented 6 years ago

Sounds good @maxbrunsfeld .

One issue I'm struggling with right now is when the part of the program I'm in is in a state of invalid syntax. The tree-sitter parser then fails for this part. A part here is any section of code that starts at the file scope (indentation zero).

For instance, the following would result:

if (myTest) {
  const all = "good";
}

if (myTest2) {
const failsToIndent = "because_there_is_no_closing_bracket";
const object = {
andAlsoThisfails: "for_the_same_reason";

For now, when parsing fails, my code suggests to stay at the indentation of the previous line, but that's brittle.

So my question is, does tree-sitter support a "loose" parsing of the code, like acorn-loose? On https://github.com/tree-sitter/tree-sitter it says that it is "Robust enough to provide useful results even in the presence of syntax errors" and I've looked at some of the links regarding error recovery in LR parsers, but I wasn't sure whether that robustness is already being "used" in atom right now. If so, meaning we'll have to deal with how it is now since error recovery is obviously hard, then it might make sense to indicate the broken syntax to the user (e.g., using red underlining). That would make it easier to recover and also help him understand why indentation may be wrong at that point.

Any comments on this? I'm sure you've thought about this problem a fair bit already, especially since it is also an opportunity (help finding syntax errors would certainly be appreciated by the user).

maxbrunsfeld commented 6 years ago

I wasn't sure whether that robustness is already being "used" in atom right now.

Yeah, it already is; that's the feature that basically allows the majority of the file to parse while other parts have errors.

it might make sense to indicate the broken syntax to the user (e.g., using red underlining).

Yeah, I've experimented with this in the past, but I haven't chosen to go that route. The main reason is that in some languages like C, because you can do such unpredictable things with the preprocessor, we'll never be able to parse everything without errors; instead, we have to rely on error recovery for presenting useful results to the user.

I could imagine an approach where, if there's an error at the start of the current line or the end of the previous line, you inspect the tokens within the error, and use those tokens to compute the suggested indent level. For example, if you're in an error, but the previous line ends with a { or = token, you might want to indent.

The good news is that even when there's an error, we still have tokens to work with; we don't have to look at the raw text, so we don't have to worry about { or = characters occurring within strings, or that sort of thing. It's not fully thought through, but I think it may be worth exploring.

chfritz commented 6 years ago

OK, so I've made a bunch of progress within the package. It now has support to Javascript (incl. JSX), HTML, Python, C and C++. The latter three could use some more testing, especially some of the C++11 features, which I'm not yet familiar with. But I think the overall structure is more or less complete and it is just a matter of configuration now, at least for these languages. You can check out how it currently indents by looking at the sample files in https://github.com/chfritz/atom-sane-indentation/tree/master/spec/fixtures. The current indentation in these files is how the package currently suggests it.

Are there any other languages you think we should try before we think about creating a PR into atom core? I haven't done Ruby yet because I don't know it and can't quite judge what's right from wrong. But if someone can provide me with a reasonably exhaustive sample file (exhaustive in the sense of containing all language features) that is already indented correctly, then I'd be happy to try it.

chfritz commented 6 years ago

@maxbrunsfeld I've gone ahead and created a PR in core and a corresponding PR in language-javascript with the configuration required for JavaScript. If and when that is reviewed and merged, I can port the configuration I already have for the other languages I'm listing above (and which are already in the sane-indentation package code).