No way to scope a grammar capture

dead-claudia commented 7 years ago

Description

There is no way to scope a grammar capture for specific grammar rules, leading to very odd, buggy-looking syntax highlights.

Steps to Reproduce

Set the language to any that has a rule that features both of the following. language-gfm and language-javascript both fulfill this requirement in their grammars.
- A begin and end capture
- A pattern that delegates to another language that also has a rule containing both a begin and end capture. (It doesn't have to delegate, though.)
Create something that matches that rule. For GFM, a code block tagged with css would suffice, and for JS, a tagged template starting with html would work.
Within that block, type something that would match the begin half of any rule in the other language that includes both a begin and end. Do not include the end. For GFM or JS, just start a comment in CSS or HTML, respectively.
Observe very odd syntax highlighting. It looks like a bug because it can't be contained, and repros 100% of the time.

If you followed the above repro, the code might look like this pre-syntax highlighting:

GFM -> CSS

```css

/* This is an unterminated comment

```



Some more text...

JS -> HTML

html`

<!-- This is an unterminated comment

`;



someOtherCall();

Here's a few screenshots of other code.

language-gfm -> language-javascript:

screenshot from 2016-10-24 23-47-59

language-javascript -> language-html

screenshot from 2016-10-24 23-52-48

Versions

OS: Ubuntu GNOME 14.04 (although it's been repro'd in Windows by others) Atom: v1.12.5 (been going on for quite a while)

Additional Information

Initially reported in atom/language-gfm#171, but I filed it here because it's language-agnostic.

winstliu commented 7 years ago

Thought about this for a while and I think I have a possible solution: adding a new key, something like alwaysMatchEndPattern which is false by default. When true, it will always match the end pattern, even if it would be otherwise matched by another pattern.

It's ironic that most of the time, we complain about not having enough context when developing grammars, but in this case, we're hit by the problem of knowing too much context.

dead-claudia commented 7 years ago

@50Wliu

Thought about this for a while and I think I have a possible solution: adding a new key, something like allowsMatchEndPattern which is false by default. When true, it will always match the end pattern, even if it would be otherwise matched by another pattern.

SGTM. Is my understanding correct in that this would be executed before the child grammar is run, so the child grammar can be run through just a predetermined slice?

It's ironic that most of the time, we complain about not having enough context when developing grammars, but in this case, we're hit by the problem of knowing too much context.

Yeah...really unusual. :laughing:

winstliu commented 7 years ago

Yeah, at least I hope so. I haven't looked at the implementation to see if it can be done reasonably though.

burodepeper commented 7 years ago

I am curious if this issue isn't better specified as issues with the offending grammars.

This issue can also be reproduced using a match in the delegated grammar, instead of a begin and end, see: https://github.com/burodepeper/language-markdown/issues/77#issuecomment-272770554 (this is similar to the optional parameter issue reported with JavaScript)

The offending grammar in this case is language-yaml which has the following rule (https://github.com/atom/language-yaml/blob/master/grammars/yaml.cson#L299-L302) which matches, well, unquoted strings. I am not too familiar with YAML, and there is probably a use for it, but it seems like a tricky thing altogether if you ever want to be able to 'escape' the YAML grammar again.

I can understand the replications with comments, and while they serve their purpose to showcase the issue, I think they're valid situations, because they do what they're supposed to do; the broken syntax highlighting shows that there's an error in your code, and above all, something that's easily fixable. Syntax highlighting is supposed to help you with this.

The issue with the optional parameter in JavaScript embedded in Markdown is a lot worse. So is the issue with YAML. It is valid code that appears to have errors. And I might be naive, especially since there are most likely a lot more of these issues, but I think they have to be fixed in their respective grammars.

winstliu commented 7 years ago

the broken syntax highlighting shows that there's an error in your code, and above all, something that's easily fixable.

The problem is that it isn't invalid code.

<script>console.log('hi!'); //</script>

is perfectly valid HTML, for example. HTML will always end the script at a closing tag, even if it would otherwise be commented by JavaScript.

And for Markdown...

```js
console.log('hi!');
/*
```

is also valid, because GFM always ends at the triple backticks, much like HTML.

burodepeper commented 7 years ago

@50Wliu You are probably right, and now I understand your alwaysMatchEndPattern solution. Thanks for clearing that up.

dead-claudia commented 7 years ago

Status?

winstliu commented 7 years ago

No progress. Implementing this is most likely way beyond the scope of my knowledge, and the rest of the team is busy working on other initiatives, such as faster startup time and improved editor rendering.

winstliu commented 7 years ago

Making progress. It's been slow since I've had to console log my way through every step, but here's where I've gotten to so far. Note that I'll be using the following example (language-gfm + language-javascript):

```js
/*
``` <-- THIS LINE

findNextMatch starts the process to, well, find the next match: https://github.com/atom/first-mate/blob/e06b550732b30e7d50991c6cab87b238a7282eeb/src/rule.coffee#L55
We retrieve a scanner for the base grammar: https://github.com/atom/first-mate/blob/e06b550732b30e7d50991c6cab87b238a7282eeb/src/rule.coffee#L60. At this point it's looking good, because language-gfm is language-javascript's base grammar on line 3.
getScanner uses the base grammar's included patterns: https://github.com/atom/first-mate/blob/e06b550732b30e7d50991c6cab87b238a7282eeb/src/rule.coffee#L36. This is where the problem occurs. language-gfm's only included pattern at this point is the ending */ from language-javascript.
The scanner is created using only the included patterns: https://github.com/atom/first-mate/blob/e06b550732b30e7d50991c6cab87b238a7282eeb/src/rule.coffee#L37.
findNextMatch proceeds to try to find a */ match and has no idea that language-gfm is also looking for ```.

So it looks like I might know enough now to try implementing the alwaysMatchEndPattern option.

(Also, the reason it took me so long to get here was because I initially missed the getScanner step, since it seems fairly innocuous, and spent way too long console logging the internals of scanner.coffee before realizing my mistake)

winstliu commented 7 years ago

Progress: always-match-end-pattern-progress

Unfortunately as you can probably see we aren't returned to the language-gfm grammar. At least we broke out of the comment though.

winstliu commented 7 years ago

got-it Super ugly implementation right now though. I'll try to clean up some of the logic and publish a PR so more experienced people can take a look at it.

Took a lot of persistence after implementing alwaysMatchEndPattern because of the way the tags are matched to the scopes. Also didn't help that the assert mechanism for reporting inconsistent tags throws on my computer...

dead-claudia commented 5 years ago

So sorry to throw a wrench into this, but a fix for this should probably be extended to also cover the scenario of isolating child grammars, so their $self doesn't reference the parent grammar. Filed https://github.com/atom/first-mate/issues/112 separately, but it might belong in this bug instead.

atom / first-mate