Open MichaelAquilina opened 9 years ago
The fact that you can't match newlines in TextMate grammars was a conscious decision. At a very early stage the parser splits the input source text into lines and then matches rules line by line, carrying over only the metainfo about the currently active scopes that come from begin/end rules.
The main reason for this is performance: most of the time in a text editor you work on one line at a time. If you edit something in the middle of the file, the only parts that have to be processed again with the TextMate grammar are the line you are working on and anything that comes after it (think of opening a block comment in C-like languages).
If matching across multiple lines were possible, you'd also have to reprocess lines that come before the line you changed. In addition, the inputs for regular expression matching would be longer and would therefore take longer to process.
I hope this explains in enough detail why it was done this way.
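As a rough illustration of the line-by-line model described above, here is a minimal Python sketch. It is not how first-mate or TextMate is actually implemented; the names `tokenize_line` and `tokenize` are hypothetical, and the only state carried between lines is whether a C-style block comment is open.

```python
import re

# Hypothetical, heavily simplified model: one "begin/end" rule for /* ... */.
OPEN = re.compile(r"/\*")
CLOSE = re.compile(r"\*/")


def tokenize_line(line, in_comment):
    """Tokenize one line; return (tokens, in_comment_at_end_of_line)."""
    tokens = []
    pos = 0
    while pos < len(line):
        if in_comment:
            m = CLOSE.search(line, pos)
            end = m.end() if m else len(line)
            tokens.append(("comment", line[pos:end]))
            in_comment = m is None  # still open if no "*/" was found
            pos = end
        else:
            m = OPEN.search(line, pos)
            start = m.start() if m else len(line)
            if start > pos:
                tokens.append(("source", line[pos:start]))
            if m:
                tokens.append(("comment", line[start:m.end()]))  # the "/*" itself
                in_comment = True
                pos = m.end()
            else:
                pos = start
    return tokens, in_comment


def tokenize(lines, start=0, states=()):
    """Tokenize lines[start:], reusing the cached end-of-line states before start."""
    states = list(states[:start])
    in_comment = states[-1] if states else False
    result = []
    for line in lines[start:]:
        tokens, in_comment = tokenize_line(line, in_comment)
        states.append(in_comment)
        result.append(tokens)
    return result, states


lines = ["int a; /* start", "still comment", "end */ int b;"]
full, states = tokenize(lines)

# Edit line 1 only: nothing before it needs to be looked at again.
lines[1] = "still */ x"
redone, new_states = tokenize(lines, start=1, states=states)
```

After the edit, only lines 1 and 2 are tokenized again, starting from the cached state at the end of line 0. A rule that could match backwards across line boundaries would invalidate that cache.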
@Ingramz, I'm sure this limitation helps with performance but it does make writing more complicated rules for certain languages nearly impossible. I think breaking up the source text into "lines" is definitely a good way to go but as someone who writes a grammar for one such language, Fortran, it would be incredibly useful to be able to define what constitutes a line for this purpose (see issue #42). Would giving the grammar authors the ability to override \n as the de-facto line terminating sequence cause significant problems?
@tomedunn I guess there are very few useful grammars that you can write without any complications using this method, so anyone who has attempted to write a TextMate grammar has probably felt your pain (and cursed a lot).
But I agree, it would be useful if you could define the points before which previously parsed input won't change when the part that follows them changes. However, I am not sure how it would affect performance.
@Ingramz thanks for the detailed explanation. I had assumed something along those lines but I wanted to double check first.
Do you reckon being able to correctly highlight inline SQL in strings with the following format is possible with a bit of hacking/trickery?
mystring = """
SELECT *
FROM bla
"""
Currently, the grammar will only highlight the SQL if the string is written as follows:
mystring = """SELECT *
FROM bla
The related grammar is currently defined as:
{
  'begin': '(""")(?=\\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER))'
  'beginCaptures':
    '1':
      'name': 'punctuation.definition.string.begin.python'
  'comment': 'double quoted string'
  'end': '((?<=""")(")""|""")'
  'endCaptures':
    '1':
      'name': 'punctuation.definition.string.end.python'
    '2':
      'name': 'meta.empty-string.double.python'
  'name': 'string.quoted.double.block.sql.python'
  'patterns': [
    {
      'include': '#constant_placeholder'
    }
    {
      'include': '#escaped_char'
    }
    {
      'include': 'source.sql'
    }
  ]
}
\s should match newlines according to standard regex, but this obviously won't work because of what we've discussed. Is there any way to work around this?
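To make the difference concrete, here is a small Python sketch using the standard `re` module rather than Oniguruma. It applies the lookahead from the 'begin' rule once to the whole text and once to each line separately. Splitting with `splitlines(keepends=True)` is only an assumption about how a line-based tokenizer feeds its input.

```python
import re

# The 'begin' pattern from the grammar, written as a Python regex.
BEGIN = re.compile(
    r'(""")(?=\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER))'
)

source = 'mystring = """\nSELECT *\nFROM bla\n"""'

# Against the whole text, \s* skips the newline and the lookahead succeeds.
whole_match = BEGIN.search(source)

# Line by line, the lookahead can only see the rest of the current line,
# so it never reaches the SELECT on the following line.
line_matches = [BEGIN.search(line) for line in source.splitlines(keepends=True)]

# The form that does highlight: the keyword sits on the same line as the quotes.
same_line_match = BEGIN.search('mystring = """SELECT *\n')
```

Here `whole_match` and `same_line_match` are match objects, while every entry in `line_matches` is `None`. The regex itself is fine; the input it is given simply ends at the line break.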
You can do this by adding a nested pattern like this:
{
  'begin': '(""")'
  'beginCaptures':
    '1':
      'name': 'punctuation.definition.string.begin.python'
  'comment': 'double quoted string'
  'end': '((?<=""")(")""|""")'
  'endCaptures':
    '1':
      'name': 'punctuation.definition.string.end.python'
    '2':
      'name': 'meta.empty-string.double.python'
  'name': 'string.quoted.double.block.sql.python'
  'patterns': [
    {
      'begin': '\\G(?=\\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER))'
      'end': '(?=\\s*""")'
      'patterns':[
        {
          'include': 'source.sql'
        }
      ]
    }
    {
      'include': '#constant_placeholder'
    }
    {
      'include': '#escaped_char'
    }
  ]
}
The nested rule for including source.sql will only match if it is the first match in the quoted string.
Thanks for that @tomedunn! The only tweak it needed was to remove the \G, as this seemed to be causing problems. I couldn't find what this even represents in regex, so I was hoping you could explain its purpose? (I only found \g)
I opened a pull request in language-python to apply this fix just fyi. I applied the same concept to single quoted blocks too: https://github.com/atom/language-python/pull/92
@MichaelAquilina, glad you got it working. I probably should have tested it first. The regex \G will work so long as it's on the same line as the encompassing begin match. So it wouldn't work for your second case, where the SQL statement starts on a different line than your begin match. Simply removing the \\G from the rule I posted should work instead, as it seems you've discovered. I have a number of tricks using \G in the language-fortran grammar for Atom.
@MichaelAquilina The \G is an anchor supported by the Oniguruma regular expression engine, which is what Atom/TextMate use to match grammar rules. It's a zero-width anchor that matches at the position where the previous match ended. Source.
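Python's `re` module has no \G, but calling `.match()` on a compiled pattern with a start position anchors the attempt at that position, which is close enough to sketch the behaviour discussed here. The helper names below are hypothetical and the model ignores the 'end' rule entirely.

```python
import re

BEGIN = re.compile(r'"""')
# Stand-in for the nested rule's lookahead, without the \G anchor.
SQL_START = re.compile(
    r'(?=\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER))'
)


def fires_with_anchor(lines):
    # Emulates \G: the nested rule may only match exactly where the
    # opening quotes ended, which is necessarily on the first line.
    begin = BEGIN.search(lines[0])
    if begin is None:
        return False
    return SQL_START.match(lines[0], begin.end()) is not None


def fires_without_anchor(lines):
    # Without \G the nested rule may start anywhere after the opening
    # quotes, including on a later line.
    begin = BEGIN.search(lines[0])
    if begin is None:
        return False
    if SQL_START.search(lines[0], begin.end()):
        return True
    return any(SQL_START.search(line) for line in lines[1:])


same_line = ['mystring = """SELECT *', 'FROM bla', '"""']
next_line = ['mystring = """', 'SELECT *', 'FROM bla', '"""']
```

With the anchor, only `same_line` triggers the SQL rule; without it, both do. The trade-off is that the unanchored rule is no longer restricted to the very start of the string.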
Anyway, I stumbled across this while wracking my head over an issue I still haven't figured out, and can fully confirm that having a way to match patterns over multiple lines would be a HUGE help when authoring language grammars:
I understand the potential to impact performance: isn't there a way of leveraging (?m) to enable multiline parsing?
Brought this up on the atom slack channel and was advised to post this here. Matching text from "begin" (and probably "end" too) currently seems to be impossible across multiple lines.
I noticed this when trying to fix the inline SQL highlighting in multiline python strings after noticing the following was not highlighting correctly:
Looking at the defined grammar, this should technically work, as \s matches newlines in the regex spec.
Adding a \n alongside \s does not work either.
It seems to be a potential design flaw in the original TextMate spec as I found someone stating the following [here]:
Is there any way this could be fixed for atom first-mate?