Support matches across multiple lines

MichaelAquilina commented 9 years ago

Brought this up on the atom slack channel and was advised to post this here. Matching text from "begin" (and probably "end" too) currently seems to be impossible across multiple lines.

I noticed this when trying to fix the inline SQL highlighting in multiline Python strings, after seeing that the following was not highlighted correctly:

SELECT *
FROM bla

Looking at the defined grammar, this should technically work, since \s matches newlines in the regex spec.

Adding a \n alongside \s does not work either.
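
To make the symptom concrete, here is a small Python sketch. It uses Python's `re` module rather than Oniguruma, and `source` is just a made-up sample, so treat it as an illustration under those assumptions: the same lookahead matches when run against the whole text, but not when each line is matched on its own.

```python
import re

# Simplified copy of the grammar's begin rule (Python `re`, not Oniguruma).
BEGIN = re.compile(
    r'(""")(?=\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER))'
)

# Hypothetical sample input: the SQL starts on the line after the quotes.
source = 'mystring = """\nSELECT *\nFROM bla\n"""\n'

# Against the whole text, \s* can cross the newline, so the rule matches.
whole_match = BEGIN.search(source)

# Against one line at a time, the lookahead never sees the next line.
line_matches = [BEGIN.search(line) for line in source.splitlines(keepends=True)]

print(whole_match is not None)                 # whole-text match found?
print([m is not None for m in line_matches])   # per-line results
```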

It seems to be a potential design flaw in the original TextMate spec as I found someone stating the following [here]:

"Bear in mind, however, that because of the way the TextMate parser surveys your document, all regular expressions used in grammar match rules must apply to a single line at a time. A single expression cannot embrace multiple lines. Thus it is possible to write a regular expression that appears to work in Rubular (or TextMate’s regex Find) but will fail as part of a grammar."

Is there any way this could be fixed for atom first-mate?

Ingramz commented 9 years ago

The fact that you can't match newlines in TextMate grammars was a conscious decision. At a very early stage the parser splits the input source text into lines and then matches rules line by line, carrying over only the meta-information about the currently active scopes that come from begin/end.

The main reason for this is performance: most of the time in a text editor you work on one line at a time. If you edit something in the middle of the file, the only parts that have to be processed again with TextMate grammars are the line you are working on and anything that comes after it (think of opening a block comment in C-like languages).

If matching multiple lines were possible, you'd also have to process lines that come before the line where you changed something. In addition, the inputs for regular expression matching would be longer, which takes longer to process, and so on.
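
As a rough illustration of that trade-off, here is a toy Python model. It is not first-mate's actual implementation, and `end_state` and `retokenize` are hypothetical names. The only thing carried between lines is whether a block comment is still open, so an edit is reprocessed from the edited line downward and stops as soon as the carried state matches what was cached.

```python
def end_state(line, in_comment):
    # Return whether a /* ... */ comment is still open after this line,
    # given whether one was open when the line started.
    pos = 0
    while True:
        marker = "*/" if in_comment else "/*"
        found = line.find(marker, pos)
        if found == -1:
            return in_comment
        in_comment = not in_comment
        pos = found + 2


def retokenize(lines, states, edited):
    # states[i] caches the state carried out of line i.
    # Reprocess from the edited line on; return how many lines were visited.
    state = states[edited - 1] if edited > 0 else False
    processed = 0
    for i in range(edited, len(lines)):
        state = end_state(lines[i], state)
        processed += 1
        if states[i] == state:
            break  # carried state unchanged, so later lines are unaffected
        states[i] = state
    return processed


lines = ["int a;", "int b;", "int c;", "int d;"]
states = [False, False, False, False]

# Opening a block comment on line 1 affects every line after it,
# but line 0 is never looked at again.
lines[1] = "int b; /* note"
print(retokenize(lines, states, 1), states)
```

Lines before the edit are never revisited in this model, which is the property that would be lost if a single rule could reach back across line boundaries.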

I hope this explains in enough detail why it was done this way.

tomedunn commented 9 years ago

@Ingramz, I'm sure this limitation helps with performance but it does make writing more complicated rules for certain languages nearly impossible. I think breaking up the source text into "lines" is definitely a good way to go but as someone who writes a grammar for one such language, Fortran, it would be incredibly useful to be able to define what constitutes a line for this purpose (see issue #42). Would giving the grammar authors the ability to override \n as the de-facto line terminating sequence cause significant problems?

Ingramz commented 9 years ago

@tomedunn I guess there are very few useful grammars that you can write without any complications using this method, so anyone who has attempted to write a TextMate grammar has probably felt your pain (and cursed a lot).

But I agree, it would be useful if you could define the points up to which previously parsed input stays unchanged when the part that follows it changes. However, I am not sure how it would affect performance.

MichaelAquilina commented 9 years ago

@Ingramz thanks for the detailed explanation. I had assumed something along those lines but I wanted to double check first.

Do you reckon being able to correctly highlight inline SQL in strings with the following format is possible with a bit of hacking/trickery?

mystring = """
SELECT *
FROM bla
"""

Currently, the grammar only highlights the SQL if the string is written as follows:

mystring = """SELECT *
FROM bla

The related grammar is currently defined as:

      {
        'begin': '(""")(?=\\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER))'
        'beginCaptures':
          '1':
            'name': 'punctuation.definition.string.begin.python'
        'comment': 'double quoted string'
        'end': '((?<=""")(")""|""")'
        'endCaptures':
          '1':
            'name': 'punctuation.definition.string.end.python'
          '2':
            'name': 'meta.empty-string.double.python'
        'name': 'string.quoted.double.block.sql.python'
        'patterns': [
          {
            'include': '#constant_placeholder'
          }
          {
            'include': '#escaped_char'
          }
          {
            'include': 'source.sql'
          }
        ]
      }

\s should match newlines according to standard regex, but this obviously won't work because of what we've discussed. Is there any way to work around this?

tomedunn commented 9 years ago

You can do this by adding a nested pattern like this:

      {
        'begin': '(""")'
        'beginCaptures':
          '1':
            'name': 'punctuation.definition.string.begin.python'
        'comment': 'double quoted string'
        'end': '((?<=""")(")""|""")'
        'endCaptures':
          '1':
            'name': 'punctuation.definition.string.end.python'
          '2':
            'name': 'meta.empty-string.double.python'
        'name': 'string.quoted.double.block.sql.python'
        'patterns': [
          {
            'begin': '\\G(?=\\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER))'
            'end': '(?=\\s*""")'
            'patterns': [
              {
                'include': 'source.sql'
              }
            ]
          }
          {
            'include': '#constant_placeholder'
          }
          {
            'include': '#escaped_char'
          }
        ]
      }

The nested rule for including source.sql will only match if it is the first match in the quoted string.
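
To show why nesting helps at all, here is a toy line-by-line simulation in Python. It is only a sketch under simplifying assumptions: `sql_lines` is a hypothetical helper, it uses Python's `re` instead of Oniguruma, it leaves out the \G anchor, and it ignores escapes and the other included patterns. The point is that the outer rule's "inside a string" state is carried from line to line, so the nested lookahead gets a fresh chance on every line inside the string.

```python
import re

QUOTES = '"""'
# Lookahead part of the nested begin rule, without the \G anchor.
KEYWORD_AHEAD = re.compile(
    r'\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER)'
)


def sql_lines(lines):
    # Return the indices of lines that contain text scoped as SQL.
    in_string = False  # outer begin/end rule is open
    in_sql = False     # nested begin/end rule is open
    active = []
    for number, line in enumerate(lines):
        pos = 0
        if not in_string:
            start = line.find(QUOTES)
            if start == -1:
                continue
            in_string, pos = True, start + len(QUOTES)
        close = line.find(QUOTES, pos)
        body = line[pos:] if close == -1 else line[pos:close]
        if not in_sql and KEYWORD_AHEAD.search(body):
            in_sql = True
        if in_sql and body.strip():
            active.append(number)
        if close != -1:
            in_string = in_sql = False  # closing quotes end both rules
    return active


print(sql_lines(['mystring = """', 'SELECT *', 'FROM bla', '"""']))
print(sql_lines(['mystring = """SELECT *', 'FROM bla', '"""']))
```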

MichaelAquilina commented 9 years ago

Thanks for that @tomedunn! The only tweak it needed was removing the \G, as this seemed to be causing problems. I couldn't find what this even represents in regex, so I was hoping you could explain its purpose? (I only found \g)

I opened a pull request in language-python to apply this fix just fyi. I applied the same concept to single quoted blocks too: https://github.com/atom/language-python/pull/92

tomedunn commented 9 years ago

@MichaelAquilina, glad you got it working. I probably should have tested it first. The regex \G will work so long as it's on the same line as the encompassing begin match, so it wouldn't work for your second case, where the SQL statement starts on a different line than your begin match. Simply removing the \\G from the rule I posted should work instead, as it seems you've discovered. I have a number of tricks using \G in the language-fortran grammar for Atom.
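
A rough way to picture this in Python, as a sketch only: Python's `re` has no \G, so this emulates the anchor with `Pattern.match(string, pos)`, which only matches starting exactly at `pos`. The helper name `sql_follows_begin` is made up for the illustration.

```python
import re

STRING_BEGIN = re.compile(r'"""')
KEYWORD_AHEAD = re.compile(
    r'\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|REPLACE|ALTER)'
)


def sql_follows_begin(line):
    # Emulates \G(?=\s*KEYWORD) on the line that holds the begin match.
    begin = STRING_BEGIN.search(line)
    if begin is None:
        return False
    # match() with a start position stands in for the \G anchor:
    # the keyword has to follow right where the begin match ended.
    return KEYWORD_AHEAD.match(line, begin.end()) is not None


print(sql_follows_begin('mystring = """SELECT *'))  # SQL on the same line
print(sql_follows_begin('mystring = """'))          # SQL starts on the next line
```

In the second call there is nothing after the quotes on that line, so the anchored lookahead has nothing to match, which is the case where dropping the anchor helped.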

Alhadis commented 8 years ago

@MichaelAquilina The \G is an extension to the regular expression syntax added by the Oniguruma engine, which is what Atom/TextMate use to highlight grammars. It's a zero-width anchor that points to the start of whatever was last matched. Source.

Anyway, I stumbled across this while racking my brain over an issue I still haven't figured out, and can fully confirm that having a way to match patterns over multiple lines would be a HUGE help when authoring language grammars.

I understand the potential to impact performance: isn't there a way of leveraging (?m) to enable multiline parsing?

atom / first-mate

Support matches across multiple lines #57