atom / first-mate

TextMate helpers
http://atom.github.io/first-mate
MIT License
91 stars 57 forks source link

Atom hangs on 100% if given a particular expression #70

Open Alhadis opened 8 years ago

Alhadis commented 8 years ago

This is somewhat difficult to explain, so please bear with me...

Last night while working on a grammar, I added a pattern-block that made Atom hang at 100% CPU usage. After force-quitting the app twice, I scrutinised the last expression for anything that might've triggered infinite recursion. I was drawing blanks until I narrowed it down from this...

    ((?:\\s*(?:'[^']*'?|[A-Za-z_∆⍙]+[\\w∆⍙¯]*|\\d+))+)
    \\s*
    ( ALPHA | BRAVO )
    \\s*
    ((?:\\s*(?:'[^']*'?|[A-Za-z_∆⍙]+[\\w∆⍙¯]*|\\d+))+)

... to this:

    ((?:\\s*(?:'[^']*'?|[A-Za-z_∆⍙]+[\\w∆⍙¯]*|\\d+))+)
    \\s*
    ( ALPHA )
    \\s*
    ((?:\\s*(?:'[^']*'?|[A-Za-z_∆⍙]+[\\w∆⍙¯]*|\\d+))+)

All of a sudden, Atom wasn't hanging whenever I switched to the tab that had the grammar's language in it. The patterns themselves didn't match anything. What the patterns were was irrelevant. What's important is that, somehow, this particular capturing group causes Atom to freeze.

I know how good First-Mate is at avoiding infinite loops, which is why this bug is concerning. I've isolated it to a separate branch in my repo for closer examination.

Reproduction

  1. Clone and checkout the repo's bug-branch:
    git clone https://github.com/Alhadis/language-apl.git && cd language-apl
    git checkout fatal-idioms
    apm link -d . && atom -d .
  2. Open tests/life.apl. It should be fine.
  3. Run make in the package's base directory to toggle the problematic capturing group.
  4. Close tests/life.apl and restart Atom
  5. Reopen tests/life.apl

I noticed Lightshow also has troubles with this particular ruleset, although its handling is somewhat less predictable than Atom's.

dead-claudia commented 7 years ago

TL;DR: Regular expression matching is slow, hard, and horribly complex behind the scenes. Oh, and you found a potential malicious regular expression resulting from this complexity.


You have discovered a potential malicious regular expression. That file would be a partial match of all of these, with multiple matches for each (note that (ALPHA|BRAVO) was made optional):

(?x: (?# Word AB Word  )
    ( (?: \\s* [A-Za-z_∆⍙]+[\\w∆⍙¯]* )+ )
    \\s* (ALPHA|BRAVO)? \\s*
    ( (?: \\s* [A-Za-z_∆⍙]+[\\w∆⍙¯]* )+ )
)

(?x: (?# Word AB Ints  )
    ( (?: \\s* [A-Za-z_∆⍙]+[\\w∆⍙¯]* )+ )
    \\s* (ALPHA|BRAVO)? \\s*
    ( (?: \\s* \\d+ )+ )
)

(?x: (?# Ints AB Word  )
    ( (?: \\s* \\d+ )+ )
    \\s* (ALPHA|BRAVO)? \\s*
    ( (?: \\s* [A-Za-z_∆⍙]+[\\w∆⍙¯]* )+ )
)

(?x: (?# Ints AB Ints  )
    ( (?: \\s* \\d+ )+ )
    \\s* (ALPHA|BRAVO)? \\s*
    ( (?: \\s* \\d+ )+ )
)

With each of these, there's a lot of loops involved that are hard to optimize away. Here's a quick diagram of just the last one as a DFA, the simplest by far, and excluding captures:

NFA flow chart for (\\s*\\d+)+\\s*(ALPHA|BRAVO)?\\s*(\\s*\\d+)+

Note that the character pointer advances on every `yes` unless it's on the last one.

   start
     |
     |----------------------------- advance start pointer,
     |                              move pointer to start
    \|/                                     /|\
 whitespace? ------------- yes --------------|
     |                                       |
     no                                      no
     |                                       |
    \|/                                      |
   digit? -------------- no --------------> end? --> yes -> no match
     |                                      /|\
    yes                                      |
     |                                       |
     |-------------------\                   |
    \|/        |         |                   |
   digit? --- yes        |                   |
     |                   |                   |
     no                  |                   |
     |                   |                   |
     |------------\      |                   |
    \|/           |      |                   |
 whitespace? --- yes     |                  /|\
     |                   |                   |
     no                  |                   |
     |                   |                   |
    \|/                  |                   |
   digit? ----- yes -----/                   |
     |                                       |
    \|/                                      |
     A -- no --> B -- yes ---\               |
     |           |           |               |
    yes         \|/         yes              |
     |           |           |              /|\
    \|/          |          \|/              |
     L --------- no -------- R               |
     |           |           |               |
    yes         \|/         yes              |
     |           |           |               |
    \|/          |          \|/              |
     P --------- no -------- A               |
     |           |           |               |
    yes         \|/         yes              |
     |           |           |               |
    \|/          |          \|/             /|\
     H --------- no -------- V               |
     |           |           |               |
    yes         \|/         yes              |
     |           |           |               |
    \|/          |          \|/              |
     A --------- no -------- O               |
     |           |           |               |
    yes         \|/         yes              |
     |           |           |               |
     |-----------------------/               |
     |                                      /|\
     |------------\                          |
    \|/           |                          |
 whitespace? --- yes                         |
     |                                       |
     no                                      |
     |                                       |
    \|/                                      |
   digit? --------------- no ----------------/
     |
    yes
     |
     |-------------------\
    \|/        |         |
   digit? --- yes        |
     |                   |
     no                  |
     |                   |
     |------------\      |
    \|/           |      |
 whitespace? --- yes     |
     |                   |
     no                  |
     |                   |
    \|/                  |
   digit? ----- yes -----/
     |
     no
     |
    \|/
string matched

(Captures only make this more complicated, so that's why I omitted them.)

And for your particular source, just trace it with that flow chart. See how long it takes each time. Also, try tracing it by testing your word regex word? == [A-Za-z_∆⍙]+[\\w∆⍙¯]* instead of digit? == \\d+, or even better, create a graph for just word? (it'll only get more complex).

I'll also note that this is far more optimized than what you'd get out of most engines: they usually just perform basic optimizations and use either on the fly code generation or just generate a tree as an NFA (theoretically equivalent to an DFA), since it takes linear time to create the NFA and polynomial time to match worst case, compared to the exponential time and memory best and worst case to go from the NFA to the DFA, just for linear time matching. (My above example is an NFA.)


Or, long story short, regular expression matching is slow, hard, and horribly complex behind the scenes.

Alhadis commented 7 years ago

First of all, I know damn well GitHub won't make this look as good as it does with my ligature-endowed editor font, but that didn't stop me from sprucing up your diagram with box-drawing glyphs:

   start
     │
     ├───────────────────────────── advance start pointer,
     │                              move pointer to start
     ↓                                       ↑̩
 whitespace? ───────────-> yes -─────────────┤
     │                                       │
    yes                                     nah
     │                                       │
     ↓                                       │
   digit? ────────────-> nah ────────────-> end? ─-> yes -> no match
     │                                       ↑̩
    yes                                      │
     │                                       │
     │<-──────────┬──────┐                   │
     ↓            │      │                   │
   digit? ───--> yes     │                   │
     │                   │                   │
    nah                  │                   │
     │                   │                   │
     ├────────────┐      │                   │
     ↓            │      │                   │
 whitespace? --> yes     │                   ↑̩
     │                   │                   │
    nah                  │                   │
     │                   │                   │
     ↓                   │                   │
   digit? -----> yes ────┘                   │
     │                                       │
     ↓                                       │
     A ── no --> B ── yes ───┐               │
     │           │           │               │
     │           │           │               ↑̩
     ↓           ↓           ↓               │
     L ──────-> nah ──────── R               │
     │           │           │               │
    yes          ↓          yes              │
     │           │           │               │
     ↓           │           ↓               │
     P ──────── nah ──────── A               │
     │           │           │               │
    yes          ↓          yes              │
     │           │           │               │
     ↓           │           ↓               ↑̩
     H ──────── nah ──────── V               │
     │           │           │               │
    yes          ↓          yes              │
     │           │           │               │
     ↓           │           ↓               │
     A ──────── nah ──────── O               │
     │           │           │               │
    yes          │          yes              │
     │           ↓           │               │
     ├───────────────────────┘               │
     │                                       ↑̩
     │<-──────────┐                          │
     ↓            │                          │
 whitespace? --> yes                         │
     │                                       │
    nah                                      │
     │                                       │
     ↓                                       │
   digit? ─────────────── no ────────────────┘
     │
    yes
     │
     │<-───────┬---──────┐
     ↓         │         │
   digit? --> yes        │
     │                   │
    nah                  │
     │                   │
     │<-─────────-┐      │
     ↓            │      │
 whitespace? --> yes     │
     │                   │
    nah                  │
     │                   │
     ↓                   │
   digit? ────-> yes ────┘
     │
    nah
     │
     ↓ 
string matched

Frivolous crap aside, I'm well-aware of ReDoS (and its slightly less dread-inspiring namesake, MS-DOS). I'm afraid your answer doesn't pinpoint exactly which code in First-Mate is at fault, though.

Alhadis commented 7 years ago

Also, I swear wherever first-mate is, @Ingramz is lurking in the shadows with his country's speedy internet. :') I gotta ask, what's your relationship with first-mate? Have you done work on its core code or something?

dead-claudia commented 7 years ago

@Alhadis To clarify, I'm saying that it's likely not first-mate's fault, or at least something they can't easily fix. It's more likely an issue with Oniguruma, the backing regexp implementation.

Ingramz commented 7 years ago

@Alhadis as a maintainer of a grammar package, I encounter a lot of issues regarding to this and some other packages so I follow everything that is going on at certain repositories.

As a person who has general interest in formal languages and parsing, I also appreciate a good automaton from time to time 😄.

My two cents on this issue is that it seems more like an implementation limitation that regex craftsmen and women have to take into account. Textmate grammars generally benefit from shorter-matching expressions however that's not always possible if the language is rather context sensitive. But if that expression is fast enough elsewhere (other regex engines), then it might be an issue on oniguruma's side that is worth looking into.

You might be able to work around that by matching the general case first, effectively limiting length of the input and then working on the shorter case with greater confidence, but that's what first-mate somewhat does already on the lines.