commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

Underscores inside of emphasis. #51

Closed neongreen closed 9 years ago

neongreen commented 10 years ago

Why is it impossible to emphasise strings beginning or ending with an underscore? I'm using this input for testing:

blah *_*

blah *_x*

blah *x_*

blah *__*

Here is the comparison. Most implementations handle all 4 cases correctly (as it seems to me), but pandoc (in strict mode) and stmd produce this:

<p>blah *_*</p>
<p>blah *_x*</p>
<p>blah *x_*</p>
<p>blah *__*</p>

Have I misunderstood the spec, or is it a bug?

palant commented 10 years ago

The rules under http://jgm.github.io/stmd/spec.html#can-open-emphasis seem pretty clear; there is no reason why *_x* shouldn't be emphasis, so this looks like a bug in the reference library.

neongreen commented 10 years ago

The C implementation handles all cases but the 3rd incorrectly as well:

<p>blah *_*</p>
<p>blah *_x*</p>
<p>blah <em>x_</em></p>
<p>blah *__*</p>

jgm commented 10 years ago

This is a great case. I think this bug is a result of an optimization the parsers are doing to avoid backtracking. They hit the _ and say, okay, let's try open emphasis. They parse forward looking for a closing _. When one isn't found, instead of backtracking to the position after the opening _, they emit the whole string of inlines parsed -- including the closing *! But this has the result that the closing * is never recognized as a closer.

So, the proper optimization requires more thought. Thank you.
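The failure mode described above can be reproduced with a toy parser (a Python sketch for illustration only, not the actual stmd code; it ignores doubled delimiters and flanking rules, and even allows empty emphasis). The buggy branch keeps everything it scanned while hunting for a closer, including the outer closer, whereas the `backtrack` branch rewinds to just after the unmatched opener:

```python
def parse_inlines(text, i=0, closer=None, backtrack=False):
    """Parse inlines until `closer` (a delimiter char) or end of input.
    Returns (html, next_index, found_closer)."""
    out = []
    while i < len(text):
        c = text[i]
        if c == closer:
            return "".join(out), i + 1, True
        if c in "*_":
            inner, j, ok = parse_inlines(text, i + 1, c, backtrack)
            if ok:
                out.append("<em>" + inner + "</em>")
                i = j
            elif backtrack:
                # Correct (but potentially quadratic): rewind to just
                # after the unmatched opener and treat it as text.
                out.append(c)
                i += 1
            else:
                # BUG: keep the inlines scanned while looking for the
                # closer -- they swallow the '*' that should have
                # closed the outer emphasis in "*_x*".
                out.append(c + inner)
                i = j
        else:
            out.append(c)
            i += 1
    return "".join(out), i, False

def render(text, backtrack=False):
    return parse_inlines(text, backtrack=backtrack)[0]
```

With the bug, `render("blah *_x*")` returns the input unchanged; with `backtrack=True` it produces `blah <em>_x</em>`. The quadratic worst case of the naive backtracking is what motivates the memoization discussed further down.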

neongreen commented 10 years ago

@jgm I actually caught it while implementing an emphasis parser in Haskell – I was testing my implementation, found that it failed on *_* *_x* *x_* *__*, and was pretty surprised to see that stmd was failing on these 4 in the same way.

Here is a parsing strategy I think might work:

Let's maintain a stack of opening patterns and add sequentially parsed inlines on the top of the stack; additionally, for each opener we store the set of patterns which can close either it or any of the preceding openers.

type Opener = ('*' | '_',  1 | 2 | 3)
type Closer = ('*' | '_',  1 | 2)
type StackElem = (Opener, Set Closer, Inlines)
type Stack = [StackElem]

E.g. here is the state of the stack after parsing *foo bar **baz:

stack = («**», {«**», «*»}, "baz")  :  («*», {«*»}, "foo bar")  :  []

After stumbling on something which looks like a possible closer, we first check it against the set of possible closers stored in the top element of the stack. If it is in the set, we can start matching it against each opener in the stack. An opener which doesn't match the current closer can be safely converted into an inline, so the algorithm should be O(N). We're also guaranteed to find a match eventually, thanks to the preliminary set check (which has to be O(1) with regard to the number of previous openers, because otherwise something like _a* _a* _a* _a* _a* ... might take O(N²) time due to each * being checked against all _s).
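A minimal sketch of this bookkeeping in Python, restricted to single-character `*`/`_` delimiters and ignoring the flanking rules and `**`/`***` runs (so it is not CommonMark-correct, just an illustration of the opener stack with the closer-set check):

```python
def parse(text):
    # Each stack entry: (opener_char, closers, inlines_below), where
    # `closers` is the set of delimiters that can close this opener
    # or any opener beneath it.
    stack = []
    top = ""          # inlines accumulated above the topmost opener
    for c in text:
        if c in "*_":
            closers = stack[-1][1] if stack else set()
            if c in closers:
                # c closes the nearest matching opener; openers above
                # it did not match and are flattened back to text.
                while stack:
                    opener, _, below = stack.pop()
                    if opener == c:
                        top = below + "<em>" + top + "</em>"
                        break
                    top = below + opener + top
            else:
                stack.append((c, closers | {c}, top))
                top = ""
        else:
            top += c
    # Leftover openers never matched: they become literal text.
    while stack:
        opener, _, below = stack.pop()
        top = below + opener + top
    return top
```

Each delimiter is pushed and popped at most once, so the whole pass is linear in the input length.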

Unfortunately, this seems to be further complicated by the presence of ***s; however, it's hard to say what amendments would be needed, because the spec is currently not very clear about them. For instance, why is ***foo** (example 323) not parsed as *<strong>foo</strong>? It seems to depend on the “*** or ___ have to be closed twice or not closed at all” rule, but it's not in the spec.

Frankly, devising an efficient parsing strategy is much easier when you're also in control of the corner cases of the specification the parsing strategy has to adhere to. I believe that if examples 321 (***foo*) and 323 (***foo**) are specified to be parsed as **<em>foo</em> and *<strong>foo</strong> respectively, then a) the correct strategy would be easier to implement, and b) the rules would be more intuitively understandable.

jgm commented 10 years ago

I agree that example 323 is wrong. So, that's another problem with nested strong/emph parsing (which has been devilishly difficult to get right without backtracking).

I was thinking that, at least in the js parser, it would be simple to add memoization to the InlineParser object. Then the parsing algorithms could use more straightforward backtracking, without the slowdown this would usually cause. It would work this way: each time we parse an inline, we store it in a table in the InlineParser object, together with its start and end location in the subject. parseInline can then simply check this table before doing anything else; if it finds something that starts at the current location, it returns that inline and adjusts the position to the end location we stored. So, although there would be backtracking, it would not cost much because the work done parsing could be reused.

This could be done in the C parser, too, though it would be a bit more complex to implement.
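The memo table jgm describes could look roughly like this (a hypothetical Python sketch, not the actual js parser; `parse_inline_uncached`, standing in for the real inline parsers, is an invented name):

```python
class InlineParser:
    def __init__(self, subject):
        self.subject = subject
        self.pos = 0
        self.memo = {}  # start position -> (inline, end position)

    def parse_inline(self):
        """Check the memo table first; on a hit, return the stored
        inline and jump to its recorded end position."""
        start = self.pos
        if start in self.memo:
            inline, end = self.memo[start]
            self.pos = end
            return inline
        inline = self.parse_inline_uncached()
        self.memo[start] = (inline, self.pos)
        return inline

    def parse_inline_uncached(self):
        # Stand-in for the real inline parsers: consume one character
        # as a text inline.
        c = self.subject[self.pos]
        self.pos += 1
        return ("text", c)
```

After a failed emphasis attempt, the parser can reset `pos` to just past the opener and re-parse; every inline scanned during the failed attempt is then served from `memo` instead of being parsed again, which is what keeps backtracking cheap (essentially packrat parsing).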

jgm commented 10 years ago

I'm having some success with the memoization approach in the memoize branch. So far, I've only converted emphasis parsing to the new approach, but it seems to work well, and the logic is certainly easier to follow. Also, ***hi** is parsed in the way the spec says it should be (as *<strong>hi</strong>). More later.

jgm commented 10 years ago

I have now redone all the inline parsers in the memoize branch. Unfortunately, performance is way down. I need to look into this.

jgm commented 10 years ago

It turns out the performance change had nothing to do with the memoizing. It had to do with a change I made in parsing of ordinary strings. When parseString is changed to allow runs of characters including spaces, performance shoots back up. I had to make this change to handle the two-spaces-at-line-end-makes-linebreak rule. Previously parseNewline could inspect the previously parsed inlines and see if there were two spaces, then edit these inlines to add a line break instead. Now that the inline parsers simply return an inline, instead of adding to a list, that's no longer possible. It may be possible to revert to the original behavior, or maybe there's another solution.
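One way the fast run-matching and the two-spaces rule can coexist, sketched in Python (the regex and structure are illustrative, not the actual reMain code): let the run regex swallow spaces but stop at newlines, then inspect the tail of the matched run when a newline follows, so no previously emitted inline ever needs to be edited.

```python
import re

# A run of ordinary characters, spaces included, stopping at newlines
# and (simplified) emphasis delimiters.
re_main = re.compile(r"[^\n*_]+")

def parse_inlines(subject):
    inlines, pos = [], 0
    while pos < len(subject):
        m = re_main.match(subject, pos)
        if m:
            run = m.group()
            pos = m.end()
            if pos < len(subject) and subject[pos] == "\n" and run.endswith("  "):
                # Two spaces at end of line: emit a hard line break,
                # trimming the trailing spaces from the string inline.
                inlines.append(("str", run.rstrip(" ")))
                inlines.append(("linebreak",))
                pos += 1
            else:
                inlines.append(("str", run))
        elif subject[pos] == "\n":
            inlines.append(("softbreak",))
            pos += 1
        else:
            # Delimiter characters, passed through as text in this sketch.
            inlines.append(("str", subject[pos]))
            pos += 1
    return inlines
```

Because the decision is made from the run just matched, each inline parser can still simply return an inline rather than mutating the list of previously parsed ones.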

jgm commented 10 years ago

With a slight tweak to reMain (7eeae755954191517829d29bd69132ed69fe42b1), we're back up to the old, good performance.

Knagis commented 10 years ago

I did implement something like @ArtyomKazak described: a stack of potential emphasis openers that I review each time I get a potential closer. Since the C# implementation was originally ported line-by-line from the stmd C version, you might find this fix helpful. Unless the input data is very specifically targeted, it does not incur any performance hit. The nice thing is that I only had to change the handle_emph_strong method (plus a small change in parse_inlines_while and two fields on subject); the rest remained the same. Edit: there is another commit after the one above where I changed the stack to a queue; see the next post for the test example.

Knagis commented 9 years ago

@jgm - I had written here a description (as you asked in the response to #127) of the algorithm I used for emphasis parsing, which could be implemented at least in the C version. But that comment is gone for some reason (I would not have deleted it myself). Would it help you if I wrote it again?

jgm commented 9 years ago

@Knagis - I just noticed your message. I don't know what became of the comment you say you wrote. I don't recall ever seeing it. I would still be interested in a description of the algorithm. Perhaps it is better than the approach I'm currently taking in the js implementation.

Knagis commented 9 years ago

@jgm - I decided that it would be easier to write code, so I submitted pull request #153 (against the newemphasis branch, since I can't run the tests in my current environment).