Links with underscores are parsed as emphasized text

HuemorDave commented 10 years ago

I have noticed that the Markdown parser gives incorrect results with underscores in both link titles and URLs. Example:

[http://example.com/a_link.html](http://example.com/a_link.html)

On most Markdown parsers (including Github's), you get:

http://example.com/a_link.html

Instead, evilstreak/markdown-js produces a parse tree roughly (Ignore the auto linking) equivalent to:

[http://example.com/a link.html](http://example.com/a link.html)

i.e. the range between the underscores is parsed as an em first, before the parser looks for links. The result being that the link doesn't actually become a link. I've told the client to escape the underscores as a workaround but I believe this is a parsing issue.

For the record I am using the latest version of the master branch compiled today, and I verified the issue by directly invoking the parser in a JS console. (e.g. without any of the filtering our front-end code does) Is anyone else able to reproduce this?

bteplygin commented 9 years ago

I have the same problem, within a single link.

todb-r7 commented 8 years ago

Sorry if I'm out of line, but is there a :+1: mechanism for bugs like this? Just ran into it. I'll take a run at fixing it myself, but in case I fail, wanted to make sure it got bumped.

In any event, a string_like_this should not be parsed out as a string<em>like</em>this. Ever. You can see the correct behavior in this very issue comment, when you compare a string_like_this vs a string like this.

todb-r7 commented 8 years ago

Alright, I have no idea how to fix this proper. If I were to be hacky about it, I'd start around here:

https://github.com/evilstreak/markdown-js/blob/9b8aa6598fcc8104ce4819f1190b2e2967bc7d2a/src/dialects/gruber.js#L541

and have a regex test where this_string is treated the same as this \_string

Essentially, it seems like you need a rule similar the the escaping rule, but taking into account any leading character that's non-whitespace. IOW, these should not get formatted: thisexample , _this_example, or_this_example.

In all cases, there's an underscore that's immediately preceded by an alphanum.

This is solved in other implementations, like GitHub flavored markdown (as you can see above).

mehaase commented 8 years ago

@todb-r7 I think :+1: is the only mechanism for voting on GitHub issues, unfortunately. Anyway, yes I noticed the same issue recently. Finally started debugging it today and that led me here.

This is one of the most insane code bases I've ever looked at, but I surmise that changing the behavior of underscores inside words is going to be very difficult, since the current implementation doesn't distinguish "start underscore" from "end underscore" in the same way that it distinguishes [ from ].

However, I did manage to fix the issue of parsing a link with underscores. Change the following line in DialectHelpers.inline_until_char() from

  var res = this.dialect.inline.__oneElement__.call(this, text.substr( consumed ) );

to

  var maxInline = text.indexOf(want, consumed);
  var inlineEligible;

  if (maxInline === -1) {
    inlineEligible = text.substr(consumed);
  } else {
    inlineEligible = text.substr(consumed, maxInline);
  }

  var res = this.dialect.inline.__oneElement__.call(this, inlineEligible);

The principle here is that the link parser will see a link like [foo_bar](/baz_bat.html) and it will call DialectHelpers.inline_until_char("foo_bar](/baz_bat.html)", "]", meaning that it should try to parse any inline elements in the text prior to ]. Unfortunately, the implementation ignores the terminating character ] and will return any inline elements through the end of the string, resulting in a parse tree where bar](/baz is inside an em node, the ] is never matched, and so it never produces any a node.

The fix is to not let this.dialect.inline__oneElement__ go past the ] character. This works in my limited testing, but I don't have the know-how or interest to run the full battery of unit tests. This project appears to be dead, so I am posting this in case it helps somebody else.

evilstreak / markdown-js

Links with underscores are parsed as emphasized text #198