evilstreak / markdown-js

A Markdown parser for javascript
7.69k stars 863 forks source link

parentheses in links cannot be parsed by regex #248

Open Pomax opened 8 years ago

Pomax commented 8 years ago

I was looking for a new markdown library, since chjj/marked is built entirely using regex, which means it completely fails at parsing URLs with parentheses in it (wikipedia's full of those, for instance), but looking at https://github.com/evilstreak/markdown-js/blob/master/src/parser.js#L179-202 it seems like this code is also using regex for inline pattern extraction.

If that is the case, then this parser also can't deal with URLs like https://en.wikipedia.org/wiki/Set_(mathematics) (github's parser gets URIs like these right), unless people manually replace the parentheses with %.. values (which is not a realistic solution). And as long as regex are used, there is no solution to that problem, so this might be something that needs to be put in the README.md as a "warning: there are limitations to this parser. [...]" so that people don't suddenly run into this problem but can plan around it.

1kastner commented 8 years ago

I am not sure whether I am not just getting your point but can't we find a regex to match them all? Instead of just catching alphanumeric symbols we could decide to match all symbols expect spaces.

You can just play here: https://regex101.com/#javascript and I found this one (http[\S]*) as a short and simple solution.

Pomax commented 8 years ago

the problem is catching the parenthesis. For instance:

[text1](http://moo.com/this has (a link)) and this is text ) with a ) parens
[text2](http://moo.com/this has (a (reasonable) link)) and this is text ) with a ) parens

This should match http://moo.com/this has (a link) as first URI, and http://moo.com/this has (a (reasonable) link) as second URI, but those paired parenthesis are a problem for regex, as they can't do nested pair matching (due to how regex works).

Writing a pattern so that any pairs (\([...]+\))? are matched would work, but ( and ) can also occur on their own in the URL, and now you have a properly hard problem.

It's definitely worth adding that simple pattern match, but the README.md would definitely need a booknote on how link parsing is guaranteed to fail for certain links, and how to mitigate that (using %... codes for the ( and ) characters if you have a truly problematic URL with either nested parens or only a single parens without its matching counterpart)

1kastner commented 8 years ago

Ok, this explains it, you can't count with regex ;-) Well, I am just a contributing user as well, so maybe just fork it and put a pull request for the repository administrators?

Pomax commented 8 years ago

yeah, thinking about that.