fletcher / MultiMarkdown-4

This project is now deprecated. Please use MultiMarkdown-6 instead!
https://github.com/fletcher/MultiMarkdown-5

japanese comma "、" breaks the parser #101

Closed Oleksiy-Yakovenko closed 9 years ago

Oleksiy-Yakovenko commented 9 years ago

example:

__aaa__、 __bbb__、 __ccc__

gives this:

<p><strong>aaa__、 </strong>bbb__、 <strong>ccc</strong>

while with regular commas:

<strong>aaa</strong>, <strong>bbb</strong>, <strong>ccc</strong></p>

GitHub's markdown parser handles this well (look at the generated HTML):

aaabbbccc

I tried adding the character to the punctuation list in parser.leg, but that doesn't solve anything.

Please help.

Oleksiy-Yakovenko commented 9 years ago

The ideal solution would be to not require a space after the "、": Japanese punctuation rules don't seem to call for one, and Japanese text usually contains no spaces.

fletcher commented 9 years ago

Use "**" instead of "__". The "__" syntax doesn't work in the middle of words (on purpose), hence the need for whitespace/punctuation. The "**" syntax doesn't require whitespace/punctuation.

F-
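
The rule described above can be illustrated with a toy model. This is a minimal sketch in Python, not MultiMarkdown's actual grammar: `render_strong`, `can_close`, and the ASCII-only punctuation set are all hypothetical, and the sketch only models the closing-delimiter check. It assumes a closing "__" counts only when followed by end of text, whitespace, or a known punctuation character, while "**" may close anywhere.

```python
import string

# Assumption for illustration: the parser only knows ASCII punctuation.
ASCII_PUNCT = frozenset(string.punctuation)


def can_close(text, end, punct):
    """True if `end` (the index just past a closing delimiter) is a word boundary."""
    return end == len(text) or text[end].isspace() or text[end] in punct


def render_strong(text, delim, needs_boundary, punct=ASCII_PUNCT):
    """Wrap delim...delim spans in <strong>; optionally require a boundary after the close."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith(delim, i):
            # Scan forward for the first closing delimiter that is allowed to close.
            j = text.find(delim, i + len(delim))
            while j != -1:
                end = j + len(delim)
                if not needs_boundary or can_close(text, end, punct):
                    break
                j = text.find(delim, end)
            if j != -1:
                out.append("<strong>" + text[i + len(delim):j] + "</strong>")
                i = j + len(delim)
                continue
        # No usable closing delimiter: emit the character literally.
        out.append(text[i])
        i += 1
    return "".join(out)


# "__" followed by "、" is not seen as a boundary, so the span runs on.
print(render_strong("__aaa__、 __bbb__", "__", True))
# -> <strong>aaa__、 __bbb</strong>

# A regular comma is in the punctuation set, so each span closes as expected.
print(render_strong("__aaa__, __bbb__", "__", True))
# -> <strong>aaa</strong>, <strong>bbb</strong>

# "**" needs no boundary, so the Japanese comma is harmless.
print(render_strong("**aaa**、 **bbb**", "**", False))
# -> <strong>aaa</strong>、 <strong>bbb</strong>
```

The toy output for the first case differs in detail from the HTML reported at the top of this issue, but the failure has the same shape: the closing "__" before "、" is rejected, and a later "__" is consumed instead.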

Oleksiy-Yakovenko commented 9 years ago

Thanks for the info. I'll try that.

Oleksiy-Yakovenko commented 9 years ago

I tried the following:

  1. Find and replace all __ with **. Obviously this didn't work quite right, especially in code blocks.
  2. Change the mmd parser to treat _ the same way as * mid-sentence. This didn't work either, because many places in the text use "_"s mid-word, which then turn into unwanted emphasis tags.
  3. Modify the mmd parser like this: http://pastebin.com/h6BzExNu -- this works very well, but I'm not sure whether NonPunctuation covers all alphanumerics while excluding everything else.

The purpose of the Alphanumeric change is to exclude '。' and '、' from it.
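
The classification question raised here can be explored outside the parser. The following is a rough sketch in Python, not the pastebin patch and not the parser.leg grammar: `is_punctuation` and `is_alphanumeric` are hypothetical helper names, and the choice to classify by Unicode general category is an assumption made for illustration. It uses the standard `unicodedata` module to check which characters would land on which side of such a split.

```python
import unicodedata


def is_punctuation(ch):
    """True for any Unicode punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return unicodedata.category(ch).startswith("P")


def is_alphanumeric(ch):
    """True for Unicode letters (L*) and numbers (N*), which excludes punctuation."""
    return unicodedata.category(ch)[0] in ("L", "N")


# Print the general category and both classifications for a few sample characters.
for ch in "a9日、。,":
    print(repr(ch), unicodedata.category(ch), is_alphanumeric(ch), is_punctuation(ch))
```

Under this scheme both '、' and '。' fall in category Po (other punctuation), so they are excluded from the alphanumeric class, while a kanji such as '日' is category Lo and stays inside it. Whether a grammar-level NonPunctuation rule draws the same line depends on how it is written.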

This solution is better than nothing. There are other issues, but they can most likely be fixed one by one.

Any chance that this fix can be applied upstream? Maybe with a cmdline option? (I can make a proper pull request if needed)

fletcher commented 9 years ago

Your fix wasn't quite right, but I looked into it. Thanks!

Oleksiy-Yakovenko commented 9 years ago

Thanks a lot!