Handling non-ASCII whitespace.

elland commented 10 years ago

Whitespace is defined in the parser as either ` ` or `\t`, forgoing a large list of non-ASCII whitespace characters.

I propose to append the missing whitespace characters to the spacechar definition here.

fletcher commented 10 years ago

Are there other Markdown implementations that do this?

elland commented 10 years ago

We do full unicode multi-language support on iA Writer and Writer Pro and we integrated MultiMarkdown recently.

fletcher commented 10 years ago

My question is whether any other Markdown implementations support all of these whitespace characters.

Adding them to the whitespace expression would result in these characters being stripped from the document, which is probably not what one wants when using a non-breaking space, for example.
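
To make the concern above concrete, here is a minimal Python sketch. It is hypothetical and is not MultiMarkdown's code: the function name `collapse_all_whitespace` is made up for illustration, and it simply assumes a processor that treats every Unicode whitespace character as ordinary, collapsible whitespace.

```python
# Hypothetical normalizer (not MultiMarkdown's code): every character that
# Python considers whitespace is treated as a collapsible separator.
def collapse_all_whitespace(text: str) -> str:
    # str.split() with no arguments splits on runs of Unicode whitespace,
    # which includes U+00A0 (no-break space) and U+3000 (ideographic space).
    return " ".join(text.split())

source = "10\u00a0km"  # the author deliberately typed a no-break space
result = collapse_all_whitespace(source)
print(repr(result))    # the no-break space is gone from the output
```

In this sketch the no-break space is silently replaced by a regular space, so the text can now wrap between the number and its unit, which is the kind of side effect being described.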

No one has ever asked for this (to my knowledge) in the years that peg-markdown, then peg-multimarkdown, and now MultiMarkdown-4, have been available. And perhaps I missed it, but I'm not aware of any similar requests going to the markdown discussion lists over the years.

Under what real life situations do you envision this coming up?

elland commented 10 years ago

Oh, I see. We noticed this issue when parsing Japanese text, for instance:

　山本

Wouldn't parse as a header, because there's no ` ` (regular whitespace) character, only `　` (ideographic whitespace). As for this never being requested, apparently Markdown hasn't been 'big in Japan', as far as we can tell.

:+1:

fletcher commented 10 years ago

Can you verify that what you entered is correct? When I copy your example and parse it with MMD, it works fine.

elland commented 10 years ago

@fletcher Sorry I didn't get back to you earlier, I was on holiday. Turns out I mixed up the issues. It's not that it doesn't render the heading properly, but rather that it doesn't strip the double-byte whitespace character.

[Screenshot: screen shot 2014-07-22 at 10 42 30]

Here you can see the double-byte space still as part of the resulting HTML.

[Screenshot: screen shot 2014-07-22 at 10 46 26]

But when using a Latin script, the single-byte whitespace is stripped.

As you mentioned earlier, adding the ideographic, double-byte, whitespace to that list would strip it elsewhere in the document, definitely not the wanted behaviour.

Is there some way to strip the leading double-byte whitespace there, as well as for lists etc, but not everywhere in the document? If it's not possible, we're fine with trying an iA Writer-only workaround, but we'd be happier if we could stick to standard MMD instead of diverging from it.

fletcher commented 10 years ago

I don't know that inconsistently treating these characters as whitespace at times and as characters at other times is a good solution. It is likely to lead to unintended consequences down the road.

If the trouble is an unwanted space at the beginning of a header, then simply use a regular space or don't use a space at all.

Lists would be a different story, as the space character is required in that situation.

But again, I repeat my question:

To my knowledge no other Markdown implementation has been coded to recognize the distinction between "regular" whitespace and "double-byte" whitespace. Looking at babelmark, only 2 implementations appear to handle this "properly" by stripping out the leading space, but presumably they do the same in other places where it might not be desired and may be "wrong". (I am not even remotely an expert on Japanese, so have no idea what is right and wrong...) Surely all the Japanese users of Markdown over the years have not been avoiding lists because they couldn't type a space character. As someone who doesn't speak Japanese, Chinese, or Korean, and cannot reliably tell the scripts apart visually, all I know is that there are many users of (at least some of) these languages using Markdown and MultiMarkdown because they post on Twitter and the web about it.

This would not seem to be a MultiMarkdown question/issue, but rather a general Markdown issue. I still don't understand why this is just coming up now, and has never (to my knowledge) come up for the Markdown world in general. What did these users do when using other forms of Markdown besides MultiMarkdown?

It seems that discussing it on the Markdown discussion list would be better than this support forum to get a wider variety of input before making a change that might be worse than the current problem.

fletcher commented 10 years ago

(But to speak to the technical side of your question -- yes, it would be possible (and relatively easy) to define two distinct types of whitespace, and to use one at the beginning of headings, list items, blockquotes, etc. I'm just not sure if it's a good idea, or trading one problem for another.)
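
A rough sketch of the "two distinct types of whitespace" idea, in Python rather than MultiMarkdown's actual grammar. Everything here is an assumption made for illustration: the names `INLINE_SPACE`, `MARKER_SPACE`, and `parse_atx_heading` are hypothetical, and only the ideographic space (U+3000) is added to the wider class.

```python
# Hypothetical sketch, not MultiMarkdown's parser.
INLINE_SPACE = " \t"                    # the narrow class: space and tab only
MARKER_SPACE = INLINE_SPACE + "\u3000"  # wider class, used only right after a block marker

def parse_atx_heading(line: str):
    """Return (level, text) for a '#'-style heading line, or None."""
    hashes = len(line) - len(line.lstrip("#"))
    if not 1 <= hashes <= 6:
        return None
    rest = line[hashes:]
    # The wider class is stripped only at the start of the heading text;
    # the narrow class is used for trailing whitespace.
    text = rest.lstrip(MARKER_SPACE).rstrip(INLINE_SPACE)
    return hashes, text

print(parse_atx_heading("#\u3000山本"))         # leading ideographic space is dropped
print(parse_atx_heading("# 山本\u3000太郎"))    # the one inside the text is kept
```

The point of the split is that an ideographic space inside the heading text survives untouched, while one directly after the marker is consumed like a regular space. Whether that inconsistency is acceptable is the open question in this thread.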

elland commented 10 years ago

> it would be possible (and relatively easy) to define two distinct types of whitespace, and to use one at the beginning of headings, list items, blockquotes.

What I don't follow is why it's not like that to begin with, and why all whitespace is replaced. I don't see why ideographic spaces are not counted as markup delimiters the same way \t and \s are.

Also, how would it be worse?

fletcher commented 10 years ago

That's like asking why everyone doesn't just stick with "regular" whitespace characters.


On Jul 23, 2014, at 10:03 AM, Igor Ranieri Elland notifications@github.com wrote:

> > it would be possible (and relatively easy) to define two distinct types of whitespace, and to use one at the beginning of headings, list items, blockquotes.
>
> What I don't follow is why it's not like that to begin with, and why all whitespace is replaced. I don't see why ideographic spaces are not counted as markup delimiters the same way \t and \s are.
>
> Also, how would it be worse?


elland commented 10 years ago

Pardon?

fletcher commented 10 years ago

As you probably know, "back in the day" there was only ASCII. There was ` `, `\t`, `\n`, and `\r`, and that was about it for whitespace. Markdown was originally written with support for those whitespace characters. Markdown, and the initial Markdown derivatives (e.g. PHP Markdown and MultiMarkdown) were written 9+ years ago. Unicode/UTF-8/UTF-16/UTF-1024/whatever weren't a big deal then. At the time, I believe the Perl `\s` regular expression would only match those basic whitespace character types (the original Markdown is written in Perl). Markdown and the other early derivatives were written by people who primarily spoke English (and a bit of French). I don't think anyone involved was thinking about supporting fifteen different kinds of whitespace characters. Non-breaking space meant `&nbsp;`, not a special Unicode character -- this would clearly not be used at the beginning of a header or list item.
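
The difference between an ASCII-only and a Unicode-aware `\s` can be seen in Python's `re` module. This is only an analogy and says nothing definitive about how Perl behaved at the time: for `str` patterns, `re.ASCII` restricts `\s` to `[ \t\n\r\f\v]`, while the default also matches Unicode whitespace.

```python
import re

ascii_ws = re.compile(r"\s", re.ASCII)  # \s limited to [ \t\n\r\f\v]
unicode_ws = re.compile(r"\s")          # \s also covers Unicode whitespace

samples = [
    ("space", " "),
    ("tab", "\t"),
    ("no-break space", "\u00a0"),
    ("ideographic space", "\u3000"),
]

for name, ch in samples:
    in_ascii = ascii_ws.match(ch) is not None
    in_unicode = unicode_ws.match(ch) is not None
    print(f"{name}: ascii={in_ascii} unicode={in_unicode}")
```

A parser built around the ASCII-only class never sees the ideographic space as whitespace at all, which matches the behaviour reported earlier in this thread.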

I get that there are languages that use 15 kinds of white space. That's fine. However, this was not built into the original design of Markdown, or its "descendants." It's not clear to me what the best approach to reconcile this is.

As I've said, however, this is not a MultiMarkdown-specific issue. This is an issue that relates to all flavors of Markdown. It's a worthwhile discussion to have, just not here.

If this is an important issue to you, I encourage you to bring it up on the Markdown discussion list (not the MultiMarkdown discussion list) and get a wider range of input. There may be downstream consequences from any proposed solution, and it would be best to figure that out in advance.

The fact that this issue (to my knowledge) has not come up before suggests that other people already have a solution (e.g. using regular space characters in Markdown documents?), but perhaps not. Even so, there may be a better solution that can be implemented.

This is not the place to hash that out.

(Alternatively, since MultiMarkdown is open source, you can also skip the discussions and modify the source to do whatever you and your users want. Just be aware that this could break documents in other ways. If it's important enough to change, I would propose it's important enough to change properly.)

fletcher / MultiMarkdown-4

Handling non-ASCII whitespace. #79
