fletcher / peg-multimarkdown

An implementation of MultiMarkdown in C, using a PEG grammar - a fork of jgm's peg-markdown. No longer under active development - see MMD 5.

byte-order marker in UTF-8 files should be ignored #63

Closed jkallay closed 13 years ago

jkallay commented 13 years ago

Currently MMD header parsing fails when there's a byte-order marker at the beginning of a UTF-8 text file. It is valid to have a BOM as an indicator that the file is in Unicode, and this is what Google Docs outputs, for example.
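To make the report concrete: in UTF-8 the BOM is the code point U+FEFF encoded as the three bytes EF BB BF, so a parser that expects a metadata key at byte 0 sees those bytes instead. The following is only a minimal sketch of how a reader could detect the marker before parsing. The helper name `utf8_bom_length` is hypothetical and is not taken from the peg-multimarkdown source.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+FEFF (the byte-order mark): EF BB BF. */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Hypothetical helper (not part of peg-multimarkdown):
 * returns how many leading bytes of buf should be skipped,
 * i.e. 3 if the buffer starts with a UTF-8 BOM, otherwise 0.
 */
size_t utf8_bom_length(const char *buf, size_t len)
{
    if (len >= sizeof UTF8_BOM &&
        memcmp(buf, UTF8_BOM, sizeof UTF8_BOM) == 0) {
        return sizeof UTF8_BOM;
    }
    return 0;
}
```

A caller would then start parsing at `buf + utf8_bom_length(buf, len)`, so files with and without the marker are handled the same way.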

fletcher commented 13 years ago

I am not an expert in Unicode, but by my understanding a BOM is meaningless in UTF-8, and is rarely used.

MMD has never checked for a BOM at the beginning of the file, and this is the first time it has come up in 6-7 years. I'll leave this open, but it's not on the top of my list of things to worry about. In fact, Markdown.pl doesn't handle this either.

If someone is concerned about this and patches the source, I'm perfectly willing to merge the changes. If John MacFarlane changes his code to handle this, then it would get pulled into this project as well.

jkallay commented 13 years ago

Windows Notepad adds the BOM when saving as UTF-8, and, as mentioned, so does Google Docs. Googlecl (the Google command-line tool) has a configuration setting to strip the BOM but it doesn't seem to work.

I don't consider the fact that an issue has not been reported in the past to be a reliable indicator of its severity, but if you are comfortable with knowing that anyone using Google Docs to create MMD files will not be able to use metadata without manually stripping the BOM, so be it.

fletcher commented 13 years ago

It's not so much a question of severity, but of frequency. This is the first time I've had someone report this issue. It looks like someone mentioned this on the Markdown list in 2007, and in 2008 PHP Markdown apparently added a fix. Markdown.pl itself still doesn't do anything special with a BOM. I'm not sure about other implementations.

Windows Notepad is hardly an example of a program that "does the right thing", and there are plenty of other editors that don't insert an unnecessary BOM into a UTF-8 file. Google Docs has its own issues as well.

I'm simply trying to be realistic - it's not the first thing on my list to fix if it's caused a problem big enough for one person in the last 6 years to bring it to my attention. Especially if, IMHO, the real issue is a problem with the formatting of the input document.

And if you must use Windows notepad or Google Docs, you could always create/find a shell script or the like that strips out the BOM before passing it to MMD or other tools that don't expect the BOM to be present in UTF-8 documents.
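As a rough illustration of the kind of pre-filter described above, here is a sketch in C rather than shell. The function name `strip_bom_stream` is hypothetical. It copies one stream to another and drops a leading EF BB BF sequence if one is present. It is an assumption-laden example, not code from MMD.

```c
#include <stdio.h>

/*
 * Hypothetical helper: copy everything from `in` to `out`,
 * dropping a leading UTF-8 BOM (EF BB BF) if the input starts with one.
 * Returns 0 on success, -1 on a read or write error.
 */
int strip_bom_stream(FILE *in, FILE *out)
{
    unsigned char head[3];
    size_t n = fread(head, 1, sizeof head, in);
    int c;

    /* Only a complete three-byte match counts as a BOM. */
    int has_bom = (n == 3 &&
                   head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF);

    /* No BOM: the bytes already read belong to the document, so emit them. */
    if (!has_bom && n > 0 && fwrite(head, 1, n, out) != n) {
        return -1;
    }

    /* Copy the remainder of the stream unchanged. */
    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF) {
            return -1;
        }
    }
    return ferror(in) ? -1 : 0;
}
```

Wrapped in a `main` that calls `strip_bom_stream(stdin, stdout)`, this could sit in a pipeline in front of MMD or any other tool that does not expect a BOM. Input without a BOM passes through untouched.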

jkallay commented 13 years ago

Severity and frequency are as orthogonal to one another as they are to "the right thing." ;)

The important thing is, as you note, that there's a workaround, and with the issue documented any of the tiny community of Google Docs and Windows users should be able to deal with it...

Thanks for the great tool.


fletcher commented 13 years ago

Fixed in latest development commit, thanks to John MacFarlane!