demydd / pandoc

Automatically exported from code.google.com/p/pandoc

Unexpected results from html/markdown conversions for arrow character. #96

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
Command:

echo "<p>This is a line of &lt; text and more &gt; text.</p>" | pandoc -f html -t markdown --strict | pandoc -f markdown -t html --strict

Expected output:

<p>This is a line of &lt; text and more &gt; text.</p>

Actual output:

<p>This is a line of \<text and more> text.</p>

It seems that pandoc gets the escaping of "<" wrong, and it also eats the spaces after "<" and before ">".
Removing --strict from the last conversion gives me the expected result.

pandoc 0.46 on FreeBSD 7.0.

Original issue reported on code.google.com by mar...@925.dk on 22 Oct 2008 at 8:49

GoogleCodeExporter commented 8 years ago
By the way, the only reason I use --strict is to prevent the insertion of JavaScript in the HTML output (to obscure email addresses, I assume; is it really not documented anywhere?). It would be nice to have a command-line option for disabling this.

Original comment by mar...@925.dk on 22 Oct 2008 at 8:59

GoogleCodeExporter commented 8 years ago
Conversion from html to markdown seems fine here. The problem is in the conversion from markdown to html with --strict:

% pandoc --strict
This is a line of \< text and more > text.

<p
>This is a line of \<text and more> text.</p
>

I'll look into it.

Your request for more command-line control over email obfuscation is a reasonable one. Could I ask you to file a separate issue for that, so I can keep track of it?

Original comment by fiddloso...@gmail.com on 22 Oct 2008 at 9:32

GoogleCodeExporter commented 8 years ago
Thanks for your quick comment.

OK, but according to http://daringfireball.net/projects/markdown/syntax (see the bottom of the page), it seems that < does not need to be escaped as \< in markdown.

Original comment by mar...@925.dk on 22 Oct 2008 at 9:44

GoogleCodeExporter commented 8 years ago
Sure enough.  Escaping rules are somewhat different in pandoc than in strict
markdown: see http://johnmacfarlane.net/pandoc/README.html#backslash-escapes

So in --strict mode, pandoc's markdown writer should not backslash-escape '<'. That's one part of the problem. The other part is that pandoc is parsing '< text and more >' as an HTML tag (that's why the spaces disappear).
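
To illustrate the second part, here is a minimal Python sketch. It is not pandoc's actual parser (pandoc is written in Haskell); `LIBERAL_TAG` and `normalize_tags` are hypothetical names for a matcher that accepts any tag name, whitespace after "<", and attributes without values. Under those assumptions, the example text is consumed as a single tag and re-emitted with its whitespace collapsed:

```python
import re

# Hypothetical liberal matcher: any tag name, optional whitespace after "<",
# and attributes that may or may not carry a value.
ATTR = r'[A-Za-z][\w-]*(?:\s*=\s*(?:"[^"]*"|[^\s>]+))?'
LIBERAL_TAG = re.compile(r'<\s*([A-Za-z][\w-]*)((?:\s+' + ATTR + r')*)\s*>')

def normalize_tags(text):
    """Re-emit every span that looks like a tag, collapsing its whitespace."""
    def render(match):
        name = match.group(1)
        attrs = match.group(2).split()
        return "<" + " ".join([name] + attrs) + ">"
    return LIBERAL_TAG.sub(render, text)

# In strict mode the backslash is left as a literal character, so the
# matcher sees "< text and more >" and treats "text" as the tag name.
result = normalize_tags(r"This is a line of \< text and more > text.")
print(result)  # This is a line of \<text and more> text.
```

The printed line matches the inner text of the actual output reported above, which suggests the missing spaces come from the tag being re-rendered rather than from the escaping itself.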

Original comment by fiddloso...@gmail.com on 22 Oct 2008 at 11:01

GoogleCodeExporter commented 8 years ago

Original comment by fiddloso...@gmail.com on 2 Nov 2008 at 5:03

GoogleCodeExporter commented 8 years ago
The escaping issue is solved in r1600.

The whitespace issue is one I can't figure out how to address without making pandoc's acceptance of html-ish tags less liberal. (Currently it is not restricted to official html tags, and it allows attributes without values, so your example looks like a tag to it.) So I'm going to leave this as it is for now.
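
The trade-off can be sketched in Python. This is an illustration only, not pandoc's code: `KNOWN_TAGS` is a small hypothetical sample of tag names and `find_tags` is a made-up helper. Restricting the matcher to known names leaves '< text and more >' alone, but it also stops recognizing non-standard tags:

```python
import re

# Small hypothetical sample; a real list would cover every HTML element.
KNOWN_TAGS = {"a", "b", "br", "div", "em", "hr", "img", "p", "span"}

# Matches "<name ...>" with optional whitespace and any attribute text.
TAG_LIKE = re.compile(r'<\s*([A-Za-z][\w-]*)[^<>]*>')

def find_tags(text, restrict):
    """Return tag-like spans; with restrict=True keep only known tag names."""
    return [m.group(0) for m in TAG_LIKE.finditer(text)
            if not restrict or m.group(1).lower() in KNOWN_TAGS]

sample = "a < text and more > b <hr noshade> c <mytag>"
liberal = find_tags(sample, restrict=False)
strict = find_tags(sample, restrict=True)
print(liberal)  # ['< text and more >', '<hr noshade>', '<mytag>']
print(strict)   # ['<hr noshade>']
```

The liberal setting picks up all three spans, including the false positive; the restricted setting drops the false positive but also drops `<mytag>`.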

Original comment by fiddloso...@gmail.com on 18 Jul 2009 at 7:16

GoogleCodeExporter commented 8 years ago
Understood, and thanks a lot for your continued work on improving pandoc, and for paying attention to the details.

Original comment by mar...@925.dk on 20 Jul 2009 at 7:02