Pandocs output HTML looks jagged with tag-closing angle brackets on the next line

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Run pandoc on any markdown file

What is the expected output? What do you see instead?

Why does Pandoc's output HTML look jagged like this, with tag-closing angle
brackets on the next line:

<h2 id="et-harum-quidem-rerum-facilis-est-et-expedita-distinctio"
>Et harum quidem rerum facilis est et expedita distinctio.</h2
><p
>Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil
impedit quo minus id quod maxime placeat facere possimus, omnis voluptas
assumenda est, omnis dolor repellendus.</p
><ul
><li
  >Temporibus autem quibusdam et aut officiis debitis.</li
  ><li
  >Itaque earum rerum hic tenetur a <em
    >sapiente</em
    > delectus.</li
  ><li
  >Lorem ipsum dolor sit amet, consectetur adipisicing elit.</li
  ></ul
>

rather than how it looks after running tidy on it?

<h2 id="et-harum-quidem-rerum-facilis-est-et-expedita-distinctio">
Et harum quidem rerum facilis est et expedita distinctio.</h2>
<p>Nam libero tempore, cum soluta nobis est eligendi optio cumque
nihil impedit quo minus id quod maxime placeat facere possimus,
omnis voluptas assumenda est, omnis dolor repellendus.</p>
<ul>
<li>Temporibus autem quibusdam et aut officiis debitis.</li>
<li>Itaque earum rerum hic tenetur a <em>sapiente</em>
delectus.</li>
<li>Lorem ipsum dolor sit amet, consectetur adipisicing elit.</li>
</ul>

What version of the product are you using? On what operating system?

pandoc 1.1 -citeproc -highlighting

Ubuntu 8.10 \n \l

Please provide any additional information below.

The jagged output is hard to read, tidy complains about it and some syntax
highlighters (notably vim) can't handle it correctly.

Original issue reported on code.google.com by bpjonsson@gmail.com on 5 Mar 2009 at 2:30

GoogleCodeExporter commented 8 years ago

This is how Haskell's Text.XHtml library produces indented HTML output.  It is
syntactically correct HTML.  I can't reproduce any tidy complaints -- tidy
handles it just fine for me -- and if I want something without "jagged" tags,
I just pipe through tidy.  Note that you can use the --no-wrap option to
disable nesting and get maximally compact HTML output.

Original comment by fiddloso...@gmail.com on 25 Mar 2009 at 6:14

GoogleCodeExporter commented 8 years ago

I would like this changed too. This is the first negative thing I noticed when
I first used pandoc. (The second was wide tables.) And I don't want to have to
download Tidy to fix it. I know it is syntactically correct, but I've never
seen anyone write HTML this way. I understand if the standard library does it,
but I think an issue should be filed with Text.XHtml to at least offer an
option. My largest complaint with generated code is that it is never the same
as a human would write it...

Original comment by jmin...@gmail.com on 2 Apr 2009 at 6:02

GoogleCodeExporter commented 8 years ago

I tried Tidy but have problems with it. When indenting, Tidy puts
<pre><code></code></pre> on four separate lines, causing an extra blank line to
show up at the end in the browser; it looks bad. At least pandoc's output looks
right in a browser. Second, Tidy is leaving spaces out after </em>, </code>,
etc., making words run together.

If pandoc is changed, I think it should treat some tags as block level and some
as inline. The block level ones should be put on new lines indented more, but
inline tags (a, em, strong, code, etc.) get put in the middle of a line:

<div>
  <p>This keyword is <em>very</em> useful: <code>if</code></p>
</div>
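
The block/inline split proposed above could be prototyped roughly as follows. This is only a minimal sketch in Python using the standard html.parser module, not how pandoc or Text.XHtml work. The tag sets CONTAINERS and LEAF_BLOCKS and the names BlockIndenter and indent_html are hypothetical, chosen for illustration, and the sketch makes no attempt to preserve whitespace inside <pre>.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch only; adjust to taste.
CONTAINERS = {"div", "ul", "ol", "blockquote"}          # get their own lines, children indented
LEAF_BLOCKS = {"p", "li", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}  # one line each


class BlockIndenter(HTMLParser):
    """Re-emit HTML with block tags on new lines and inline tags kept in-line."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.lines = []   # finished output lines
        self.buf = ""     # the line currently being assembled
        self.depth = 0    # current nesting depth of container tags

    def _flush(self):
        # Emit the pending line (if any) at the current indentation.
        if self.buf.strip():
            self.lines.append("  " * self.depth + self.buf.strip())
        self.buf = ""

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag in CONTAINERS:
            self._flush()
            self.lines.append("  " * self.depth + raw)
            self.depth += 1
        elif tag in LEAF_BLOCKS:
            self._flush()
            self.buf = raw
        else:
            self.buf += raw  # inline tag: stays in the middle of the line

    def handle_startendtag(self, tag, attrs):
        self.buf += self.get_starttag_text()  # e.g. <br />

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            self._flush()
            self.depth = max(self.depth - 1, 0)
            self.lines.append("  " * self.depth + f"</{tag}>")
        else:
            self.buf += f"</{tag}>"
            if tag in LEAF_BLOCKS:
                self._flush()

    def handle_data(self, data):
        # Drop whitespace-only text between blocks, keep everything else.
        if self.buf or data.strip():
            self.buf += data

    def handle_entityref(self, name):
        self.buf += f"&{name};"

    def handle_charref(self, name):
        self.buf += f"&#{name};"


def indent_html(html):
    parser = BlockIndenter()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)
```

Feeding it the compact one-line form of the example above, for instance the output of pandoc --no-wrap, should give back the three-line layout shown.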

Original comment by jmin...@gmail.com on 5 Apr 2009 at 3:32

GoogleCodeExporter commented 8 years ago

My version of tidy doesn't have this issue.  I just ran

~$ echo "*very* important" | pandoc | tidy --show-body-only 1
<p><em>very</em> important</p>

and as you can see it's OK.  Maybe this is an issue with your tidy configuration
(although I can't ATM think of what option may affect this)?  I think you
should ask about this on the html-tidy mailing list.  After all, if it is a bug
in tidy it should be fixed in tidy and not in Pandoc.

In the meantime, piping through a perl one-liner like

pandoc input.md \
| perl -p -e 's#(</(?:a|em|strong|code)>)[ ]#$1&nbsp;#g;' \
| tidy --bare yes --quiet yes >output.html

will probably give you an acceptable output, except that the --bare option to
tidy will convert *every* &nbsp; into a space.  It might then be better to
replace &nbsp; with some entity you're unlikely to actually have in your HTML,
like

pandoc input.md \
| perl -p -e 's#(</(?:a|em|strong|code)>)[ ]#$1 #g;' \
| tidy --quiet yes 
| perl -p -e 's// /g' >output.html

Of course you may want to put the three piped lines into an alias so that you 
can say

pandoc input.md | myalias  > output.html

On windows I guess you'd want to turn the whole thing into a bat file,
but there is a reason I switched to Linux: I got batty over batfiles! ;-)

/BP

Original comment by bpjonsson@gmail.com on 5 Apr 2009 at 2:09

GoogleCodeExporter commented 8 years ago

Doh, it should be

pandoc input.md \
| perl -p -e 's%(</(?:a|em|strong|code)>)[ ]%$1%g;' \
| tidy --quiet yes \
| perl -p -e 's// /g' >output.html

That's what you get for being too copy-pasty!

Original comment by bpjonsson@gmail.com on 5 Apr 2009 at 2:11

GoogleCodeExporter commented 8 years ago

It is probably because the link I downloaded Tidy from had a several-year-old
Tidy. It looked like it hadn't changed much over the years. When I mentioned
changing pandoc, I meant changing its output so that Tidy is not needed at all.

Last night, I went ahead and wrote a program that would take the output from
pandoc --no-wrap and indent it like I want. Now I can get nice output using

pandoc file.md --no-wrap | indent_xhtml > file.html

I need to add more code to fix tables though (separate issue).

Original comment by jmin...@gmail.com on 6 Apr 2009 at 2:07

GoogleCodeExporter commented 8 years ago

I'd like to see this fixed, too. While it is indeed valid HTML, that standard is
complicated enough that the principle must be applied of "be liberal in what you
accept, and conservative in what you generate". As well as plain broken HTML
parsers, there are also "too-clever" parsers which are supposed to handle
invalid HTML sensibly. An example of the latter is the default Drupal parser,
which apparently supplies the "missing" '>' character at the end of each line,
and treats the '>' on the next line as part of the text.

While this behaviour is unfortunate, it's arguably correct for a parser which is
designed to handle HTML that has just been typed into a browser window by a
human. (And there *is* an option to turn it off, I discovered later.)

This issue came *very* close to causing me to abandon "pandoc" within minutes of
installing it! I didn't have "tidy" handy at the time (I've remedied this now),
and although '--no-wrap' fixes the ugly HTML syndrome, I don't particularly want
my documents as a single long line! Currently, I'm using this:

pandoc --no-wrap | sed 's/<p/\n<p/g'

But I'd much rather that pandoc produced beautiful HTML in the first place!

Original comment by LibreSof...@gmail.com on 17 Apr 2009 at 3:33

GoogleCodeExporter commented 8 years ago

This is NOT A BUG and machine-generated HTML should all be like this. The
reason is that any whitespace (including newlines and tabs) between HTML tags
will cause the browser to insert a space character between those elements. It
is far easier on the machine logic to leave these spaces out, because then you
don't need to think about the possible ways that the HTML text formatting can
fuck with the browser adding extra spaces.

Seriously, if you need pretty HTML then just pipe it through tidy as suggested,
but realise that this will lead to some extra space characters being inserted
where you might not want them (e.g. between two adjacent images, or something).
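
The difference between a line break inside a tag and one between tags can be checked with any HTML tokenizer. The following is a minimal sketch using Python's standard html.parser; the names TokenCollector and tokens and the three sample strings are made up for illustration. It only shows what a parser sees, not how a particular browser renders the result.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Record start tags, end tags and text nodes in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokens(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


# "Jagged" style: line breaks sit inside the tags, before the closing '>'.
jagged = "<li\n  >a <em\n    >sapiente</em\n    > delectus.</li\n  >"
# The same markup on one line.
compact = "<li>a <em>sapiente</em> delectus.</li>"
# Naive indentation: line breaks sit between tags and text.
spaced = "<li>\n  a <em>sapiente</em> delectus.\n</li>"

print(tokens(jagged) == tokens(compact))   # same tags, same text nodes
print(tokens(spaced) == tokens(compact))   # text nodes now carry extra whitespace
```

In the jagged form the text nodes are identical to the compact form, whereas the naively indented form picks up extra whitespace in its text nodes, which is the effect described above.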

Original comment by infinity0x@gmail.com on 16 Nov 2009 at 12:29

GoogleCodeExporter commented 8 years ago

This is maybe not a bug, but it's a bit retarded imho. I mean, who in their
sane brain would write html like that? nobody.

I suggest that pandoc outputs the expected output and not the "but it's valid
html!!!1" kind of output.

Original comment by philippe...@gmail.com on 3 Feb 2011 at 3:27

GoogleCodeExporter commented 8 years ago

Concerning Comment 8: for HTML output, all whitespace is treated the same, so
why not just split the line where there is already a space? Then there is no
difference. Most machine-generated HTML that I have seen is NOT indented
anyway, but all on one line. If we want it indented, then we know what this
entails, and if we don't like it, then we'll say --no-wrap. Isn't indentation
just about the looks anyway?

Original comment by benm.mor...@gmail.com on 3 Feb 2011 at 4:19

GoogleCodeExporter commented 8 years ago

"Fixed" in 90647a56f6a742e7ecc1cec44042e256ff94b802

Pandoc now prints tidy-like output by default (not indented, but with line 
breaks between blocks).
--no-wrap behaves as before -- no extra spaces.

Original comment by fiddloso...@gmail.com on 5 Feb 2011 at 4:51

Changed state: Fixed

anammari / pandoc

Pandocs output HTML looks jagged with tag-closing angle brackets on the next line #134