GoogleCodeExporter closed 8 years ago
This is how Haskell's Text.XHtml library produces indented HTML output. It is
syntactically correct HTML. I can't reproduce any tidy complaints -- tidy
handles it just fine for me -- and if I want something without "jagged" tags,
I just pipe through tidy. Note that you can use the --no-wrap option to
disable nesting and get maximally compact HTML output.
Original comment by fiddloso...@gmail.com
on 25 Mar 2009 at 6:14
I would like this changed too. This is the first negative thing I noticed when
I first used pandoc. (The second was wide tables.) And I don't want to have to
download Tidy to fix it. I know it is syntactically correct, but I've never
seen anyone write HTML this way. I understand if the standard library does it,
but I think an issue should be filed with Text.XHtml to at least offer an
option. My largest complaint with generated code is that it is never the same
as a human would write it...
Original comment by jmin...@gmail.com
on 2 Apr 2009 at 6:02
I tried Tidy but have problems with it. When indenting, Tidy puts
<pre><code></code></pre> on four separate lines, causing an extra blank line
to show up at the end in the browser; it looks bad. At least pandoc's output
looks right in a browser.
Second, Tidy is leaving spaces out after </em>, </code>, etc., making words run
together.
If pandoc is changed, I think it should treat some tags as block level and some
as inline. The block level ones should be put on new lines indented more, but
inline tags (a, em, strong, code, etc.) get put in the middle of a line:
<div>
  <p>This keyword is <em>very</em> useful: <code>if</code></p>
</div>
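A minimal sketch of that block/inline split, in Python rather than anything pandoc itself uses. The two tag sets and the names `Indenter` and `indent_html` are assumptions made for illustration: "container" tags get their own lines and indent their children, "leaf" blocks are printed on one line together with all their inline content. It is not a full HTML formatter (a list nested inside an `li`, for example, stays on the `li` line).

```python
from html.parser import HTMLParser

# Assumed classification for illustration; real HTML has more of each kind.
CONTAINERS = {"div", "blockquote", "ul", "ol"}      # hold other blocks
LEAF_BLOCKS = {"p", "li", "pre", "h1", "h2", "h3"}  # hold inline content


class Indenter(HTMLParser):
    def __init__(self, indent="  "):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.indent = indent
        self.depth = 0       # nesting depth of open containers
        self.leaf_depth = 0  # greater than 0 while inside a leaf block
        self.buf = []        # pieces of the line currently being collected
        self.lines = []

    def _emit(self, text):
        self.lines.append(self.indent * self.depth + text)

    def flush(self):
        text = "".join(self.buf)
        self.buf = []
        if text:
            self._emit(text)

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if self.leaf_depth == 0 and tag in CONTAINERS:
            self.flush()
            self._emit(raw)          # container opens on its own line
            self.depth += 1
            return
        if tag in LEAF_BLOCKS:
            self.leaf_depth += 1
        self.buf.append(raw)         # leaf and inline tags stay in the line

    def handle_startendtag(self, tag, attrs):
        self.buf.append(self.get_starttag_text())  # e.g. <br />
        if self.leaf_depth == 0:
            self.flush()

    def handle_endtag(self, tag):
        if self.leaf_depth == 0 and tag in CONTAINERS:
            self.flush()
            self.depth = max(0, self.depth - 1)
            self._emit(f"</{tag}>")  # container closes on its own line
            return
        self.buf.append(f"</{tag}>")
        if tag in LEAF_BLOCKS and self.leaf_depth > 0:
            self.leaf_depth -= 1
            if self.leaf_depth == 0:
                self.flush()         # outermost leaf block ends the line

    def handle_data(self, data):
        if self.leaf_depth == 0 and not data.strip():
            return                   # drop whitespace between blocks
        self.buf.append(data)

    def handle_entityref(self, name):
        self.buf.append(f"&{name};")

    def handle_charref(self, name):
        self.buf.append(f"&#{name};")


def indent_html(html, indent="  "):
    parser = Indenter(indent)
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)


print(indent_html(
    "<div><p>This keyword is <em>very</em> useful: <code>if</code></p></div>"
))
```

Because inline tags are never moved onto their own lines, no whitespace is added or removed inside a paragraph, which avoids the missing-space problem described for Tidy.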
Original comment by jmin...@gmail.com
on 5 Apr 2009 at 3:32
My version of tidy doesn't have this issue. I just ran
~$ echo "*very* important" | pandoc | tidy --show-body-only 1
<p><em>very</em> important</p>
and as you can see it's OK. Maybe this is an issue with your tidy configuration
(although I can't ATM think of what option may affect this)? I think you
should ask about this on the html-tidy mailing list. After all, if it is a bug
in tidy it should be fixed in tidy and not in Pandoc.
In the meantime piping through a perl oneliner like
pandoc input.md \
| perl -pi~ -e 's#(</(?:a|em|strong|code)>)[ ]#$1&nbsp;#g;' \
| tidy --bare yes --quiet yes >output.html
will probably give you an acceptable output, except that the --bare option to
tidy will convert *every* &nbsp; into a space. It might then be better to
replace the space with some entity you're unlikely to actually have in your
HTML, like
pandoc input.md \
| perl -p -e 's#(</(?:a|em|strong|code)>)[ ]#$1 #g;' \
| tidy --quiet yes
| perl -p -e 's// /g' >output.html
Of course you may want to put the three piped lines into an alias so that you
can say
pandoc input.md | myalias > output.html
On Windows I guess you'd want to turn the whole thing into a bat file,
but there is a reason I switched to Linux: I got batty over batfiles! ;-)
/BP
Original comment by bpjonsson@gmail.com
on 5 Apr 2009 at 2:09
Doh, it should be
pandoc input.md \
| perl -p -e 's%(</(?:a|em|strong|code)>)[ ]%$1%g;' \
| tidy --quiet yes \
| perl -p -e 's// /g' >output.html
That's what you get for being too copy-pasty!
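The same protect-then-restore idea, sketched in Python. Everything named here is an assumption for illustration: the placeholder `&#xE000;` is just one entity unlikely to occur in real input (not the one used in the commands above), and `lossy_formatter` is a stand-in that mimics the reported behaviour of dropping the space after a closing tag; it is not tidy.

```python
import re

# Space directly after a closing inline tag, as in the perl one-liner.
INLINE_CLOSE = re.compile(r"(</(?:a|em|strong|code)>) ")

# Hypothetical placeholder: a private-use character reference.
PLACEHOLDER = "&#xE000;"


def protect_spaces(html):
    """Swap the space after a closing inline tag for the placeholder."""
    return INLINE_CLOSE.sub(lambda m: m.group(1) + PLACEHOLDER, html)


def restore_spaces(html):
    """Turn every placeholder back into an ordinary space."""
    return html.replace(PLACEHOLDER, " ")


def lossy_formatter(html):
    """Stand-in for a formatter that eats spaces after closing tags."""
    return re.sub(r"(</\w+>) +", r"\1", html)


src = "<p>This is <em>very</em> useful <code>if</code> used</p>"
print(lossy_formatter(src))                                   # words run together
print(restore_spaces(lossy_formatter(protect_spaces(src))))   # spaces survive
```

Using a placeholder other than &nbsp; means that non-breaking spaces already present in the document are left alone when the placeholder is converted back.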
Original comment by bpjonsson@gmail.com
on 5 Apr 2009 at 2:11
It is probably because the link I downloaded Tidy from had a several-year-old
Tidy. It looked like it hadn't changed much over the years. When I mentioned
changing pandoc, I meant changing its output so that Tidy is not needed at all.
Last night, I went ahead and wrote a program that would take the output from
pandoc --no-wrap and indent it like I want. Now I can get nice output using
pandoc file.md --no-wrap | indent_xhtml > file.html
I need to add more code to fix tables though (separate issue).
Original comment by jmin...@gmail.com
on 6 Apr 2009 at 2:07
I'd like to see this fixed, too. While it is indeed valid HTML, that standard is
complicated enough that the principle must be applied of "be liberal in what you
accept, and conservative in what you generate". As well as plain broken HTML
parsers, there are also "too-clever" parsers which are supposed to handle
invalid HTML sensibly. An example of the latter is the default Drupal parser,
which apparently supplies the "missing" '>' character at the end of each line,
and treats the '>' on the next line as part of the text.
While this behaviour is unfortunate, it's arguably correct for a parser which is
designed to handle HTML that has just been typed into a browser window by a
human. (And there *is* an option to turn it off, I discovered later.)
This issue came *very* close to causing me to abandon "pandoc" within minutes of
installing it! I didn't have "tidy" handy at the time (I've remedied this now),
and although '--no-wrap' fixes the ugly HTML syndrome, I don't particularly want
my documents as a single long line! Currently, I'm using this:
pandoc --no-wrap | sed 's/<p/\n<p/g'
But I'd much rather that pandoc produced beautiful HTML in the first place!
Original comment by LibreSof...@gmail.com
on 17 Apr 2009 at 3:33
This is NOT A BUG and machine-generated HTML should all be like this. The
reason is that any whitespace (including newlines and tabs) between HTML tags
will cause the browser to insert a space character between those elements. It
is far easier on the machine logic to leave these spaces out, because then you
don't need to think about the possible ways that the HTML text formatting can
fuck with the browser adding extra spaces.
Seriously, if you need pretty HTML then just pipe it through tidy as suggested,
but realise that this will lead to some extra space characters being inserted
where you might not want them (e.g. between two adjacent images, or something).
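A small check of this point, sketched with Python's standard html.parser; the helper name `text_nodes` and the image file names are made up for the example. It lists the text that a parser sees between tags for three layouts of the same two images. Note that this is a tokenizer, not a browser: whether a whitespace text node shows up as a visible gap depends on where it occurs, and between inline elements such as images it normally does.

```python
from html.parser import HTMLParser


class TextNodes(HTMLParser):
    """Collect every run of character data found between tags."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def text_nodes(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


compact = '<img src="a.png"><img src="b.png">'
pretty = '<img src="a.png">\n<img src="b.png">'     # newline BETWEEN the tags
jagged = '<img src="a.png"\n><img src="b.png"\n>'   # newline INSIDE the tags

for name, html in [("compact", compact), ("pretty", pretty), ("jagged", jagged)]:
    print(name, text_nodes(html))
```

Only the layout with the newline between the tags produces a text node. A line break placed inside a tag, just before the closing '>', adds nothing to the content, which is why that "jagged" layout can wrap lines without changing what is rendered.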
Original comment by infinity0x@gmail.com
on 16 Nov 2009 at 12:29
This is maybe not a bug, but it's a bit retarded imho. I mean, who in their
sane brain would write html like that? nobody.
I suggest that pandoc outputs the expected output and not the "but it's valid
html!!!1" kind of output.
Original comment by philippe...@gmail.com
on 3 Feb 2011 at 3:27
Concerning Comment 8: for HTML output, all whitespace is treated the same.
So why not just split the line where there is already a space, so that it makes
no difference? Most machine-generated HTML that I have seen is NOT indented
anyway, but all on one line. If we want it indented, then we know what this
entails, and if we don't like it, then we'll say --no-wrap. Isn't indentation
just about the looks anyway?
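A rough sketch of "split only where there is already a space", using Python's textwrap module; the function name `wrap_at_spaces` is an assumption for the example. It relies on runs of whitespace collapsing to a single space, so it should not be applied to content where whitespace is significant, such as <pre> blocks or attribute values that contain spaces.

```python
import textwrap


def wrap_at_spaces(html, width=72):
    """Break lines only at existing whitespace; never split a word or tag name."""
    return textwrap.fill(
        html,
        width=width,
        break_long_words=False,   # leave over-long words or tags on one line
        break_on_hyphens=False,   # a hyphen is not a legal break point here
    )


src = ("<p>This keyword is <em>very</em> useful: <code>if</code> and it can be "
       "used in many places in a program.</p>")
wrapped = wrap_at_spaces(src, width=40)
print(wrapped)

# Collapsing whitespace again gives back the original single-line text.
print(" ".join(wrapped.split()) == src)
```

Since each line break replaces a space that was already there, no new whitespace is introduced between elements that were previously adjacent.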
Original comment by benm.mor...@gmail.com
on 3 Feb 2011 at 4:19
"Fixed" in 90647a56f6a742e7ecc1cec44042e256ff94b802
Pandoc now prints tidy-like output by default (not indented, but with line
breaks between blocks).
--no-wrap behaves as before -- no extra spaces.
Original comment by fiddloso...@gmail.com
on 5 Feb 2011 at 4:51
Original issue reported on code.google.com by
bpjonsson@gmail.com
on 5 Mar 2009 at 2:30