PataphysicalSociety / soupault

Static website generator based on HTML element tree rewriting
https://soupault.app
MIT License

Soupault's HTML prettifying doesn't preserve whitespace correctly #46

Open untitaker opened 2 years ago

untitaker commented 2 years ago

The following markdown document:

# Welcome to my website.

is converted by pandoc -f markdown -t html -fmarkdown-implicit_figures --no-highlight into:

<h1>Welcome to my website.</h1>

However, after soupault is done with parsing the output, the following HTML is produced:

<h1>
  Welcome to my website.
</h1>

This introduces an extra space after the period, which is visible in text selections in Firefox but has no visible effect in Chrome. See also https://github.com/whatwg/html/issues/8003

However, regardless of how browsers handle this, I think soupault should allow me to remove the trailing whitespace, and especially should not add it by itself. Ideally, an HTML5 tokenizer should produce exactly the same tokens before and after soupault has parsed and serialized the document.
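The round-trip expectation above can be checked mechanically. The following is a minimal sketch, assuming Python's standard `html.parser` as a stand-in tokenizer (it is not a fully spec-compliant HTML5 tokenizer), and using the two snippets quoted in this issue as literal strings rather than invoking pandoc or soupault. The `TokenCollector` class and `tokenize` function are illustrative names, not part of any of the tools discussed here.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Record start tags, end tags, and character data in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(html):
    """Return the list of tokens the stand-in tokenizer sees in html."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


# The two snippets from this issue, copied as string literals.
pandoc_output = "<h1>Welcome to my website.</h1>"
soupault_output = "<h1>\n  Welcome to my website.\n</h1>"

before = tokenize(pandoc_output)
after = tokenize(soupault_output)

print(before)
print(after)
print("round trip preserved tokens:", before == after)
```

The tag tokens match, but the text token differs: the pretty-printed version carries leading and trailing whitespace that the original text node did not have.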

dmbaturin commented 2 years ago

This is an interesting issue indeed... Intuitively, <pre> is the only element where leading and trailing whitespace around the element content should be significant, so my opinion is that all browsers should ignore it.

However, I agree that the current "always put tags on separate lines" approach is a bit heavy-handed and often produces a result that is the opposite of pretty. I'd be happy to work with the maintainer of lambdasoup to make it more flexible, but I suppose we'll have to wait for WHATWG's response regarding whitespace significance to know whether the current behavior should still be allowed or not.

Meanwhile, you can disable pretty-printing with pretty_print_html = false under [settings].

dmbaturin commented 2 years ago

Correction: with pretty_print_html = false, of course! I edited the original comment to fix that.

I should probably also improve the docs for that section, because right now all those options are lumped together in "Basic configuration", and the commented config sample with them is really huge.

untitaker commented 2 years ago

I don't think we have to wait to see what the browser vendors and the spec body do with this issue. A functional HTML tokenizer and parser needs to keep the whitespace intact; this is very clear from the WHATWG spec. pretty_print_html = false definitely solves my issue, and I also think it would be a better default.

egrieco commented 2 years ago

Is there actually an extra space character (U+0020), or is the browser rendering the line feed (U+000A)? There is no trailing space character in the above example.

This may actually be a browser issue.
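One way to settle the question is to look at the code points directly. This is a small sketch that treats the pretty-printed snippet from the top of this issue as a string literal (an assumption: the real output file may differ, for example in line endings) and lists the whitespace characters on either side of the heading text. `code_points` is a helper name made up for this example.

```python
# The pretty-printed snippet from this issue, copied as a string literal.
soupault_output = "<h1>\n  Welcome to my website.\n</h1>"

# Text between the opening and closing tags.
text = soupault_output[len("<h1>"):-len("</h1>")]

# Whitespace before and after the visible heading text.
leading = text[: len(text) - len(text.lstrip())]
trailing = text[len(text.rstrip()):]


def code_points(chars):
    """Format each character as a U+XXXX code point."""
    return [f"U+{ord(c):04X}" for c in chars]


print("leading: ", code_points(leading))   # ['U+000A', 'U+0020', 'U+0020']
print("trailing:", code_points(trailing))  # ['U+000A']
```

In this snippet the only trailing character is a line feed. Under normal CSS whitespace collapsing a line feed can still be rendered as a space, so both observations in this thread can hold at once.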

P.S. If you want specific formatting, run a prettifier or a minifier on the code after Soupault generates the site. I was doing this with Zola before I found Soupault, and it actually helped me catch a few errors in the framework I was using.

I'd love to see asset pipeline or post-processing support in Soupault, though it's not really that difficult to just do those steps manually or in a shell script.
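A post-processing step of this kind can be a short script that runs after the site is built. The sketch below is only an illustration: `post_process` and `trim_h1_whitespace` are hypothetical names, the regex-based transform is a crude stand-in for a real prettifier or minifier, and the build directory is a temporary folder created for the demo rather than an actual soupault output directory.

```python
import re
import tempfile
from pathlib import Path


def post_process(build_dir, transform):
    """Apply transform to every .html file under build_dir, in place.

    Returns the paths of the files whose content changed.
    """
    changed = []
    for path in sorted(Path(build_dir).rglob("*.html")):
        original = path.read_text(encoding="utf-8")
        result = transform(original)
        if result != original:
            path.write_text(result, encoding="utf-8")
            changed.append(path)
    return changed


def trim_h1_whitespace(html):
    """Crude stand-in for a real formatter: trim whitespace inside <h1>."""
    html = re.sub(r"<h1>\s+", "<h1>", html)
    return re.sub(r"\s+</h1>", "</h1>", html)


# Demo on a throwaway directory standing in for the generated site.
with tempfile.TemporaryDirectory() as build_dir:
    page = Path(build_dir) / "index.html"
    page.write_text("<h1>\n  Welcome to my website.\n</h1>\n", encoding="utf-8")
    changed_names = [p.name for p in post_process(build_dir, trim_h1_whitespace)]
    result = page.read_text(encoding="utf-8")

print(changed_names)
print(result)
```

Passing the transform in as a function keeps the directory walk separate from the formatting decision, so the stand-in can be replaced by a call to whichever external tool is preferred.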

dmbaturin commented 2 years ago

@egrieco Since 4.0.0, you can use the "save" hook to take over the output writing stage. The only shortcoming is that there's no Lua function that would allow you to send a string to an external filter's stdin. However, it's not hard to add; it's just that I haven't had a use case for it yet and no one else has asked me to add it.

If an HTML formatter supports modifying a file in-place, it's a non-issue, of course—you can just run it on the page file after writing it.

I wonder if I should also add a separate "post-write" hook specially for these cases, though.

egrieco commented 2 years ago

Yeah, I hadn't gotten around to looking at whether an "asset pipeline" could be implemented directly within Soupault. This would be useful for generating several sizes of images, and potentially several formats, to use in srcsets.

P.S. @dmbaturin Soupault is one of the coolest and most useful pieces of software I've run across in at least a decade. You really saved my students. I've been wondering how I was going to go from basic "intro to web dev" to a static site generator without a lot of needless pain. Almost all of the generators have some major flaw that contributes to severe friction or limitations in what sites can be built.

I cannot thank you enough for Soupault. I have plenty more to say, but don't want to pollute this issue. :)

dmbaturin commented 2 years ago

@egrieco Maybe make a separate issue for discussions of post-processing. In fact, I do already have a plugin that handles assets in a non-trivial way: https://github.com/dmbaturin/iproute2-cheatsheet/blob/master/plugins/inline-assets.lua reads asset files and inlines them into the page (CSS and JS as is, images Base64 encoded).
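For readers unfamiliar with the technique, Base64 inlining means replacing a reference to an image file with a `data:` URI that carries the file's bytes. The sketch below illustrates the general idea in Python; it is not the code of the linked Lua plugin, and `data_uri` is a name invented for this example.

```python
import base64
import mimetypes


def data_uri(filename, content):
    """Build a data: URI for content, guessing the MIME type from filename."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(content).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Only the six-byte GIF signature is used here, to keep the output short.
uri = data_uri("dot.gif", b"GIF89a")
print(uri)
print(f'<img src="{uri}" alt="">')
```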

egrieco commented 2 years ago

@dmbaturin Soupault just keeps getting better and better. :)

I haven't been playing with Soupault for even a full day yet. I'm setting up several sites in it now. Let me get a better handle on what it can actually do so I don't file any spurious issues.

In the meantime I sent you an email from my @egx.com address. My profound thanks for building Soupault.

delan commented 6 months ago

"Intuitively, <pre> is the only element where leading and trailing whitespace around the element content should be significant, so my opinion is that all browsers should ignore it."

This is not really a safe assumption to make because of CSS. I ran into this with a retrocomputing website where I use “older” techniques like building navigation with nav > ul > li { display: inline-block }. Here’s a minimal example:

# soupault.toml
[settings]
  generator_mode = false
  pretty_print_html = false

<!-- site/index.html -->
<!doctype html>
<meta charset="utf-8">
<style>
    nav li {
        display: inline-block;
        outline: 1px solid;
        padding: 0.5em;
    }
</style>
<nav><ul>
    <li>home
    <!-- implied </li> --><li>about
    <!-- implied </li> --><li>projects
    <!-- implied </li> --><li>contact
<!-- implied </li> --></ul></nav>
[Two screenshots comparing the rendered navigation with pretty_print_html = false and pretty_print_html = true; images not preserved.]

It would be good for lambdasoup to prettify HTML in a way that doesn’t affect whitespace between elements, or even without affecting whitespace in text nodes at all (because it changes the DOM), but in the meantime, I’m happy to send a patch to warn about this in the docs and default toml if you like. Thanks for making soupault!
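The DOM change described above can be made visible by counting the whitespace-only text nodes that sit directly inside the list. This is a minimal sketch under several assumptions: it uses Python's `html.parser` rather than lambdasoup, it writes the closing `</li>` tags explicitly so no implied end tags need handling, and the pretty-printed markup is an assumed shape modelled on the h1 example at the top of this issue, not captured soupault output. `GapFinder` and `inter_item_gaps` are illustrative names.

```python
from html.parser import HTMLParser


class GapFinder(HTMLParser):
    """Collect whitespace-only text nodes that are direct children of <ul>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open elements, innermost last
        self.gaps = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] == "ul" and data.isspace():
            self.gaps.append(data)


def inter_item_gaps(html):
    """Return the whitespace-only text nodes found directly inside <ul>."""
    finder = GapFinder()
    finder.feed(html)
    finder.close()
    return finder.gaps


compact = (
    "<nav><ul><li>home</li><li>about</li>"
    "<li>projects</li><li>contact</li></ul></nav>"
)

# Assumed pretty-printed shape: every tag on its own line.
pretty = """<nav>
  <ul>
    <li>
      home
    </li>
    <li>
      about
    </li>
    <li>
      projects
    </li>
    <li>
      contact
    </li>
  </ul>
</nav>"""

print(len(inter_item_gaps(compact)))  # 0
print(len(inter_item_gaps(pretty)))   # 5
```

The text nodes that land between two `li` elements are the ones that matter for `display: inline-block` layouts, because whitespace between inline-level boxes is rendered as a gap.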