beautifier / js-beautify

Beautifier for javascript
https://beautifier.io
MIT License

Improve default list of "block" formatted HTML5 elements. #1732

Open garretwilson opened 4 years ago

garretwilson commented 4 years ago

Hi, @bitwiseman. This ticket is just a suggestion to try to be helpful. Feel free to use it or not. It doesn't make a real difference to me, but someone might appreciate it.

You'll remember that we had a big discussion years ago on Issue #841 about rules of an ideal formatting engine. And I want to reaffirm that I am very grateful for your help and that of @madman-bob for finally fixing the core formatting rules in Issue #1033. I can't say "thank you" enough.

Many, many years ago I had written my own XML/HTML serializer in Java, but I had never got around to implementing a comprehensive set of whitespace formatting rules. I'm now overhauling the formatting engine (you can track it on Jira in JAVA-158), and I'm almost done.

Once I'm completely finished I can try to summarize the rules I came up with if you're interested. One of the interesting outcomes is that we don't actually need all the categories I proposed when I was outlining a formatting algorithm off the top of my head. It turns out we only need categories of "block" and "inline", although of course some elements (e.g. <pre>) will have formatting disabled.

Additionally since browsers normally format unrecognized elements as display: inline, we would want to merely specify the block elements and let everything else default to inline.

I'm writing this ticket because I was looking for an official list of "block" elements. This is not a semantic category, but rather a list of elements that the HTML specification recommends the browser should by default format as display: block. (See Browsers' default CSS for HTML elements on Stack Overflow.) You can find these elements in the most recent W3C HTML5 specification at HTML 5.2 § 10. Rendering. Lastly, it turns out you need to include not just the elements that default to display: block but also those that default to display: list-item, in order to include <li>.
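As a rough illustration of "list the block elements and let everything else default to inline", here is a minimal Python sketch. The set contents are only an illustrative subset, not the complete list from the specification, and the names `DISPLAY_BLOCK`, `DISPLAY_LIST_ITEM`, and `is_block` are hypothetical.

```python
# Illustrative subset only. The complete list would come from the
# "Rendering" section of the HTML specification.
DISPLAY_BLOCK = frozenset({
    "address", "article", "aside", "blockquote", "div", "figcaption",
    "figure", "footer", "header", "main", "nav", "ol", "p", "pre",
    "section", "ul",
})

# <li> defaults to display: list-item rather than display: block,
# so it needs its own entry to be picked up.
DISPLAY_LIST_ITEM = frozenset({"li"})

BLOCK_ELEMENTS = DISPLAY_BLOCK | DISPLAY_LIST_ITEM


def is_block(tag_name: str) -> bool:
    """Return True for listed block elements; anything else is inline."""
    return tag_name.lower() in BLOCK_ELEMENTS
```

With this shape, an unrecognized element such as a custom `my-widget` falls through to inline without needing an entry of its own.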

I'm using a js-beautify formatted HTML file in unit tests for my Java code to see how much the output of my algorithm differs. The resulting formatted documents are almost the same. The one difference I noted is that js-beautify formats:

<figure><figcaption>A "Hello World" Java Program</figcaption>
  <pre class="line-numbers"><code class="language-java">package com.example;
…

While my Java implementation formats this as

<figure>
  <figcaption>A "Hello World" Java Program</figcaption>
  <pre class="line-numbers"><code class="language-java">package com.example;
…

This is because I'm going off the official list of recommended browser default display: block elements as explained above, and <figcaption> by default should be a block element as per the W3C / WHATWG.
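The difference between the two outputs above can be reproduced with a toy serializer in which only the set of block tags changes. This is a simplified sketch, not the algorithm of js-beautify or of the Java serializer: `Element`, `render`, and the tag sets are hypothetical, attributes are ignored, and `<p>` stands in for `<pre>` to avoid preformatted content.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Element:
    tag: str
    children: Tuple[Union["Element", str], ...] = ()


def render(node, block_tags, depth=0, indent="  "):
    """Put each block child on its own indented line; keep the rest inline."""
    if isinstance(node, str):
        return node
    out = f"<{node.tag}>"
    has_block_child = False
    for child in node.children:
        if isinstance(child, Element) and child.tag in block_tags:
            has_block_child = True
            out += "\n" + indent * (depth + 1)
        out += render(child, block_tags, depth + 1, indent)
    if has_block_child:
        # The closing tag returns to the parent's own indentation level.
        out += "\n" + indent * depth
    return out + f"</{node.tag}>"


figure = Element("figure", (
    Element("figcaption", ("A caption",)),
    Element("p", ("Body",)),
))

# <figcaption> treated as inline: it stays on the <figure> line.
print(render(figure, {"figure", "p"}))
# <figcaption> treated as block: it moves to its own line.
print(render(figure, {"figure", "figcaption", "p"}))
```

The only input that differs between the two calls is whether `figcaption` is in the block set, which is the point of the suggestion.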

Anyway this is less a request than a suggestion that you may find helpful. It would probably be nice to have js-beautify follow a more official list, but it's not causing me great problems at the moment.

All the best!

garretwilson commented 4 years ago

Actually I spoke too soon. It looks like we'll need to include some semantic categories as well, including metadata elements and script elements. I'll come back and update this ticket with a complete list.

bitwiseman commented 4 years ago

I look forward to seeing the complete list. This has the list of "phrasing-content" as I understood it: https://github.com/beautify-web/js-beautify/pull/1407/files#diff-73474d346ba5e7ef772942fca4411708R94

I think the current behavior is to treat elements as block by default. Your suggestion here is to invert the behavior so that elements are treated as inline by default?

garretwilson commented 4 years ago

> I think the current behavior is to treat elements as block by default.

That's what I recall.

> Your suggestion here is to invert the behavior so that elements are treated as inline by default?

Exactly. There is no standard that I'm aware of that says, "here is how HTML source code should be formatted". But I think most users would agree that the HTML source code formatting should, for the most part, reflect the rendering of the HTML in the browser. (That's a reasonable starting point, anyway, although users of course may want to override some things.)

Browser rendering is based on CSS, which defaults to display: inline for unknown elements. You can read an explanation of this on Stack Overflow, but in short the CSS display property has an initial value of inline as per the CSS 2.1 spec. The CSS3 Display Module says the same thing.

The only problem is that this only applies to non-hidden elements. Things like <head> are considered "hidden" so I'm determining the best way to find a semantic grouping that would include the right hidden elements that we expect to be formatted as block in the HTML source code.
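One possible way to express this, sketched in Python under stated assumptions: keep the rendering-based list and add a second, separate set for elements that are never rendered but that authors presumably still expect on their own lines. Both sets are illustrative subsets, and the names `RENDERED_BLOCK`, `HIDDEN_BUT_BLOCK_IN_SOURCE`, and `formats_as_block` are hypothetical.

```python
# Elements rendered with display: block or display: list-item
# (illustrative subset only).
RENDERED_BLOCK = frozenset({"body", "div", "li", "p", "ul"})

# Elements that are not rendered (display: none), so a rendering-based
# list never mentions them as block. Treating them as block in source
# formatting is an assumption about what most authors want.
HIDDEN_BUT_BLOCK_IN_SOURCE = frozenset({
    "base", "head", "link", "meta", "script", "style", "title",
})


def formats_as_block(tag_name: str) -> bool:
    """Block if rendered as block or hidden-but-structural; otherwise inline."""
    name = tag_name.lower()
    return name in RENDERED_BLOCK or name in HIDDEN_BUT_BLOCK_IN_SOURCE
```

Keeping the two sets apart makes it visible which entries come from browser rendering defaults and which are a formatting decision.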

garretwilson commented 4 years ago

> This has the list of "phrasing-content" as I understood it ...

Yes, I'm pretty sure I'm the one who suggested using HTML5 "phrasing content" in #840. The definition says basically that it's stuff in a paragraph, so for the most part I think those do reflect the right things.

Now that I'm looking more closely at it, I'm not sure it's so straightforward with nested things. For example "phrasing content" includes <area> inside a map, which someone might want to be placed on individual lines. And what about <param> inside <object> and <source> inside <video>? From the MDN examples, it looks like they would look nicer on separate lines, even though <object> and <video> are "phrasing" elements. But then again if the <object> had no <param> children, we wouldn't want it to break a paragraph by itself.
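To make the dilemma concrete, here is a hedged Python sketch of one possible rule: a phrasing element breaks onto separate lines only when it actually has the relevant children. The mapping, `Element`, and `breaks_lines` are hypothetical and cover only the pairs mentioned above; this is not a settled definition.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Element:
    tag: str
    children: Tuple["Element", ...] = ()


# Hypothetical mapping from a phrasing element to the child elements
# that would justify putting its content on separate lines.
LINE_BREAKING_CHILDREN = {
    "map": frozenset({"area"}),
    "object": frozenset({"param"}),
    "video": frozenset({"source"}),
}


def breaks_lines(element: Element) -> bool:
    """True only if the element has at least one of its listed children."""
    wanted = LINE_BREAKING_CHILDREN.get(element.tag, frozenset())
    return any(child.tag in wanted for child in element.children)
```

Under this rule an empty `<object>` stays inline inside a paragraph, while an `<object>` with `<param>` children would be split across lines.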

I'm almost thinking my original idea of a "container" classification might be useful. A container would:

But that is getting pretty complicated. And besides I'm not sure how I would want these sort of elements formatted. Maybe they would be fine just formatted inline in a paragraph.

I'm still not 100% sure how I would want <iframe> to be formatted, even though it's "phrasing content". And what about <frame> inside a <frameset> (although those are obsolete and should not be used)?

I think the biggest doubt I have is things like <datalist><option>, and of course <option> within a <select>.

For now I'm going with this definition of "block element":

For a pretty complex HTML document, the output of my algorithm matches that of js-beautify almost exactly (except for <figcaption> as mentioned). The only other difference is the order of attributes, which my algorithm changes to be consistent based upon certain rules.

For me that will work for now, and I'll keep thinking about what I want to do with the nested things like <option> and <param> and such.

garretwilson commented 4 years ago

My Java-based XML/HTML formatter is finished with JAVA-158 and its corresponding pull request, which was just merged. This is significant for me because I wrote the XML processing code around 20 years ago (originally I wrote the entire XML parser, but I now use the built-in Java XML parser) and the XML serializer I wrote not too long after; but I had never got around to formatting the whitespace in a meaningful way. Now the output is virtually identical to that of js-beautify.

I want to stress that the use case for my formatter is slightly different from that of js-beautify. Your formatter is meant to take existing HTML source code and make it prettier by improving it, while still maintaining certain aesthetic decisions of the author (e.g. whether the author places extra newlines between paragraphs, etc.). My formatter, on the other hand, is more oriented towards generating HTML. It will form part of the core HTML generation of Guise™ Mummy, my static site generator (which I expect will outperform and produce more standards-compliant content than competitors such as Hugo and Jekyll).

Thus my formatter works from the entire parsed DOM XHTML tree, and is more deterministic. It's not necessarily more opinionated (although it has far fewer options than js-beautify at the moment), but for its set parameters it controls everything for completely reproducible and consistent output. For example you can look at the HtmlSerializer which sets up a default format profile that specifies even the order of common HTML attributes. We would want id to always appear in front of class and style, for instance. (One can plug in a different HTML profile to override this of course.)
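A profile-driven attribute order could look roughly like the following Python sketch. Only "id in front of class and style" comes from the description above; the relative order of class and style, the alphabetical fallback for other attributes, and the names `PREFERRED_ORDER` and `order_attributes` are assumptions for illustration and do not reflect the actual HtmlSerializer profile.

```python
# Hypothetical profile: id first, then class and style.
PREFERRED_ORDER = ("id", "class", "style")


def order_attributes(attributes: dict) -> list:
    """Return (name, value) pairs in a deterministic, profile-driven order."""
    rank = {name: position for position, name in enumerate(PREFERRED_ORDER)}
    # Attributes outside the profile sort after the listed ones,
    # alphabetically (an assumption made for this sketch).
    return sorted(
        attributes.items(),
        key=lambda item: (rank.get(item[0], len(rank)), item[0]),
    )


ordered = order_attributes(
    {"style": "color: red", "href": "#", "class": "note", "id": "intro"}
)
```

Because the result depends only on the profile and the attribute names, the same element always serializes the same way regardless of the order in the input.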

Because I'm working from the DOM tree, I don't have to worry about parsing and can instead concentrate on the semantic structure. This led to a pretty concise algorithm. Look at the documentation and the logic of the XMLSerializer.serializeContent(Appendable, Node, boolean) method for more information. If you're interested in discussing it further I can try to summarize the algorithm better, although the API Javadocs for that method give a rough overview.

The exciting thing about this formatting algorithm is that I can do the normalizing and formatting on the fly in an immutable way, that is without modifying the DOM tree! I wasn't sure it was possible, but it works great. I can normalize whitespace, strip out the line endings, and then use an algorithm based upon the HTML profile to add back line endings and indents according to whether children are block elements.
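The idea of normalizing during serialization, rather than rewriting the tree first, can be sketched in Python as follows. This is a simplified illustration and not the logic of the Java XMLSerializer: `Element`, `BLOCK_TAGS`, and `serialize` are hypothetical, attributes are ignored, and every child of an element that has block children is placed on its own line.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)  # frozen: the tree cannot be modified in place
class Element:
    tag: str
    children: Tuple[Union["Element", str], ...] = ()


BLOCK_TAGS = frozenset({"div", "p"})  # illustrative subset
_WHITESPACE = re.compile(r"\s+")


def serialize(node, depth=0, indent="  ") -> str:
    if isinstance(node, str):
        # Collapse whitespace runs (including line endings) while emitting.
        return _WHITESPACE.sub(" ", node)
    has_block_child = any(
        isinstance(child, Element) and child.tag in BLOCK_TAGS
        for child in node.children
    )
    parts = [f"<{node.tag}>"]
    for child in node.children:
        text = serialize(child, depth + 1, indent)
        if has_block_child:
            text = text.strip()
            if not text:
                continue  # skip whitespace-only text between blocks
            parts.append("\n" + indent * (depth + 1))
        parts.append(text)
    if has_block_child:
        parts.append("\n" + indent * depth)
    parts.append(f"</{node.tag}>")
    return "".join(parts)


doc = Element("div", (
    "\n    ",
    Element("p", ("Hello,\n      world",)),
    "\n  ",
    Element("p", ("Bye",)),
    "\n",
))
formatted = serialize(doc)
```

The original whitespace text nodes are still present in `doc` after the call; only the emitted string is normalized and re-indented.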

Anyway I just opened this ticket to share my research about block elements. Feel free to close the ticket if you don't plan to make any changes to js-beautify, and let me know if you have any questions about any of this.

Happy holidays.