[bikeshed] Bikeshed broke again

rossberg commented 1 year ago

The bikeshed build started breaking again, across all repos, without any changes on our end. Relevant part of the CI transcript:

217  python util/bikeshed_fixup.py _build/bikeshed_singlehtml/index.html \
218          >_build/bikeshed_singlehtml/index_fixed.html
219  mkdir -p _build/bikeshed_mathjax/
220  bikeshed spec index.bs _build/bikeshed_mathjax/index.html
221  LINE 2:1: Malformed HTML comment '<!DOCTYPE '.
222  ✘  Did not generate, due to fatal errors

I have no idea what is causing this or how to fix it. The input file (index_fixed.html) looks fine on my local machine.

tabatkins commented 1 year ago

Ah, the fact that doctype can't have any whitespace before it is a mistake on my part. I'll fix, sorry about that.

The other part of #1626 (making sure that pre elements start on their own line) is something that I'm enforcing now because it was always a potential source of problems - the Markdown parser only looks for the start tags at the start of the line (optionally with whitespace), so if anything else came before them they wouldn't get recognized as "raw elements" that avoid Markdown parsing inside of them.

tabatkins commented 1 year ago

Could you point me to where that pre error is being reported, btw? This repo is too complicated for me to immediately figure out how to parse it.

tabatkins commented 1 year ago

Hm, I'm also not sure how the doctype fix in #1626 is fixing anything. A doctype before the metadata block will trigger a (new) complaint that you should put the metadata first, a doctype after the metadata block will avoid tripping the doctype-strip so it'll instead try to parse as a comment (and fail).

You don't ever need a doctype in a Bikeshedded document, fwiw - the serializer always outputs a <!doctype html> for you.

rossberg commented 1 year ago

Hi Tab, thanks for looking into this!

Re DOCTYPE: Sphinx apparently produces an HTML file with an empty line before the DOCTYPE comment. This file is included by document/core/index.bs. The error message quoted in the OP is rather strange, and it took my various attempts to fix it, but when I removed that empty line as per #1626, bikeshed stopped complaining.

Re PRE: After fixing the DOCTYPE issue, the PRE error showed up. That was somewhere on line 12XYZ of this generated file, so I cannot easily point to it. But if you remove the second patch line added by the PR, and then run (cd document/core; make bikeshed) in this repo, you should be able to recreate it.

Unfortunately, the HTML is generated by another third-party tool. And IIUC, it's in its right to generate this, as it is perfectly fine HTML. So it's outside our power to prevent the PRE case, other than post-processing it away. That's a bit unfortunate.

the Markdown parser only looks for the start tags at the start of the line (optionally with whitespace), so if anything else came before them they wouldn't get recognized as "raw elements" that avoid Markdown parsing inside of them.

Well, I know some who would probably argue that this demonstrates that markdown, especially when mixing it with HTML, is perhaps a subideal concept. As is employing regexps for parsing, like many respective tools unfortunately do. :)

tabatkins commented 1 year ago

The error message quoted in the OP is rather strang

Yeah, the main parser function I'm using doesn't look for doctype nodes in its main loop, just all the other node types. I strip a leading doctype from the start of the content before doing the main loop, so if it doesn't get stripped there, instead it ends up attempting to do a comment parse, failing, and then falling down to plain text. I should go ahead and recognize doctype nodes in the main loop and just throw them away, instead.

And IIUC, it's in its right to generate this, as it is perfectly fine HTML.

Note that Bikeshed source files are not strictly HTML, they're actually Bikeshed-flavored Markdown, which is a slight variant on standard markdown to be much friendlier to including lots of raw HTML. But that does mean that some things that are "perfectly fine HTML" won't work well with Bikeshed! If you had content preceding your pre start tags before, then it was only working correctly by accident, basically, and might have broken at any point in the future. (Not y'all's fault, tho; I wasn't previously in a position to detect and report this correctly.)

(The big change here, btw, is me switching away from regexps in several places and employing a new bespoke HTML parser instead. But Markdown is still fundamentally a significant-newlines syntax, so it is safe to use regexes there, within the confines of Markdown-specific stuff.)

It sounds like you're actually generating plain HTML, tho, and then just feeding that to Bikeshed, so the Markdown step (ideally) isn't doing anything at all for you?

rossberg commented 1 year ago

It sounds like you're actually generating plain HTML, tho, and then just feeding that to Bikeshed, so the Markdown step (ideally) isn't doing anything at all for you?

Yes, that's right, I believe. AFAICT, we are only using it to generate the standardised front matters. If there was a way to tell it not to attempt processing the HTML include, that might solve many of our problems.

(That said, Brad set up our bikeshed procedure, so I may be missing something relevant.)

tabatkins commented 1 year ago

Yeah, it would be pretty easy to do a "skip the markdown part" config, if y'all are indeed not using it.

KayKay1377 commented 1 year ago

We will get it done in the next few years im sure

rossberg commented 1 year ago

Cool, thanks! Keep us posted!

tabatkins commented 1 year ago

(Sorry, I have no idea who that was, in case you thought it was a reply about my last comment. I still need an answer to "if y'all are indeed not using it" before I'll do anything.)

KayKay1377 commented 1 year ago

I have to ask my husband 1st

rossberg commented 1 year ago

(Oops.)

I just checked the build script. Our HTML is generated by Sphinx, so that should be pure HTML. It is then run through a fixup script, which tweaks it for bikeshed consumption. But all that does is replacing some HTML with other HTML -- with one exception:

  # Use bikeshed biblio references for unicode and IEEE754
  data = data.replace(
      """<a class="reference external" href="https://www.unicode.org/versions/latest/">Unicode</a>""",
      "[[!UNICODE]]"
  )
  data = data.replace(
      """<a class="reference external" href="https://ieeexplore.ieee.org/document/8766229">IEEE 754-2019</a>""",
      "[[!IEEE-754-2019]]"
  )

So this inserts two bikeshed biblio references. But that's the full amount of non-HTML that we feed bikeshed, outside the small index.bs wrapper which includes this HTML file. Perhaps there is a way to even avoid hacking these references?

tabatkins commented 1 year ago

Yeah that's fine, the issues I was alluding to are just about parsing Markdown block constructs. All the inline stuff (both Markdown things like *foo*, if you have that option turned on, or things like biblio links/etc) is reasonably non-problematic.

It is possible to avoid even relying on those, tho - everything you can communicate in a shorthand you can write as plain HTML with attributes. See the docs for biblio links; in particular, you could write these as <a data-link-type=biblio data-biblio-type=normative>UNICODE</a> (or IEEE-754-2019).

I've filed https://github.com/speced/bikeshed/issues/2505 for you, about the ability to turn off the Markdown block processing entirely.

(Note that I've also reverted the problematic commits from the last two weeks until I get the bugs ironed out; see the post-mortem.)

rossberg commented 1 year ago

Very cool, thanks for the tip! And I'll watch the issue, such an option would probably help minimising interference between the multiple tools we depend upon.

ericprud commented 1 year ago

Apologies if I'm asking stupid Qs here; if this is for a different doc, feel free to set me straight.

If this is to build the single-page doc, who's using that doc? The last pub I did was a multi-page doc, which appeared to make the UI way faster.

BTW, I used sphinx-to-tr, which uses a respec template to build the W3C TR front matter. I chose it over bikeshed just 'cause the rest of the code was in JS (uses jsdom to manipulate the post-load promise from respec). The other two docs were built with bikeshed with one manual intervention to add the

<h2 id="subtitle" class="title">Version 2.0</h2>

(e.g. line 298 of js-api). Any advice to avoid scripting in awk or whatnot much appreciated.

rossberg commented 1 year ago

Hi @ericprud, thanks for the link. The problem is that this hasn't been integrated into our build process in the repo, so CI still builds singlehtml with bikeshed. I'll try to figure out if I can integrate it.

ericprud commented 1 year ago

ping me if you want to pair on that

WebAssembly / spec

[bikeshed] Bikeshed broke again #1625