TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

New tei-c site breaks the build process? #1783

Closed martindholmes closed 6 years ago

martindholmes commented 6 years ago

The TEIP5-Documentation-dev build failed this morning in what I think was its first build run since the new TEI website was unveiled. The problem comes here in the Makefile:

    if [ -n ${GOOGLEANALYTICS} ] ; then curl -sL http://www.tei-c.org/ | sed 's/content="text\/html"/content="text\/html; charset=utf-8"/' | xmllint --html --noent --dropdtd --xmlout - > Utilities/teic-index.xml;fi

where we curl the home page of the TEI site and try to transform it. The problem is that the new site home page is not XHTML5 but HTML5, and it's not processable with xmllint.

Possible solutions:

lb42 commented 6 years ago

Piping it through HTML tidy is probably the quickest fix. Or just don't do this weird step at all.

martindholmes commented 6 years ago

I think we need the file in order to get menus and the like for the TEI site version of the Guidelines. It's always been a bit fragile, but when the site pages were all generated by XSLT from XML sources, we could be sure the result was processable.

But Tidy is a problematic solution because it's a binary, and only newer versions support HTML5. My brand-new Ubuntu 18.04 has a version from 2016 in the repos, but the TEI Jenkins server I run is much older and its only available version is from 2009; to be sure this would work, we'd have to put various binaries (for different platforms) into the TEI repo.

lb42 commented 6 years ago

In that case surely we could ask the webmaster to make available a version of the information in the format we need? (We should generate one now in any case.) Scraping it from the web isn't the answer.

martindholmes commented 6 years ago

The reason we scrape it is that the website changes, and one reason for switching to the new version is that it makes it easier to change (and therefore we can expect it to change more). So I think we're stuck with scraping. But if it's impossible to get WordPress to produce decent XHTML5, we may have to do some text manipulation to get just the bits of text we want. I've just assigned Kevin to the ticket so he can perhaps let us know how practical it would be to wrestle WordPress into producing valid XHTML5.
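For what it's worth, here is a rough sketch of what "getting just the bits we want" could look like with a lenient HTML parser instead of XML tooling. It uses only Python's standard-library `html.parser`. The assumption that the menu lives inside a `<nav>` element, the sample markup, and the link paths are all hypothetical and not taken from the real site.

```python
from html.parser import HTMLParser


class NavLinkExtractor(HTMLParser):
    """Collect (href, text) pairs for links found inside any <nav> element."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0        # > 0 while we are inside a <nav>
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1
        elif tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((self.current_href, text))
            self.current_href = None

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


def extract_nav_links(markup):
    parser = NavLinkExtractor()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical HTML5 page: unclosed <meta>, <li> and <br> are fine for this parser.
SAMPLE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>TEI</title></head>
<body>
<nav><ul>
<li><a href="/guidelines/">Guidelines</a>
<li><a href="/activities/">Activities</a>
</ul></nav>
<p>See <a href="/news/">news</a><br>
</body></html>"""

nav_links = extract_nav_links(SAMPLE)
```

The catch is the same as with any scraping: the selector (here, "links inside `<nav>`") has to be kept in step with whatever the site actually emits.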

martindholmes commented 6 years ago

This suggests that WordPress SHOULD do the right thing:

https://make.wordpress.org/core/handbook/best-practices/coding-standards/html/

kshawkin commented 6 years ago

I suspect that https://make.wordpress.org/core/handbook/best-practices/coding-standards/html/ has been written in parts by various people, leading to inconsistent use of "HTML" and "XHTML" and not actually specifying the version of (X)HTML against which any WordPress code should validate. Since the way of the web is currently HTML5, which is not well-formed XML, I don't think we can expect that WordPress components, especially in case we add any plugins, will always be XHTML.
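To make the HTML5 versus XHTML5 point concrete, here is a minimal sketch using only the Python standard library. The two snippets are invented for illustration: the first leaves void elements such as `<meta>` and `<br>` unclosed, which is legal HTML5 but not well-formed XML; the second is the same content in XML syntax.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical HTML5 fragment: <meta> and <br> are never closed.
HTML5_SNIPPET = (
    '<html><head><meta charset="utf-8"><title>TEI</title></head>'
    "<body><p>Hello<br>world</p></body></html>"
)

# The same content in XHTML5 (XML) syntax: every element is closed.
XHTML5_SNIPPET = (
    '<html><head><meta charset="utf-8"/><title>TEI</title></head>'
    "<body><p>Hello<br/>world</p></body></html>"
)


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that just records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

With these, `is_well_formed_xml(HTML5_SNIPPET)` is false while `is_well_formed_xml(XHTML5_SNIPPET)` is true, and `html_start_tags` reads either variant without complaint. Any XML-based step in the build needs the second kind of input.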

As Martin noted, we've been scraping the current website at build time so that the HTML version of the guidelines will share the same navigation menu as the current website. I think HTML Tidy will be the safest route to go.

Please note that whatever code has been used to grab just the menu from the homepage will probably need to be updated. The homepage is no longer called index.xml, and there are probably now different class attributes on the various components. If it helps, please feel free to compare with http://www-old.tei-c.org/ , where OpenCMS is still running.

martindholmes commented 6 years ago

The WordPress page explicitly states that all elements should be properly closed, which is XHTML rules rather than HTML rules. But the actual HTML on tei-c is invalid, so clearly WordPress isn't following its own guidelines.

If we go with HTML Tidy as the solution, then I think my Jenkins server will have to be retired in favour of Peter's, and I'll have to rebuild it from scratch. That was on my agenda anyway for this year, though; the 14.04 Ubuntu LTS it's running is only supported till next April.

lb42 commented 6 years ago

The more I hear about this can of worms, the more I think that Martin's first proposed solution above (fix the home page) is the only sensible one. Which is not to say that he shouldn't upgrade his server as well :-)

martinascholger commented 6 years ago

Agreed - XHTML5 would be a good solution. I still wonder at which point we need the file, since there are no menus in the Guidelines. I thought the Guidelines are independent ...

I validated the source code (www.tei-c.org) and there are some problems with the elements nested in the <footer>. A <p> encloses the whole content; <span> encloses <div>s and <h2> elements. Could we get that fixed and check if that brings us a step further? The validator also complains about the value of "rel", since this is not a standard keyword: <link rel='https://api.w.org/' href='http://www.tei-c.org/wp-json/' />.
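If it is useful for re-checking after a fix, the nesting problems can be spotted with a few lines of Python. This is only a rough sketch, not a substitute for the validator: the element sets are a deliberately small, hand-picked subset, the checker tracks open elements naively (it does not apply the HTML5 auto-closing rules), and the sample markup is invented to mimic the footer structure described above.

```python
from html.parser import HTMLParser

# Small illustrative subsets, not the full lists from the HTML spec.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
BLOCK = {"div", "h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol"}
PHRASING_ONLY = {"span", "p"}  # elements that may not contain BLOCK elements


class NestingChecker(HTMLParser):
    """Report block-level elements opened inside <span> or <p>."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open (non-void) elements
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            for ancestor in self.stack:
                if ancestor in PHRASING_ONLY:
                    self.problems.append(f"<{tag}> inside <{ancestor}>")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack and self.stack.pop() != tag:
                pass


def find_nesting_problems(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


# Hypothetical markup shaped like the footer described above.
BAD_FOOTER = "<footer><p><span><div><h2>Title</h2></div></span></p></footer>"
GOOD_FOOTER = "<footer><div><h2>Title</h2><p><span>x</span></p></div></footer>"
```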

kshawkin commented 6 years ago

The navigation menu was included in the Guidelines snapshot through version 2.9.1 ( http://www.tei-c.org/Vault/P5/2.9.1/doc/tei-p5-doc/en/html/ ) but has not appeared since then. I'm not sure whether that was an intentional decision, a work-around, or an oversight.

As for the invalid code in <footer id="footer">, this is all generated by WordPress when inserting footer widgets. Similarly, the link@rel code is inserted by WordPress as well. Neither of these things is coded by hand, and I'm reluctant to fork WordPress by hacking whatever generates this invalid code.

ebeshero commented 6 years ago

Someone coded the WordPress widgets, and they’d probably be interested in helping to find a solution to our problem, as well as to ensure compliance with their own policies. I’m sure WordPress doesn’t want to be producing invalid and deprecated code, and we should probably get them involved in helping to solve this problem.

Elisa -- Elisa Beshero-Bondar, PhD, Director, Center for the Digital Text | Associate Professor of English, University of Pittsburgh at Greensburg | Humanities Division, 150 Finoli Drive, Greensburg, PA 15601 USA | E-mail: ebb8@pitt.edu | Development site: http://newtfire.org/


yoannspace commented 6 years ago

About the <link rel='https://api.w.org/' href='http://www.tei-c.org/wp-json/' />, we have also seen this issue for another project and decided to simply delete it. No need to go through the core source code of WordPress for this. Just create a simple plugin (1 php file will be sufficient) and add a couple of remove_action():

    remove_action( 'template_redirect', 'rest_output_link_header' );
    remove_action( 'wp_head', 'rest_output_link_wp_head' );

According to WordPress, it was still valid... https://core.trac.wordpress.org/ticket/37841 but we also had library issues when parsing our website.

Should you need more help with this, I would be willing to help. Also, for the other issue (on the footer), I've never seen it but could give it a go if you'd like.

Best, Yoann

kshawkin commented 6 years ago

As noted at https://core.trac.wordpress.org/ticket/37841 , the HTML code is valid according to the WHATWG spec for HTML but not the W3C's.

So, we could create a new PHP file as a plugin to remove the REST output link. But are we sure that we don't want to provide REST output for the website?

yoannspace commented 6 years ago

Sorry, I understood it the other way: it now is valid according to both specs, W3C and WHATWG since it is in the microformats.org page.

But anyway, the REST output is still available at the given URI, but the meta element will simply be omitted in the header to pass validation. To delete the REST output altogether, you'd probably need more code which I haven't looked for yet.

EDIT: Sorry, I said meta element, but meant link element.

peterstadler commented 6 years ago

I think the issue is a little bit different: it is not the build itself that fails; rather, the build log parsing turns it into a failed build. See the log output:

    BUILD SUCCESSFUL
    Total time: 1 minute 36 seconds
    Build step 'Console output (build log) parsing' changed build result to FAILURE
    Not sending mail to unregistered user […]
    Sending e-mails to: […]
    Finished: FAILURE

That could be remedied by adding an exception to our tei-log-parse-rules, stating

    ok /HTML parser error/

I just tried that on my Jenkins, see https://jenkins-paderborn.tei-c.org/view/TEI%20dev/ and it seems to work!

Second, the resulting file Utilities/teic-index.xml is indeed well-formed, so I think we might just ignore these 'errors' – while we might need to adjust some XSLTs to take care of the changed class names etc. as @kshawkin suggested.
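A small sketch of how such an exception interacts with the other rules, in case it helps when editing tei-log-parse-rules. It is only a Python model of a rule file, not the Jenkins plugin itself. It assumes that rules are tried top to bottom, that the first match decides a line's status, and that patterns are searched anywhere in the line. The `error /[Ee]rror/` rule and the sample log lines are made up for illustration.

```python
import re

# Hypothetical rule file: the exception must come before the generic error rule.
RULES_TEXT = """
# tolerate libxml2 complaints about HTML5 elements
ok /HTML parser error/
error /[Ee]rror/
"""


def parse_rules(text):
    """Turn lines like 'ok /regex/' into (level, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.match(r"^(\w+)\s+/(.*)/\s*$", line)
        if match:
            rules.append((match.group(1), re.compile(match.group(2))))
    return rules


def classify(line, rules):
    """Return the level of the first matching rule, or None if nothing matches."""
    for level, pattern in rules:
        if pattern.search(line):
            return level
    return None


rules = parse_rules(RULES_TEXT)

# Illustrative log lines, not copied from a real build.
tolerated = classify("-:12: HTML parser error : Tag nav invalid", rules)
real_error = classify("Error: stylesheet not found", rules)
neutral = classify("BUILD SUCCESSFUL", rules)
```

Under these assumptions the order matters: if the generic `error` rule came first, the tolerated line would be classified as an error again.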

martindholmes commented 6 years ago

If the removal of the site menu from the Guidelines as they appear on the TEI-C site was intentional, then there's no need to do any of this part of the build; we only need it if we want to include the site menu, as we used to do.

kshawkin commented 6 years ago

Ah, here's where we took out the navigation menu: https://github.com/TEIC/TEI/issues/1760

hcayless commented 6 years ago

Just looking at this. The xmllint failure is not due to its inability to parse HTML, but HTML5, and it's failing because HTML5 has new elements and libxml2 still doesn't know about them.

We could simply ignore the error, which would mean assuming some risk—if we're ignoring errors, a real error might sneak through. I've tested this and it seems to work ok on the current website.
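One way to reduce that risk would be to ignore the parser's complaints but check the result afterwards. A minimal sketch in Python, assuming the build could run a small check on the scraped file: it verifies that the output is well-formed XML and that an expected element is present. The choice of `nav` as the required element is an assumption for illustration; the real menu markup may use something else.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def check_scraped_index(xml_text, required_tags=("nav",)):
    """Return a list of problems; an empty list means the file looks usable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    present = {local_name(el.tag) for el in root.iter() if isinstance(el.tag, str)}
    return [f"missing <{tag}>" for tag in required_tags if tag not in present]


# Hypothetical inputs standing in for Utilities/teic-index.xml.
good = '<html><body><nav><a href="/">Home</a></nav></body></html>'
no_menu = "<html><body><p>x</p></body></html>"
broken = "<html><body>"
```

Here `check_scraped_index(good)` returns an empty list, while the other two inputs each produce one problem, so a genuinely broken scrape would still stop the build.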

Next step is to follow this through and see what's supposed to be happening. I don't think we intentionally dropped the menus, but might well have just not noticed that they'd stopped working.

hcayless commented 6 years ago

Think it's working now. Running make teiwebsiteguidelines builds a version with the menus. That's not to say we shouldn't look further at it, but I think at least it's not a release blocker now.

jamescummings commented 6 years ago

Have these appeared on the jenkins version yet?

hcayless commented 6 years ago

Yes. I think the only way to get at it is to download http://jenkins.tei-c.org/job/TEIP5-dev/lastSuccessfulBuild/artifact/P5/teiwebsiteguidelines.zip and unzip it, but the files in there have menus.

martindholmes commented 6 years ago

I'll close this then -- problem solved.

jamescummings commented 6 years ago

Just being suspicious that I don't see these at

http://jenkins.tei-c.org/job/TEIP5-dev/lastSuccessfulBuild/artifact/P5/release/doc/tei-p5-doc/en/html/index.html

martindholmes commented 6 years ago

@jamescummings The version of the Glines built for tei-c is slightly different from the generic one. As Hugh says, you have to download the zipped package intended for tei-c to see the full site version.

jamescummings commented 6 years ago

@martindholmes: I trust you. :-)

martinascholger commented 6 years ago

It works :-)