TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Need to add a check for internal referential integrity on the Guidelines HTML #1563

Open martindholmes opened 7 years ago

martindholmes commented 7 years ago

As far as @sydb and I can tell, there is no check run on the HTML of the Guidelines to test that all internal links are working. The Guidelines-Link-Check task on Jenkins checks the external links, but something should check that internal links are working. This could be run as part of the TEIP5-Documentation[-dev] job.

hcayless commented 7 years ago

Internal links involve a possibly unhealthy degree of magic in any case (same-document links to other documents, e.g.). We might need to resolve that issue before we can properly do link-checking.

martindholmes commented 7 years ago

One way to do it might be to check links in the HTML output.
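A minimal sketch of what checking the HTML output could look like, using only the Python standard library. It assumes all generated pages sit in one flat directory and are passed in as a mapping of file name to HTML text. The names `LinkCollector` and `broken_internal_links` are hypothetical and not part of any TEI build code.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collects every anchor target (id/name) and every href in one page."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if attr.get("id"):
            self.ids.add(attr["id"])
        # Legacy <a name="..."> anchors are also valid fragment targets.
        if tag == "a" and attr.get("name"):
            self.ids.add(attr["name"])
        if tag == "a" and attr.get("href"):
            self.hrefs.append(attr["href"])


def broken_internal_links(pages):
    """pages maps a file name to its HTML text. Returns (page, href) pairs
    whose target page or fragment does not exist within that set."""
    parsed = {}
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        parsed[name] = collector

    broken = []
    for name, collector in parsed.items():
        for href in collector.hrefs:
            # External links are assumed to be handled by a separate job.
            if "://" in href or href.startswith(("mailto:", "//")):
                continue
            target, fragment = urldefrag(href)
            target = target or name  # a bare "#frag" points at the same page
            if target not in parsed:
                broken.append((name, href))
            elif fragment and fragment not in parsed[target].ids:
                broken.append((name, href))
    return broken
```

A build step could read every `.html` file in the output directory into `pages` and fail (or just warn) when the returned list is not empty.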

tuurma commented 6 years ago

@hcayless suggests that running a check on compiled P5 would be another option

raffazizzi commented 6 years ago

Added @martindholmes in hope he'll be willing to help

martindholmes commented 6 years ago

I've invited @joeytakeda to join the TEI Contributors team to work with me on this one, because we developed the diagnostics code to do this together.

hcayless commented 6 years ago

Actually, doesn't the ePub build essentially do this? (Says he after spending a big chunk of Friday fixing broken links)

martindholmes commented 6 years ago

If the ePub build were checking everything, then the regular build process would break every time the Guidelines-Link-Check job was failing, but that's not the case. Won't the ePub check only catch things that make it into the English version of the Guidelines?

martindholmes commented 5 years ago

What we should do is to extract the internal-link-specific checking from here:

https://github.com/projectEndings/diagnostics/blob/master/xsl/diagnostics_master.xsl

and build that into the Test process, running against the compiled P5.

martindholmes commented 5 years ago

Having run that diagnostic on p5.xml (it's designed to run on TEI, rather than on HTML), I found a bunch of issues, most of which are now fixed, but I think it would make sense to take the same approach to creating an XHTML version. The diagnostic isn't designed to work in a headless context; it generates an HTML report and opens it in your browser. We would want something that simply emits error messages that would cause the build to fail (or perhaps just warnings). That's straightforward to do. Just takes time...

hcayless commented 3 years ago

I've had some success in writing XQuery to find unresolved references in p5.xml. It strikes me that the converse should also be checked: find items in the bibliography that are no longer referred to. Those shouldn't necessarily be removed, but they might be candidates for removal.

Fwiw, here's my (rather rudimentary) XQuery:

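declare namespace eg="http://www.tei-c.org/ns/Examples";
(: Every attribute whose value looks like a local pointer (#foo)... :)
for $e in //*/@*[starts-with(., '#')]
(: ...that matches no xml:id and is not inside an egXML example. :)
where not(//*[@xml:id = substring-after($e, '#')])
  and not($e/ancestor::eg:egXML)
return $e/parent::*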

Something like this could be run as a test on p5.xml to turn up broken links. Ideally, the output should be empty.
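For a headless build step, the same idea can be sketched in Python with only the standard library. This is an illustration under assumptions, not the project's test code: the function names `dangling_pointers` and `main` are hypothetical, and, like the XQuery, it treats each attribute value as a single pointer, so a value holding several space-separated pointers would be reported as broken.

```python
import sys
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
EG_XML = "{http://www.tei-c.org/ns/Examples}egXML"


def dangling_pointers(root):
    """Return (tag, attribute, value) for each '#...' attribute value that
    matches no xml:id, ignoring egXML examples and everything inside them."""
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    found = []

    def walk(el):
        if el.tag == EG_XML:
            return  # examples may legitimately point at nothing
        for name, value in el.attrib.items():
            if value.startswith("#") and value[1:] not in ids:
                found.append((el.tag, name, value))
        for child in el:
            walk(child)

    walk(root)
    return found


def main(path):
    """Print one message per broken pointer; return a process exit status."""
    problems = dangling_pointers(ET.parse(path).getroot())
    for tag, name, value in problems:
        print(f"Bad internal link: <{tag}> {name}={value}", file=sys.stderr)
    return 1 if problems else 0
```

Calling `sys.exit(main("p5.xml"))` from a build script would make the job fail whenever the list is not empty; returning 0 unconditionally would turn the messages into warnings instead.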

hcayless commented 3 years ago

Here's an XQuery that finds <bibl> and <biblStruct> elements in the Bibliography (not in the Reading List portion) that are no longer linked to and should be candidates for updating or purging:

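declare namespace t="http://www.tei-c.org/ns/1.0";
(: xml:ids of bibl and biblStruct entries in the Bibliography; biblStruct entries in the Reading List (BIB-RDG) are skipped. :)
for $bib in (//t:div[@xml:id='BIB']//t:bibl/@xml:id | //t:div[@xml:id='BIB']//t:biblStruct[not(ancestor::t:div[@xml:id='BIB-RDG'])]/@xml:id)
(: Keep only those that no @target or @source points at. :)
where not(//*[@target = concat('#',$bib) or @source=concat('#',$bib)])
return $bib/parent::*

sydb commented 3 years ago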

See https://github.com/TEIC/TEI/issues/1476#issuecomment-249831158, related (mostly inasmuch as it provides evidence that yes, such a check is a good idea).

martindholmes commented 2 years ago

Just ran the diagnostics XSL against a newly-built p5.xml and found the following:

Bad internal link:
target: #ISO24611

Bad @xml:lang values:
xml:lang: lat
xml:lang: lat
xml:lang: cornu
xml:lang: cornu

MIME types not listed in the IANA mime types list:
mimeType: audio/wav
mimeType: image/jpeg
mimeType: audio/wav
mimeType: image/gif
mimeType: application/x-musescore
mimeType: text/xsl

I'll check these out and fix if necessary.

martindholmes commented 2 years ago

I've fixed the first two lots, but the media types issue is a bit fraught; although the listed ones are not in the IANA media types listing page, they do show up in other locations, so I've left them alone. The "cornu" language value was wrong; "cornu" is a variant subtag meaning Cornish-English, and I'm pretty sure that the intent for the word in question was Middle Cornish, which has the subtag "cnx", so I've changed it to that.

For the broken pointer, I added a new item to the bibliography.

We can't close this ticket because we don't have a working automated solution yet, but we can consider it done for the upcoming GL release.