Open martindholmes opened 7 years ago
Internal links involve a possibly unhealthy degree of magic in any case (same-document links to other documents, e.g.). We might need to resolve that issue before we can properly do link-checking.
One way to do it might be to check links in the HTML output.
@hcayless suggests that running a check on compiled P5 would be another option.
Added @martindholmes in hope he'll be willing to help
I've invited @joeytakeda to join the TEI Contributors team to work with me on this one, because we developed the diagnostics code to do this together.
Actually, doesn't the ePub build essentially do this? (Says he after spending a big chunk of Friday fixing broken links)
If the ePub build were checking everything, then the regular build process would break every time the Guidelines-Link-Check job was failing, but that's not the case. Won't the ePub check only catch things that make it into the English version of the Guidelines?
What we should do is to extract the internal-link-specific checking from here:
https://github.com/projectEndings/diagnostics/blob/master/xsl/diagnostics_master.xsl
and build that into the Test process, running against the compiled P5.
Having run that diagnostic on p5.xml (it's designed to run on TEI, rather than on HTML), I found a bunch of issues, most of which are now fixed, but I think it would make sense to take the same approach to creating an XHTML version. The diagnostic isn't designed to work in a headless context; it generates an HTML report and opens it in your browser. We would want something that simply emits error messages that would cause the build to fail (or perhaps just warnings). That's straightforward to do. Just takes time...
I've had some success in writing XQuery to find unresolved references in p5.xml. It strikes me that the converse should also be checked: find items in the bibliography that are no longer referred to. Those shouldn't necessarily be removed, but they might be candidates for removal.
Fwiw, here's my (rather rudimentary) XQuery:
declare namespace eg="http://www.tei-c.org/ns/Examples";
for $e in //*/@*[starts-with(.,'#')]
where not(//*[@xml:id = substring-after($e,'#')])
where not($e/ancestor::eg:egXML)
return $e/parent::*
Something like this could be run as a test on p5.xml to turn up broken links. Ideally, the output should be empty.
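For a headless variant of the same idea, here is a minimal Python sketch (standard library only) that mirrors the XQuery above and returns a non-zero status when anything is broken, so a build step could fail on it. The function names `find_broken_pointers` and `main` are made up for illustration and are not part of the diagnostics code. One deliberate difference from the XQuery: attribute values are split on whitespace, on the assumption that a pointer attribute such as @target may hold several pointers at once.

```python
import sys
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
EG_XML = "{http://www.tei-c.org/ns/Examples}egXML"


def find_broken_pointers(root):
    """Return (element tag, attribute name, pointer) for each unresolved '#id' pointer."""
    # Every xml:id in the document counts as a valid target, as in the XQuery.
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None}
    broken = []

    def walk(el):
        # Skip example content entirely, like the ancestor::eg:egXML test.
        if el.tag == EG_XML:
            return
        for name, value in el.attrib.items():
            # Assumption: a value may hold several whitespace-separated pointers.
            for token in value.split():
                if token.startswith("#") and token[1:] not in ids:
                    broken.append((el.tag, name, token))
        for child in el:
            walk(child)

    walk(root)
    return broken


def main(path):
    """Print one message per broken pointer to stderr; return 1 if any were found."""
    broken = find_broken_pointers(ET.parse(path).getroot())
    for tag, name, token in broken:
        print(f"Bad internal link: {tag}/@{name} -> {token}", file=sys.stderr)
    return 1 if broken else 0
```

A build job could call `sys.exit(main("p5.xml"))`, where the path is whatever the compiled file is called locally. Like the XQuery, this flags any attribute value beginning with '#', so values that are not really pointers would need to be filtered out.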
Here's an XQuery that finds <bibl> and <biblStruct> elements in the Bibliography (not in the Reading List portion) that are no longer linked to and should be candidates for updating or purging:
declare namespace t="http://www.tei-c.org/ns/1.0";
for $bib in (//t:div[@xml:id='BIB']//t:bibl/@xml:id | //t:div[@xml:id='BIB']//t:biblStruct[not(ancestor::t:div[@xml:id='BIB-RDG'])]/@xml:id)
where not(//*[@target = concat('#',$bib) or @source=concat('#',$bib)])
return $bib/parent::*
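The same check can be sketched in Python, in case a scriptable version is useful in the build. This is only an illustration under a few assumptions: the function name `unreferenced_bibl_ids` is hypothetical, @target and @source values are split on whitespace (so an entry cited alongside other pointers in one attribute still counts as referenced), and the whole BIB-RDG div is skipped for both <bibl> and <biblStruct>, which is slightly stricter than the query above.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unreferenced_bibl_ids(root, bib_id="BIB", skip_id="BIB-RDG"):
    """Return xml:ids of bibl/biblStruct in the bibliography that nothing points to."""
    # Collect every id mentioned in any @target or @source, in any namespace.
    referenced = set()
    for el in root.iter():
        for name in ("target", "source"):
            for token in el.get(name, "").split():
                if token.startswith("#"):
                    referenced.add(token[1:])

    candidates = []

    def collect(el, inside_bib):
        own_id = el.get(XML_ID)
        # Assumption: the reading list is left out completely.
        if el.tag == TEI + "div" and own_id == skip_id:
            return
        if el.tag == TEI + "div" and own_id == bib_id:
            inside_bib = True
        if inside_bib and el.tag in (TEI + "bibl", TEI + "biblStruct") and own_id:
            candidates.append(own_id)
        for child in el:
            collect(child, inside_bib)

    collect(root, False)
    return [i for i in candidates if i not in referenced]
```

Calling `unreferenced_bibl_ids(ET.parse("p5.xml").getroot())` would list candidate ids in document order. As with the XQuery, an empty result is not required here; the list is only a starting point for review.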
See https://github.com/TEIC/TEI/issues/1476#issuecomment-249831158, related (mostly inasmuch as it provides evidence that yes, such a check is a good idea).
Just ran the diagnostics XSL against a newly-built p5.xml and found the following:
Bad internal link: target: #ISO24611
Bad @xml:lang values:
xml:lang: lat
xml:lang: lat
xml:lang: cornu
xml:lang: cornu
MIME types not listed in the IANA mime types list:
mimeType: audio/wav
mimeType: image/jpeg
mimeType: audio/wav
mimeType: image/gif
mimeType: application/x-musescore
mimeType: text/xsl
I'll check these out and fix if necessary.
I've fixed the first two lots, but the media types issue is a bit fraught; although the listed ones are not in the IANA media types listing page, they do show up in other locations, so I've left them alone. The "cornu" language value was wrong; "cornu" is a variant subtag meaning Cornish-English, and I'm pretty sure that the intent for the word in question was Middle Cornish, which has the subtag "cnx", so I've changed it to that.
For the broken pointer, I added a new item to the bibliography.
We can't close this ticket because we don't have a working automated solution yet, but we can consider it done for the upcoming GL release.
As far as @sydb and I can tell, there is no check run on the HTML of the Guidelines to test that all internal links are working. The Guidelines-Link-Check task on Jenkins checks the external links, but something should check that internal links are working. This could be run as part of the TEIP5-Documentation[-dev] job.