Quality checking of current content

IanMayo commented 2 years ago

We're working on the assumption that the legacy manual is of high quality, with all valid links, and all linked images in place.

That may be mistaken.

So, once we have our logic to "walk" the tree of data, we should check all links are valid. I guess this is a "crawl" of the whole site, starting from PD_1.html. We could also check that all pages get "visited", that there are no "orphans". Note: while we're using the "world map" as the "welcome page", there are actually some pages in front of that. Maybe we will switch to another "welcome" page, or maybe we'll crawl those in a separate process - depending upon how our future physical architecture turns out, I guess.

There is some simple python code to check the links on a page here: https://dev.to/arvindmehairjan/build-a-web-crawler-to-check-for-broken-links-with-python-beautifulsoup-39mg

I guess that our version would be iterative. A function receives a URL. It checks the URL exists. If it's a text-file, extract all the <a> links, and fire each into the function. The function would just report an error on the command line for broken/missing link destinations.

Hmm, this could grow, since for different types of page we could check different types of content are present. We assume a flag is present on each country page - but it would be useful to test that.

Aah, this will probably be useful in checking the consistency of the mock content that Ian is producing.

So, I guess we need to put a skeleton in place, and grow that.

Possible roles:

check for links where target is missing
check for files that aren't linked to (esp images)
produce lists of controlled vocabulary (to inform drop-down lists for some attributes). Generate to-do list of tidying
many of the tonal sources are abbreviations. Check the abbreviations are known (from the abbreviations page)
check expected content is present
Produce log of external (PDF/xls) files, list where they are linked from

Note: for the last item, we may need to invite human inspection. For that, maybe we'll do one of the following:

produce folder of blocks of inconsistent/duplicated content. A page for each instance, with copies of the relevant content, plus links to the relevant file(s)
produce tabular report with list of nature of inconsistency, plus link(s) to content

rnllv commented 2 years ago

Acknowledged.

rnllv commented 1 year ago

There is some simple python code to check the links on a page here: https://dev.to/arvindmehairjan/build-a-web-crawler-to-check-for-broken-links-with-python-beautifulsoup-39mg

This code is plain wrong. The response_code used for the child hrefs is the same one used for the parent page, and not refreshed for each href in the parent page. The link.get call just returns the href url, and does not make a http call.

I'm searching for other open source implementations that we can reuse. We'll also need to handle relative and absolute paths in the href.

rnllv commented 1 year ago

This looks to be more reliable. Checking further. https://realpython.com/beautiful-soup-web-scraper-python/

IanMayo commented 1 year ago

@rnllv - I've been thinking about this.

We wish to do quality checking of the parsed data dump to verify things are (or aren't) present in the results set. We can't use real entities. But, we could do this:

give each unit test a sequential number (or use the issue number)
include a phrase in the content that includes this, but is easily found (%TEST-23%). We inject these values into the mock/real data for relevant content. Note: we're adding that to data, not replacing data with it. So, we have to check relevant strings include that, not equal it. At the end of the parse phase, these markers get removed.

rnllv commented 1 year ago

@IanMayo, would we be really needing the above for all test cases?

Consider the commit: e0671346fbbef0d8c419e574cbb3a4d246a131c0

Here we know what we're looking for, and have specifically tested for that.

IanMayo commented 1 year ago

would we be really needing the above for all test cases?

No, we wouldn't need that strategy for all test cases - just ones that require insertion of some specific data in the content.

IanMayo commented 1 year ago

Our DITA-OT process does a good job of checking content. Let's defer this for a future phase.

DeepBlueCLtd / LegacyMan

Quality checking of current content #7