Acknowledged.
There is some simple python code to check the links on a page here: https://dev.to/arvindmehairjan/build-a-web-crawler-to-check-for-broken-links-with-python-beautifulsoup-39mg
This code is plain wrong. The `response_code` used for the child hrefs is the same one used for the parent page, and is not refreshed for each href on the parent page. The `link.get` call just returns the href URL, and does not make an HTTP call.
I'm searching for other open source implementations that we can reuse. We'll also need to handle relative and absolute paths in the href.
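For what it's worth, here's a minimal sketch of how the per-link check could be corrected - this is just an illustration, not project code: each child href gets its own request, and `urljoin` resolves relative hrefs against the parent page URL.

```python
# Hypothetical sketch: one fresh request per child href, with urljoin
# resolving relative hrefs against the parent page URL.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def check_links(page_url):
    """Report broken <a> hrefs found on a single page."""
    page = requests.get(page_url)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a", href=True):
        href = link.get("href")              # just the attribute value - no HTTP call here
        target = urljoin(page_url, href)     # handles relative and absolute paths
        response = requests.get(target)      # fresh request per child href
        if response.status_code >= 400:
            print(f"BROKEN ({response.status_code}): {target} on {page_url}")
```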
This looks to be more reliable. Checking further. https://realpython.com/beautiful-soup-web-scraper-python/
@rnllv - I've been thinking about this.
We wish to do quality checking of the parsed data dump to verify things are (or aren't) present in the results set. We can't use real entities. But, we could do this: introduce unique marker strings (e.g. `%TEST-23%`). We inject these values into the mock/real data for relevant content. Note: we're adding that to the data, not replacing data with it. So, we have to check that relevant strings include the marker, not equal it. At the end of the parse phase, these markers get removed.
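A tiny sketch of how such a check might read - all names here are hypothetical, including `parse()`:

```python
# Hypothetical illustration of the marker idea - parse() and the sample text
# are made-up stand-ins. The marker is appended to mock content, so the test
# asserts "contains", never "equals".
MARKER = "%TEST-23%"

def inject_marker(text: str) -> str:
    return f"{text} {MARKER}"            # added to the data, not replacing it

def test_marker_survives_parse():
    source = inject_marker("some mock country content")
    result = parse(source)               # parse() is the (assumed) step under test
    assert MARKER in result              # check inclusion, not equality
```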
@IanMayo, would we really need the above for all test cases?
Consider the commit: e0671346fbbef0d8c419e574cbb3a4d246a131c0
Here we know what we're looking for, and have specifically tested for that.
> would we really need the above for all test cases?
No, we wouldn't need that strategy for all test cases - just ones that require insertion of some specific data in the content.
Our DITA-OT process does a good job of checking content. Let's defer this for a future phase.
We're working on the assumption that the legacy manual is of high quality, with all valid links, and all linked images in place.
That may be mistaken.
So, once we have our logic to "walk" the tree of data, we should check all links are valid. I guess this is a "crawl" of the whole site, starting from `PD_1.html`. We could also check that all pages get "visited", that there are no "orphans". Note: while we're using the "world map" as the "welcome page", there are actually some pages in front of that. Maybe we will switch to another "welcome" page, or maybe we'll crawl those in a separate process - depending upon how our future physical architecture turns out, I guess.
I guess that our version would be iterative. A function receives a URL. It checks the URL exists. If it's a text file, extract all the `<a>` links, and fire each into the function. The function would just report an error on the command line for broken/missing link destinations.

Hmm, this could grow, since for different types of page we could check different types of content are present. We assume a flag is present on each country page - but it would be useful to test that.
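A rough skeleton along those lines, assuming we run it over the generated HTML on disk - the file layout and names are assumptions, not project code:

```python
# Rough sketch of the "walk": follow <a> links from the start page, report
# missing link targets, then list pages that were never reached at all.
import os
import pathlib
from bs4 import BeautifulSoup

visited = set()

def crawl(path):
    path = os.path.normpath(path)
    if path in visited:
        return
    visited.add(path)
    if not os.path.exists(path):
        print(f"MISSING: {path}")
        return
    if not path.endswith(".html"):           # only parse text/HTML pages further
        return
    soup = BeautifulSoup(pathlib.Path(path).read_text(), "html.parser")
    for link in soup.find_all("a", href=True):
        href = link["href"].split("#")[0]    # drop in-page fragments
        if not href or href.startswith(("http://", "https://", "mailto:")):
            continue                         # skip external links in this sketch
        target = os.path.join(os.path.dirname(path), href)
        crawl(target)                        # recurse into the linked page

def report_orphans(site_root):
    all_pages = {os.path.normpath(str(p)) for p in pathlib.Path(site_root).rglob("*.html")}
    for orphan in sorted(all_pages - visited):
        print(f"ORPHAN (never visited): {orphan}")

# e.g. crawl("site/PD_1.html"); report_orphans("site")
```

The `visited` set doubles as the orphan check: any page under the site root that never appears in it was never reached from the start page.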
Aah, this will probably be useful in checking the consistency of the mock content that Ian is producing.

So, I guess we need to put a skeleton in place, and grow that.
Possible roles:
Note: for the last item, we may need to invite human inspection. For that, maybe we'll do one of the following: