coherentdigital / coherencebot

Develop a test tool for checking Org seed URLs and Published PDFs #12

Closed PeterCiuffetti closed 3 years ago

PeterCiuffetti commented 3 years ago

A fair number of sites in the Africa crawl failed due to robots exclusion or the lack of published PDFs.

The suggestion here is to build a test that checks whether a seed URL in the Org CV allows CoherenceBot to crawl and whether the site has published PDFs.

Robot Exclusion Test

The Python standard library module urllib.robotparser will facilitate this robots test with a few lines of code.

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("https://policycommons.net/robots.txt")
>>> rp.read()
>>> rp.crawl_delay("*")
>>> rp.can_fetch("GoogleBot","https://policycommons.net/artifacts/1381238/the-informational-content-of-ex-ante-forecasts/")
True
>>> rp.can_fetch("MauiBot","https://policycommons.net/artifacts/1381238/the-informational-content-of-ex-ante-forecasts/")
False
>>> rp.can_fetch("CoherenceBot","https://policycommons.net/artifacts/1381238/the-informational-content-of-ex-ante-forecasts/")
True
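
For the test tool, that check could be wrapped into a small helper that takes a seed URL. This is a minimal sketch assuming CoherenceBot is the user agent of interest; the function name and error handling are illustrative and not the actual check_seeds.py implementation.

import urllib.parse
import urllib.robotparser

def check_robots(seed_url, user_agent="CoherenceBot"):
    """Report whether robots.txt on the seed's host lets user_agent fetch the seed URL.

    Returns (can_fetch, crawl_delay); defaults to crawlable if robots.txt is unreadable.
    """
    parts = urllib.parse.urlparse(seed_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        # robots.txt could not be fetched at all; treat the seed as crawlable
        return True, None
    return rp.can_fetch(user_agent, seed_url), rp.crawl_delay(user_agent)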

Has Published PDFs

For checking whether a site publishes PDFs, the Google Custom Search API might offer a solution.

See this Stack Overflow entry. If the Google Search API finds no hits for this file type, then either the PDFs were also blocked to Googlebot or Google never crawled deep enough to find them. This is not a conclusive test, but it could produce an exception list for review.
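
As a rough sketch of how that probe might look with the Custom Search JSON API (the API key and search-engine ID below are placeholders, and asking for a single result is just a cheap existence check):

import json
import urllib.parse
import urllib.request

def site_has_pdfs(site, api_key="YOUR_API_KEY", cse_id="YOUR_CSE_ID"):
    """Return True if a site-restricted filetype:pdf query reports at least one result."""
    params = urllib.parse.urlencode({
        "key": api_key,
        "cx": cse_id,
        "q": f"site:{site} filetype:pdf",
        "num": 1,
    })
    with urllib.request.urlopen("https://www.googleapis.com/customsearch/v1?" + params) as resp:
        data = json.load(resp)
    total = int(data.get("searchInformation", {}).get("totalResults", "0"))
    return total > 0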

PeterCiuffetti commented 3 years ago

This is done, and reports for the current seeds in use have been uploaded to a Google Doc.

The Python code for checking a URL might be usable as a validation step in the commons/collection editor.

See: https://github.com/coherentdigital/coherencebot/blob/master/src/python/check_seeds.py#L143-L235