coherentdigital / coherencebot

Develop a test tool for checking Org seed URLs and Published PDFs #12

Closed PeterCiuffetti closed 3 years ago

PeterCiuffetti commented 3 years ago

A fair number of sites in the Africa crawl failed due to robots exclusion or the lack of published PDFs.

The suggestion here is to build a test that checks whether a seed URL in the Org CV allows CoherenceBot to crawl and whether the site has published PDFs.

Robot Exclusion Test

The Python standard library module urllib.robotparser will facilitate this robots test with a few lines of code.

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("https://policycommons.net/robots.txt")
>>> rp.read()
>>> rp.crawl_delay("*")
>>> rp.can_fetch("GoogleBot","https://policycommons.net/artifacts/1381238/the-informational-content-of-ex-ante-forecasts/")
True
>>> rp.can_fetch("MauiBot","https://policycommons.net/artifacts/1381238/the-informational-content-of-ex-ante-forecasts/")
False
>>> rp.can_fetch("CoherenceBot","https://policycommons.net/artifacts/1381238/the-informational-content-of-ex-ante-forecasts/")
True
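
For the test tool, that check could be wrapped into a small helper that takes a seed URL. This is a minimal sketch assuming CoherenceBot is the user agent of interest; the function name and error handling are illustrative and not the actual check_seeds.py implementation.

import urllib.parse
import urllib.robotparser

def check_robots(seed_url, user_agent="CoherenceBot"):
    """Report whether robots.txt on the seed's host lets user_agent fetch the seed URL.

    Returns (can_fetch, crawl_delay); defaults to crawlable if robots.txt is unreadable.
    """
    parts = urllib.parse.urlparse(seed_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        # robots.txt could not be fetched at all; treat the seed as crawlable
        return True, None
    return rp.can_fetch(user_agent, seed_url), rp.crawl_delay(user_agent)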

Has Published PDFs

For checking whether a site publishes PDFs, the Google Custom Search API might offer a solution.

See this Stack Overflow entry. If the Google Search API finds no hits for this file type, then either the PDFs were also blocked to Googlebot or Google never crawled deep enough to find them. This is not a conclusive test, but it could produce an exception list for review.
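
As a rough sketch of how that probe might look with the Custom Search JSON API (the API key and search-engine ID below are placeholders, and asking for a single result is just a cheap existence check):

import json
import urllib.parse
import urllib.request

def site_has_pdfs(site, api_key="YOUR_API_KEY", cse_id="YOUR_CSE_ID"):
    """Return True if a site-restricted filetype:pdf query reports at least one result."""
    params = urllib.parse.urlencode({
        "key": api_key,
        "cx": cse_id,
        "q": f"site:{site} filetype:pdf",
        "num": 1,
    })
    with urllib.request.urlopen("https://www.googleapis.com/customsearch/v1?" + params) as resp:
        data = json.load(resp)
    total = int(data.get("searchInformation", {}).get("totalResults", "0"))
    return total > 0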

PeterCiuffetti commented 3 years ago

This is done, and reports for the current seeds in use have been uploaded to a Google Doc.

The Python code for checking a URL might be usable as a validation step in the commons/collection editor.

See: https://github.com/coherentdigital/coherencebot/blob/master/src/python/check_seeds.py#L143-L235