coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Optimize the role of HTML URLs in CoherenceBot #9

Open PeterCiuffetti opened 3 years ago

PeterCiuffetti commented 3 years ago

In the current implementation of CoherenceBot, we are not allowing HTML reports into Policy Commons. While many publishers publish only in this format, there is simply too high a risk of noise documents landing in Policy Commons. So until we have a more sophisticated selection process, we will be ignoring HTML reports.

However, we do need HTML as a source of links to PDFs, so we need to process all descendants of the seed URLs, both HTML and PDF, and look for outbound links that share the seed's prefix.
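As a sketch only (not necessarily how CoherenceBot is configured today), this kind of prefix restriction can be expressed with Nutch's standard regex-urlfilter rules; the publisher prefix below is a hypothetical example:

```
# Hypothetical rules for conf/regex-urlfilter.txt; the prefix is an example only.
# Accept anything under the seed's prefix (HTML pages and PDFs alike).
+^https://www\.example-publisher\.org/reports/
# Reject everything else.
-.
```

Both HTML and PDF URLs pass such a filter, so HTML pages would still be fetched and parsed for outlinks even though only PDFs are admitted into Policy Commons.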

We also need to retain the URLs of HTML pages so we can revisit them when they expire to see whether new links have been added.
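The revisit requirement looks like something Nutch's normal re-fetch interval already covers. A minimal nutch-site.xml sketch, with an assumed 30-day interval rather than whatever CoherenceBot actually uses:

```xml
<!-- Assumed value for illustration; CoherenceBot's real interval may differ. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>Seconds between re-fetches of a page (30 days). Expired HTML
  pages are re-fetched and re-parsed, picking up any newly added links.</description>
</property>
```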

The main motivation is to reduce the storage consumed by CoherenceBot on the Hadoop file system. The more info about HTML we save, the more quickly this storage will fill up with unusable content.

So this task is to adjust CoherenceBot's handling of HTML. It needs to parse HTML for outbound links and nothing more. It needn't store the HTML content or metadata after parsing, because the outbound links become their own entries in the Nutch database. It's not certain whether existing functionality will permit this special treatment, so some research is needed.
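One possible approach, assuming CoherenceBot follows the standard Nutch 1.x plugin model: let HTML be fetched and parsed as usual (so its outlinks reach the CrawlDb), but drop HTML documents at indexing time with a custom IndexingFilter that returns null for text/html. This is only a sketch of the idea (class, package, and metadata key are assumptions here), and it does not by itself reclaim the segment storage on HDFS:

```java
// Hypothetical indexing filter sketch; assumes the standard Nutch 1.x
// IndexingFilter extension point. Class and package names are invented.
package org.coherentdigital.nutch.indexer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Drops HTML pages at indexing time; their outlinks already live in the CrawlDb. */
public class SkipHtmlIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    // Content-Type is recorded in the parse metadata during the parse step.
    String contentType = parse.getData().getMeta("Content-Type");
    if (contentType != null && contentType.toLowerCase().contains("text/html")) {
      return null; // returning null removes the document from the index
    }
    return doc; // PDFs and other allowed types pass through unchanged
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}
```

Whether the HTML content in the segments can also be pruned after the updatedb step, which is the actual storage concern, is the part that needs research.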

PeterCiuffetti commented 3 years ago

Marking this as a small. If it isn't a small, I wouldn't bother. It's mainly a defensive mechanism to help avoid overflowing Hadoop.