coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Export collection metadata to S3 directly from CoherenceBot #11

Open PeterCiuffetti opened 3 years ago

PeterCiuffetti commented 3 years ago

CoherenceBot currently exports its results to an ElasticSearch index. The data is then published to S3 by a separate process that uses ES queries to select publishable material, divides it up by collection, and formats it for import into the Commons. It also runs a validator on each export file before writing it.

The goal of this ticket is to perform this step inside CoherenceBot.

This could be achieved by what Nutch calls an 'Indexer Plugin'. It already has several of these, for SOLR, ElasticSearch, and other types of search engines that Nutch might be feeding data to. These prepare batch updates that get pushed to the search engine's update API. In Nutch's Elastic Search Indexer Plugin these updates are in JSON, so it's already very similar to the exports coming out of ElasticSearch in the current implementation. These plugins also know how to delete documents that have gone missing.

We could make an S3 indexer plugin that deposits per-collection JSON files (and delete files?) to S3.
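A minimal sketch of the S3 write path such a plugin could wrap is below, assuming the AWS SDK for Java v1. The bucket, region, key layout, and class name are placeholders; the real plugin would implement Nutch's IndexWriter contract around something like this.

```java
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/**
 * Hypothetical write path for an "indexer-s3" plugin: one JSON object per document,
 * grouped under a per-collection prefix. Bucket, region and key layout are assumptions.
 */
public class S3CollectionWriter {

  private final AmazonS3 s3;
  private final String bucket;

  public S3CollectionWriter(String bucket, Regions region) {
    this.bucket = bucket;
    // The central bucket may live in a different region than the crawler (see #7).
    this.s3 = AmazonS3ClientBuilder.standard().withRegion(region).build();
  }

  /** Deposit one document's metadata as JSON under its collection prefix. */
  public void write(String collectionId, String docId, String json) {
    String key = "collections/" + collectionId + "/" + docId + ".json";
    s3.putObject(bucket, key, json);
  }

  /** Record a deletion; whether this removes the object or writes a tombstone is still open. */
  public void delete(String collectionId, String docId) {
    s3.deleteObject(bucket, "collections/" + collectionId + "/" + docId + ".json");
  }
}
```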

Or this could be achieved with what Nutch calls an "Index Filter Plugin". This is something that runs prior to the search engine update, typically either to discard unwanted documents or to enhance them. I already have several index filter plugins doing work like adding Org metadata to the index document (since this comes from a different source than the harvested page). I have another index filter that discards non-PDF documents and small PDFs.

So we can probably split this job between two plugins. The Index Filter Plugin could be the one checking that the document does not contain unwanted terms in the English title, and the S3 Indexer Plugin could write the files to S3.
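As a rough sketch of the filter half, assuming Nutch 1.x's IndexingFilter contract (returning null drops a document before indexing). The `coherencebot.title.exclusions` property name is an assumption, not an existing setting.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Drops documents whose English title contains an excluded keyword. */
public class TitleExclusionFilter implements IndexingFilter {

  private Configuration conf;
  private String[] exclusions = new String[0];

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    Object title = doc.getFieldValue("title");
    if (title != null) {
      String lower = title.toString().toLowerCase();
      for (String term : exclusions) {
        if (lower.contains(term.toLowerCase())) {
          return null; // returning null removes the document from the index batch
        }
      }
    }
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // Assumed property; would live in nutch-site.xml alongside the other plugin settings.
    exclusions = conf.getStrings("coherencebot.title.exclusions", new String[0]);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}
```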

One complication is the need to divide the exports into per-collection files. While I think this is achievable, I'm not sure whether the files will end up numerous and small. Does this make a difference?

Another feature to consider is a mechanism for adjusting the keyword exclusion list while CoherenceBot is running. I believe the Nutch config files get loaded statically, so I'm not sure whether changing them on disk results in a reload in the running cluster. In the worst case, the crawlers will have to be paused, updated, and restarted to update the exclusion list.
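One possible workaround, sketched under the assumption that the list lives in a small file the plugin re-reads whenever it changes (the same idea would work with an S3 object and its ETag); the file path is a placeholder.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

/**
 * Re-reads the keyword exclusion list when the backing file changes,
 * so the list can be updated without pausing and restarting the crawlers.
 */
public class ReloadableExclusionList {

  private final Path path;
  private volatile long lastModified = -1L;
  private volatile List<String> terms = Collections.emptyList();

  public ReloadableExclusionList(String file) {
    this.path = Paths.get(file);
  }

  public List<String> get() throws IOException {
    long modified = Files.getLastModifiedTime(path).toMillis();
    if (modified != lastModified) { // cheap timestamp check on every call
      terms = Files.readAllLines(path, StandardCharsets.UTF_8);
      lastModified = modified;
    }
    return terms;
  }
}
```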

And finally, we need to decide whether we want to maintain the CoherenceBot index now in place, which would require having two indexer plugins active: one for updating S3, the other for updating coherence-orgs' index. Alternatively, another (non-CoherenceBot) task could take care of updating this index from the S3 outputs.

As mentioned in the DevOps ticket #7, this plugin will need to be able to operate across AWS regions to store results in a central S3 bucket, so the implications of this need to be understood. (An ElasticSearch indexer plugin might have the same obstacles to overcome in receiving updates from multiple regions.)

We should also agree on the directory naming conventions in S3 and settle on a layout we are comfortable managing.

Finally, we need to decide how deletes and updates occur. A delete could mainly be an update too, one that changes the view URL to an artifact_archival location.
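If deletes are modeled as updates, the change could be as small as rewriting one field before the record is re-written to S3. The field names and archival URL pattern below are assumptions for illustration only.

```java
import java.util.Map;

/**
 * Sketch of a delete handled as an update: the record stays in the collection file,
 * but its view URL is redirected to an archived copy.
 */
public class ArchivalRewriter {

  public static void markDeleted(Map<String, Object> record) {
    Object id = record.get("artifact_id"); // assumed identifier field
    // Point readers at the archived copy instead of the now-missing live page.
    record.put("view_url", "https://example.org/artifact_archival/" + id);
    record.put("status", "archived");
  }
}
```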

PeterCiuffetti commented 3 years ago

This will have to be redeveloped in Java, so it's on the edge of a medium. The working Python version is about 600 lines of code including comments.