coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Deploy multiple CoherenceBot clusters, one per major region #6

Closed PeterCiuffetti closed 3 years ago

PeterCiuffetti commented 3 years ago

Set up Elastic Map Reduce clusters in the following regions:

The number of URLs discovered within each of these regions will probably not be proportional, so US and EU might need more than one.

This will require coming up with a way to distribute the seed URLs to each cluster and avoid duplication.
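One possible way to split the seeds, sketched under assumptions rather than taken from what was actually deployed: hash each URL's host and use the hash to pick a cluster, so every URL for a given host lands on exactly one cluster and exact duplicates are dropped. The function names and the cluster labels are hypothetical. This is plain hash partitioning; it does not try to work out which region a site is geographically closest to.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_cluster(url, clusters):
    """Pick one cluster for a URL from a stable hash of its host.

    hashlib is used instead of the built-in hash() because the built-in
    is randomized per process and would give different shards on each run.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return clusters[int.from_bytes(digest[:8], "big") % len(clusters)]


def partition_seeds(urls, clusters):
    """Group seed URLs by cluster, skipping blanks and exact duplicates."""
    seen = set()
    shards = defaultdict(list)
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        shards[assign_cluster(url, clusters)].append(url)
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical cluster labels; a region that needs more capacity
    # could simply appear under more than one label.
    clusters = ["us-1", "us-2", "eu-1", "ap-1"]
    seeds = ["https://example.org/a", "https://example.org/b", "https://example.com/"]
    for cluster, shard in partition_seeds(seeds, clusters).items():
        print(cluster, shard)
```

Each shard could then be written out as that cluster's seed file. Changing the cluster list reshuffles most hosts, so the list would need to stay fixed for the life of a crawl.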

And we will need a deploy mechanism so each cluster can be updated without too much hassle.

It will also require exporting results back to the centralized S3 bucket in us-east-2, so check that this path is achievable.
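As a rough illustration of the export path, the sketch below builds an EMR step definition that runs `s3-dist-cp` through `command-runner.jar` to copy a cluster's output to a destination bucket. The function name, step name, and the URIs in the usage example are hypothetical. The dictionary follows the step shape EMR accepts (for example via boto3's `add_job_flow_steps`), but whether the cross-region write actually succeeds depends on the cluster's IAM role and the bucket policy, which this code does not check.

```python
def export_step(src_uri, dest_uri, name="export-crawl-results"):
    """Build an EMR step definition that copies src_uri to dest_uri with s3-dist-cp."""
    for uri in (src_uri, dest_uri):
        if not uri.startswith(("s3://", "hdfs://")):
            raise ValueError(f"unsupported URI: {uri}")
    return {
        "Name": name,
        # Keep the cluster running if the copy fails so it can be retried.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src_uri, "--dest", dest_uri],
        },
    }


if __name__ == "__main__":
    # Hypothetical paths: a regional cluster's HDFS output and a
    # placeholder name for the central bucket in us-east-2.
    step = export_step(
        "hdfs:///crawl/output",
        "s3://example-central-bucket/crawl/eu/",
    )
    print(step)
```

Submitting one such step from a small test cluster in each region would be a cheap way to confirm the permissions before relying on the path.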

PeterCiuffetti commented 3 years ago

Marking this complexity M mainly because you know AWS is going to throw some funky shite at me dealing with cross-datacenter permissions.

PeterCiuffetti commented 3 years ago

We have clusters in US (Ohio), EU (Frankfurt) and AP (Tokyo), and I have tested these with small seed lists.