coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Research and Document Elastic Map Reduce sizing, cost of operation and other dev ops parameters #7

Closed PeterCiuffetti closed 3 years ago

PeterCiuffetti commented 3 years ago

I need to do a little more research and documentation on managing EMR clusters from a dev ops perspective.

Among the things I do not know how to do:

I should also study the cost of operation more closely so we can budget CoherenceBot on a cost-per-net-PDF-harvested basis. Then, if a new site with x thousand documents is added to the list, we can estimate what it costs on a per-site basis.
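A rough sketch of that budgeting arithmetic, in Python. Everything here is an assumption for illustration: the function names are hypothetical, the hourly rate, crawl duration and PDF counts are made-up placeholders rather than measured CoherenceBot figures, and it treats cluster cost as simply hourly rate times hours, ignoring storage, data transfer and other charges.

```python
def cost_per_pdf(cluster_cost_per_hour: float, crawl_hours: float, net_pdfs: int) -> float:
    """Total cluster cost for a crawl divided by the net PDFs it harvested."""
    if net_pdfs <= 0:
        raise ValueError("net_pdfs must be positive")
    return cluster_cost_per_hour * crawl_hours / net_pdfs


def estimate_site_cost(unit_cost: float, expected_pdfs: int) -> float:
    """Projected cost of a new site, given a per-PDF unit cost."""
    return unit_cost * expected_pdfs


# Placeholder inputs (assumptions, not real EMR prices or crawl results).
unit = cost_per_pdf(cluster_cost_per_hour=2.0, crawl_hours=10.0, net_pdfs=40_000)

# A hypothetical new site expected to yield 5 thousand documents.
site_cost = estimate_site_cost(unit, expected_pdfs=5_000)

print(f"cost per net PDF: ${unit:.4f}")
print(f"estimated cost for new site: ${site_cost:.2f}")
```

Once real billing data for a few crawls is available, the unit cost could be averaged across crawls instead of taken from a single run.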

There are many other knobs and levers we should expect to need down the road.

PeterCiuffetti commented 3 years ago

Marking this as complexity M ... it's probably 2 or 3 days of research, setup, trial and error. I also want to document what I learn so others can take on CoherenceBot dev ops tasks.

PeterCiuffetti commented 3 years ago

See Google Doc "CoherenceBot Dev Ops" https://docs.google.com/document/d/14eQ80aA-m5x30dUeh2hiJWx29eZeBZpJVGloV_9k0As/edit#