The idea is that, when invoked, maintenance mode stops all harvesters and all analytic engine triggering
So a clean cluster shutdown would look like:
Set maintenance mode
Stop the API (can we do better than just "404ing"? I think the UI could usefully be written for this, eg when maintenance mode is invoked it knows to treat "AJAX" failures as "still in maintenance mode" and display a useful message - eg a string registered when maintenance mode was set)
Wait some time period (5(?) minutes) for existing jobs to work their way through the system
Should we have some "have all batch jobs finished" flag?
I'm assuming we'd leave the real-time enrichment/analytics jobs running, and let the underlying technologies manage that? Or maybe specify in the flag which analytic/enrichment technologies to restart and which to leave running
Then once the maintenance is complete, unflag the maintenance mode (and restart the API); this restarts all the harvester jobs, which for all correctly written harvesters (*) should resume from where they left off
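The "wait some time period" step and the proposed "have all batch jobs finished" flag could be sketched as a simple poll. This is a hypothetical helper, not existing code: the `jobs_running` callable stands in for whatever actually tracks in-flight batch jobs in the cluster.

```python
import time

def wait_for_batch_jobs(jobs_running, timeout_s=300, poll_s=5):
    """Poll until no batch jobs remain, or the grace period (default
    5 minutes, per the notes above) expires. `jobs_running` is a
    hypothetical callable returning the number of in-flight jobs."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if jobs_running() == 0:
            return True   # all batch jobs finished cleanly
        time.sleep(poll_s)
    return False          # grace period expired with jobs still running
```

A shutdown script would then branch on the return value, eg warn the operator (or refuse to proceed) if jobs were still running when the grace period expired.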
For v1.5 clusters, there'd also be the following 3 steps:
Turn harvesters off
Create the STOPALL touch file in /opt/infinite-home/bin
Stop logstash
(And the reverse when restarting)
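The STOPALL touch-file step (and its reverse on restart) could look like the sketch below. The directory comes from the notes above; the function names are assumptions, and stopping/starting logstash itself would sit alongside these calls.

```python
from pathlib import Path

# Harvesters on v1.5 clusters stop when they see this file (per the notes).
STOPALL = Path("/opt/infinite-home/bin/STOPALL")

def enter_maintenance(stopall=STOPALL):
    stopall.touch()                  # tell harvesters to halt

def exit_maintenance(stopall=STOPALL):
    stopall.unlink(missing_ok=True)  # reverse the step when restarting
```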
(*) note that obviously any passive streaming harvesters would not survive this, eg if you're listening on UDP syslog. If the node is being rebooted then there's not much you can do there. (I suppose for rolling restarts you could duplicate the data across 2 instances of the harvester, which then use ZK to decide which one actually does anything with the data. So there are potential workarounds, but in the short term at least, requiring "zero loss" scenarios to use file-based transport seems better.)
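The rolling-restart workaround above - two duplicated harvester instances, with ZK deciding which one processes the stream - boils down to a leader election. A minimal sketch of the decision rule, shown against a plain list of node names so it runs standalone (in practice these would be ephemeral sequential nodes read from ZooKeeper, eg via a client library):

```python
def is_active(my_node: str, all_nodes: list[str]) -> bool:
    """The instance holding the lowest-sequenced node processes the
    data; the other stays a hot standby receiving the same stream."""
    return my_node == min(all_nodes)

# Two duplicated harvester instances registered for the same UDP feed
# (names are illustrative ZooKeeper sequential-node names):
nodes = ["instance-0000000007", "instance-0000000008"]
print(is_active("instance-0000000007", nodes))  # True  - this one processes
print(is_active("instance-0000000008", nodes))  # False - hot standby
```

When the active instance's node disappears (eg its host is rebooted), the standby becomes the minimum and takes over without dropping the stream.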
Make this feature API driven so it can be scripted to put it in maintenance mode from a cli script (using valid authentication)
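A CLI script driving such an API might build its request like the sketch below. The endpoint path, JSON body fields, and header-based auth are all assumptions standing in for whatever the real API exposes - the point is just that the whole flow is scriptable.

```python
import json
import urllib.request

def build_request(base_url, enable, api_key, message=""):
    """Build a POST request toggling maintenance mode. The endpoint,
    field names, and X-Api-Key auth scheme are hypothetical."""
    body = json.dumps({"maintenance": enable, "message": message}).encode()
    return urllib.request.Request(
        base_url + "/admin/maintenance",       # assumed endpoint
        data=body,
        headers={"Content-Type": "application/json",
                 "X-Api-Key": api_key},        # assumed auth scheme
        method="POST",
    )

# eg from an ops script (not executed here):
# urllib.request.urlopen(build_request(
#     "http://api.cluster.local", True, key, "Down for upgrade"))
```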
API or not, would it be run on all nodes, or would one node be suitable?
Status page to see what jobs are still running or which ones need to be manually turned off, or restarted etc.
Might be better for this to be a database row rather than a file that blocks services from picking up work. That would make cleanup easier, although it causes a query on each job run to see whether the cluster is in maintenance mode (this should be cached, and shouldn't really introduce much performance impact as it's only checked at the beginning of a job run). If status is maintained for a status page, the status of jobs will probably also need to be updated in the DB.
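The cached DB check described above could be sketched like this. The `query_db` callable is a placeholder for the real database lookup; the TTL value is an arbitrary illustration.

```python
import time

class MaintenanceFlag:
    """Caches the DB-backed maintenance flag so each job-start check
    is usually just a memory read, not a query."""

    def __init__(self, query_db, ttl_s=30.0):
        self._query_db = query_db   # hypothetical: True if in maintenance mode
        self._ttl_s = ttl_s
        self._cached = None
        self._fetched_at = float("-inf")

    def active(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self._fetched_at >= self._ttl_s:
            self._cached = self._query_db()  # refresh from the DB row
            self._fetched_at = now
        return self._cached

# At the start of each job run:
#   if flag.active(): skip this run (cluster is in maintenance mode)
```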