Things to watch for:
Follow up with @gonzalomerino on more concrete functionality. Maybe break this into multiple tickets.
Yes, maybe we can break this into multiple tickets, as the need becomes clearer while we discuss this further...
A possibility could be:
1) site health monitoring information: checks at glidein startup (NOTE: is this issue #104?)
Part of this could be trying to build a "site status monitoring page" that shows sites where the basic things work as OK/green and sites with basic failures as BAD/red.
One convenient place to run these basic tests is at the start of the glidein. The idea would be to come up with a list of basic tests, and then find the best tools to ship the output back to a central dashboard. Graphite could be an option if we want to ship simple numeric data, but I think it might be nice to also be able to ship back short text snippets with error messages and so on. Maybe some sort of messaging system that can feed elasticsearch? (ZeroMQ?)
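If ZeroMQ were the transport, shipping a short snippet from the worker node could look roughly like this (a minimal sketch, not pyglidein code; the collector endpoint and message fields are invented for illustration):

```python
# Minimal sketch: push a short error snippet from a worker node to a
# hypothetical central collector over a ZeroMQ PUSH socket.
import json
import socket
import time

import zmq  # pyzmq

COLLECTOR = "tcp://monitoring.example.org:5555"  # placeholder endpoint

def send_snippet(check_name, status, message=""):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.setsockopt(zmq.LINGER, 2000)  # allow up to 2s to flush on close
    sock.connect(COLLECTOR)
    sock.send_json({
        "host": socket.gethostname(),
        "time": time.time(),
        "check": check_name,
        "status": status,      # "OK" or "FAIL"
        "message": message,    # short free-text error snippet
    })
    sock.close()
    ctx.term()

# send_snippet("cvmfs", "FAIL", "/cvmfs/icecube.opensciencegrid.org not mounted")
```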
Basic tests that we could do:
network connectivity test: check the basic needed ports
CVMFS
run a simple IceTray job (1 or 2 min duration)
The idea is that the glidein would run these at startup and send back a report (a JSON message, I guess) with the output, e.g. OK,OK,OK or OK,FAIL,FAIL, along with error messages.
Something like this.
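As a very rough sketch (not actual glidein code; the test commands, site name, and the IceTray wrapper script are placeholders):

```python
# Run each startup check, collect OK/FAIL plus an error message, and emit
# one JSON report for the whole glidein.
import json
import socket
import subprocess
import time

def run_check(name, cmd, timeout):
    try:
        p = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        ok = (p.returncode == 0)
        return {"check": name, "status": "OK" if ok else "FAIL",
                "message": "" if ok else p.stderr.strip()[:500]}
    except Exception as e:
        return {"check": name, "status": "FAIL", "message": str(e)}

checks = [
    ("network", ["nc", "-z", "-w", "5", "glidein-server.example.org", "9618"], 30),
    ("cvmfs",   ["ls", "/cvmfs/icecube.opensciencegrid.org"], 60),
    ("icetray", ["./run_test_icetray_job.sh"], 180),  # hypothetical wrapper script
]

report = {
    "site": "SITE_NAME",            # would come from the glidein config
    "host": socket.gethostname(),
    "time": time.time(),
    "results": [run_check(*c) for c in checks],
}
print(json.dumps(report))           # ship via whatever channel we settle on
```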
2) client.py monitoring (or pre-glidein): here, it would be interesting to monitor things like the number of running and queued glideins, as well as the maximum queued time for the queued glideins. A client.py heartbeat could also be part of this monitoring.
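For the HTCondor submit case, those numbers could be pulled with the htcondor Python bindings, roughly like this (a sketch; the constraint used to pick out our glideins is an assumption about how client.py tags its jobs):

```python
# Count running/queued glideins and the longest queue wait from the schedd.
import time
import htcondor

schedd = htcondor.Schedd()
ads = schedd.query(
    constraint='JobBatchName == "glidein"',   # hypothetical tag for our glidein jobs
    projection=["JobStatus", "QDate"],
)

now = time.time()
running = sum(1 for ad in ads if ad["JobStatus"] == 2)   # 2 = Running
idle    = [ad for ad in ads if ad["JobStatus"] == 1]     # 1 = Idle (queued)
max_queued_sec = max((now - ad["QDate"] for ad in idle), default=0)

metrics = {"glideins.running": running,
           "glideins.queued": len(idle),
           "glideins.max_queued_seconds": int(max_queued_sec)}
print(metrics)   # these could go to graphite, alongside a client.py heartbeat
```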
3) glidein slot monitoring (startd monitoring... sort of): Some ideas related to this...
I hope we can monitor quite a lot of this using the standard condor_status slot information shoveled into graphite or elasticsearch via FIFEMON, or our extensions to FIFEMON. One goal would be to monitor the "unmatched time" for slots: basically, to detect glideins that start, sit idle for 20 minutes without ever being matched to work, and then decide to shut down.
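Detecting those long-unmatched slots from the collector could look roughly like this (a sketch using the htcondor Python bindings; FIFEMON gathers similar ClassAds, and the 20-minute threshold and the GLIDEIN_Site attribute are assumptions about how our startds are configured):

```python
# Flag startd slots that have been Unclaimed/Idle for longer than 20 minutes.
import time
import htcondor

coll = htcondor.Collector()   # defaults to the configured central manager
slots = coll.query(
    htcondor.AdTypes.Startd,
    projection=["Name", "State", "Activity", "EnteredCurrentActivity", "GLIDEIN_Site"],
)

now = time.time()
for slot in slots:
    if slot.get("State") == "Unclaimed" and slot.get("Activity") == "Idle":
        unmatched = now - slot.get("EnteredCurrentActivity", now)
        if unmatched > 20 * 60:
            print("%s (%s) unmatched for %d min"
                  % (slot.get("Name"), slot.get("GLIDEIN_Site", "?"), unmatched // 60))
```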
I do not know if this is 100% related to this, but in the past we have talked about having the ability to bring STARTD logs back to a central place so that we could debug issues.
One key question is whether we're making graphite and elasticsearch world writable, or keeping them confined to WIPAC. We can dump directly to them, but that might be a security issue (especially since these aren't backed up).
I do not know much about designing scalable distributed systems, but it sounds to me that just opening the graphite port to the world, and having potentially thousands of processes from anywhere in the world sending data there, is going to have some issues.
Isn't this a classic use case for a messaging system? Can we consider using something like Apache Kafka for sending information to logstash/ES in a reliable way?
(Note that I am talking here about point 1 in my previous comment. My question still stands: is this issue #104? If so, we should move this discussion there.)
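For what it's worth, the producer side of shipping those JSON reports through Kafka could look roughly like this (a sketch using the kafka-python package; the broker address and topic name are made up, and exposing this off-site would still need authentication/TLS on top):

```python
# Publish a site-check report to a Kafka topic that logstash/ES could consume.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.org:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    acks="all",           # wait for the broker to confirm the write
    retries=5,            # let the client retry transient failures
)
producer.send("glidein-site-checks", {"site": "SITE_NAME", "results": []})
producer.flush()
```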
Let us split this ticket into 3 (and close this one) to be more focused:
1) monitoring from client.py: issue #110
2) monitoring for glidein startup (before the startd connects): issue #104
3) glidein monitoring post-startd: issue #111
Ideally, we would like to have a monitoring dashboard that would show us if things are OK or BAD at every site.
This way, a daily check would tell us which sites might have config issues or black hole nodes. Or, if everything shows as RED, we will know that we have a problem on the server side (glidein server, condor CM, GridFTP servers at Madison, etc.).
We need to find the best way to send this monitoring information back. Maybe try graphite? ... and/or logstash/elasticsearch?
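If graphite ends up being the sink for the numeric side of this, pushing a metric is just the plaintext protocol, "<path> <value> <timestamp>\n", sent to port 2003; something like this (host and metric path are placeholders):

```python
# Send one metric line to graphite's plaintext receiver (default port 2003).
import socket
import time

def send_to_graphite(path, value, host="graphite.example.org", port=2003):
    line = "%s %s %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(line.encode("ascii"))

# send_to_graphite("glideins.site.SITE_NAME.cvmfs_ok", 1)
```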