ISISComputingGroup / IBEX

Top level repository for IBEX stories

GUI: Add Health check; including Block Server is not running #3752

Open John-Holt-Tessella opened 5 years ago

John-Holt-Tessella commented 5 years ago

As a VESUVIO instrument scientist I want to know if either my block or instrument archiver is not running so that I can take remedial action.

During cycle, Nagios already alerts on this, but the instrument scientist would like to be informed as well.

Preferably something similar to the error users see when the block server is not up.

NB: The check should be that the service is healthy, not just that the process exists.

Acceptance Criteria

  1. Finish defining the acceptance criteria with what we expect
  2. Consider the distribution methods
  3. Nagios may be enough on its own

FreddieAkeroyd commented 5 years ago

Would they like more than just a GUI warning?

John-Holt-Tessella commented 5 years ago

I don't think so, no.

ChrisM-S commented 5 years ago

Is there a particular use case for this? I'm a bit puzzled by this deconstruction of the server: can the IBEX server be "running" successfully but still have the archive and/or block servers not running? Might it be better to simply "alert" that the server is not running correctly (an OR of the main indicators we see on Nagios), then link to individual server alarm statuses, which might be used to diagnose a fault (or even fix it automatically), and suggest a restart of the server?

FreddieAkeroyd commented 5 years ago

Agreed, the scientist probably doesn't care which component is not running; they just need to know things are not well somewhere. INSTETC is important too, for example, so a combination of key process statuses (as in Nagios) would be most useful to them.
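
As a rough illustration of the "OR of the main indicators" idea, here is a minimal Python sketch. It assumes the per-process health flags have already been gathered somehow; the process names other than INSTETC and both function names are hypothetical, not taken from IBEX.

```python
from typing import Dict, List


def overall_status(process_healthy: Dict[str, bool]) -> str:
    """Return "OK" only if every key process is healthy, otherwise "FAULT"."""
    return "OK" if all(process_healthy.values()) else "FAULT"


def faulty_processes(process_healthy: Dict[str, bool]) -> List[str]:
    """List the unhealthy processes, for anyone who wants to diagnose further."""
    return sorted(name for name, ok in process_healthy.items() if not ok)


# Illustrative input only: BLOCK_SERVER and BLOCK_ARCHIVER are made-up names.
example = {"INSTETC": True, "BLOCK_SERVER": True, "BLOCK_ARCHIVER": False}
print(overall_status(example), faulty_processes(example))
```

The scientist would only see the single overall value; the per-process list is there for whoever follows the link to the individual statuses.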

ThomasLohnert commented 5 years ago

Duplicate (sort of, based on the discussion here): https://github.com/ISISComputingGroup/IBEX/issues/1478

John-Holt-Tessella commented 5 years ago

OK Can we make this:

Also, the implementation of this may be through one or more PVs in the block server itself.

John-Holt-Tessella commented 5 years ago

Not a duplicate of #1478; this is just for the server itself with no GUI running, although they could and should be backed by the same mechanism.

GDH-ISIS commented 5 years ago

Does the block archive still re-start at the end of every run?

FreddieAkeroyd commented 5 years ago

@John-Holt-Tessella don't we need to cover more than just the archivers to determine correct system state? And we need to be careful not to confuse a user too much with details.

Tom-Willemsen commented 5 years ago

We need to get a generalised error message to the server.

ChrisM-S commented 5 years ago

There is a sort of Russell’s Paradox in the server trying to monitor itself, particularly for faults which it might not be able to flag because of the fault.

The client, on the other hand, is in an ideal position (like Nagios) to just look for things (PVs mainly) which should be there, and if they are not it can flag the bit(s) which are missing/faulty (or make some logical deductions). It can comfortably spot things by omission: no block PVs, must be a problem with blocks; no archive PVs, must be an issue with the archiver; no server responses at all, no server running.

PS Liked Thomas’ simple traffic light style system in #1478. If the client showed something simple like this, the purpose of this current ticket would be served well.
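
The check-by-omission deduction described above could be sketched roughly as follows. This is only an assumption-laden sketch: the PV names are invented, and `read_pv` stands in for whatever channel access call the client uses, assumed here to return `None` when a PV cannot be reached.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical PV names grouped by the component they indicate.
EXPECTED_PVS: Dict[str, List[str]] = {
    "blocks": ["DEMO:BLOCK:A"],
    "archiver": ["DEMO:ARCHIVE:STATUS"],
}


def diagnose(
    pv_groups: Dict[str, List[str]],
    read_pv: Callable[[str], Optional[object]],
) -> List[str]:
    """Return the components whose PVs are missing.

    If every group is missing, report the server as a whole instead.
    """
    missing = [
        group
        for group, pvs in pv_groups.items()
        if any(read_pv(pv) is None for pv in pvs)
    ]
    if missing and len(missing) == len(pv_groups):
        return ["server"]
    return missing


# A dict's .get method serves as a fake PV reader for the example.
print(diagnose(EXPECTED_PVS, {"DEMO:BLOCK:A": 1.0}.get))
```

Passing the reader in as a callable keeps the deduction logic separate from the transport, so the same function could be exercised without a running server.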

FreddieAkeroyd commented 5 years ago

The following are important components of IBEX and you may want to know if they are not there:

KathrynBaker commented 5 years ago

Given that list, which is a mix of IBEX server and non-server items (that is, things that aren't directly part of IBEX, such as mysql), what we probably really need is for each server to have a monitoring service which runs independently. This service can then supply the information to ANY client on startup/request, so the GUI could have a set of traffic lights in an OPI, genie_python could check on this when certain commands are run, and the dataweb could have some traffic lights too. A client can then say "monitoring down" if it can't get to the service, and can display the missing items if it can. This avoids the server checking the server, does not put system logic into the client (which might not be running), and can probably easily reuse the Nagios checks.
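
On the client side, the behaviour described above might look something like this sketch. It assumes the monitoring service can be queried through some function (`fetch_report` here, a hypothetical name) that returns a mapping of item name to health flag and raises a connection error when the service is unreachable. The GREEN/AMBER/RED mapping is my assumption, not an agreed design.

```python
from typing import Callable, Dict, List, Tuple


def traffic_light(
    fetch_report: Callable[[], Dict[str, bool]],
) -> Tuple[str, List[str]]:
    """Turn a monitoring report into a (colour, details) pair for display."""
    try:
        report = fetch_report()
    except OSError:  # ConnectionError and socket timeouts are OSError subclasses
        return ("RED", ["monitoring down"])
    missing = sorted(name for name, ok in report.items() if not ok)
    if missing:
        return ("AMBER", missing)
    return ("GREEN", [])


def unreachable() -> Dict[str, bool]:
    """Stand-in for a monitoring service that cannot be contacted."""
    raise ConnectionError("monitoring service unreachable")


print(traffic_light(unreachable))
print(traffic_light(lambda: {"INSTETC": True, "mysql": False}))
```

The client holds no knowledge of which items matter; it only renders whatever the service reports, which is the point of keeping system logic out of the client.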

John-Holt-Tessella commented 5 years ago

Idea from @Tom-Willemsen: include this as part of the config checker so we get a warning here. What else would we do?

KathrynBaker commented 3 years ago

Point after the meeting

kjwoodsISIS commented 3 years ago

How many points?