GUI: Add Health check; including Block Server is not running

John-Holt-Tessella commented 5 years ago

As a VESUVIO instrument scientist I want to know if either my block or instrument archiver is not running so that I can take remedial action.

In cycle, Nagios alerts on this, but the instrument scientist would like to know too.

Preferably something similar to the error users see when the block server is not up.

NB: The check should be that it is healthy, not just that the process is around.

Acceptance Criteria

Finish defining the acceptance criteria with what we expect
Consider the distribution methods
Nagios may be enough on its own

FreddieAkeroyd commented 5 years ago

Would they like more than just a GUI warning?

John-Holt-Tessella commented 5 years ago

I don't think so, no.

ChrisM-S commented 5 years ago

Is there a particular use case for this? I'm a bit puzzled at this deconstruction of the server: can the IBEX server be "running" successfully but still have the archive and/or block servers not running? Might it be better to simply "alert" that the server is not running correctly (an OR of the main indicators we see on Nagios) and then link to individual server alarm statuses which might be used to diagnose a fault (or even fix it automatically) and suggest a restart of the server?

FreddieAkeroyd commented 5 years ago

Agreed, the scientist probably doesn't care which component is not running; they just need to know things are not well somewhere. INSTETC is important too, for example, so a combination of key process statuses (as in Nagios) would be most useful to them.

ThomasLohnert commented 5 years ago

Duplicate (sort of, based on the discussion here): https://github.com/ISISComputingGroup/IBEX/issues/1478

John-Holt-Tessella commented 5 years ago

OK Can we make this:

On block archiver not running and block server running, the GUI shows a pop-up modal message box reading "IBEX server component not running. Blocks will not be written to your NeXus file. Please restart the server". This should be acknowledgeable. There should be an error on the banner saying "server error".
On instrument archiver not running and block server running, the GUI shows a pop-up modal message box reading "IBEX server component not running. PVs, not blocks, will not be archived. Please restart the server". There should be an error on the banner saying "server error".

Also, the implementation of this may be through one or more PVs in the block server itself.
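A minimal sketch of the two rules proposed above, for discussion only. The function and parameter names are hypothetical, the three booleans stand in for whatever PV(s) end up reporting component health, and the message wording follows the proposal.

```python
from typing import List, Optional

# Wording follows the proposal in this comment; all names are hypothetical.
PREFIX = "IBEX server component not running. "
SUFFIX = " Please restart the server"
BLOCK_ARCHIVER_MSG = PREFIX + "Blocks will not be written to your NeXus file." + SUFFIX
INST_ARCHIVER_MSG = PREFIX + "PVs, not blocks, will not be archived." + SUFFIX
BANNER_TEXT = "server error"


def archiver_alerts(block_server_up: bool,
                    block_archiver_up: bool,
                    inst_archiver_up: bool) -> List[str]:
    """Return the modal messages the GUI should show under the two rules."""
    if not block_server_up:
        # Both rules only apply while the block server is running;
        # the existing block-server-down error covers the other case.
        return []
    alerts = []
    if not block_archiver_up:
        alerts.append(BLOCK_ARCHIVER_MSG)
    if not inst_archiver_up:
        alerts.append(INST_ARCHIVER_MSG)
    return alerts


def banner(alerts: List[str]) -> Optional[str]:
    """Banner text to display, or None when there is nothing to report."""
    return BANNER_TEXT if alerts else None
```

For example, `archiver_alerts(True, False, True)` returns just the block archiver message, and `banner(...)` of that result gives "server error".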

John-Holt-Tessella commented 5 years ago

Not a duplicate of #1478: this is just for the server itself with no GUI running. Although they could and should be backed by the same mechanism.

GDH-ISIS commented 5 years ago

Does the block archive still re-start at the end of every run?

FreddieAkeroyd commented 5 years ago

@John-Holt-Tessella don't we need to cover more than just the archivers to determine correct system state? And we need to be careful not to confuse a user too much with details.

Tom-Willemsen commented 5 years ago

We need to get a generalised error message to the server.

ChrisM-S commented 5 years ago

There is a sort of Russell’s Paradox in the server trying to monitor itself – particularly for faults which it might not be able to flag because of the fault.

The client on the other hand is in an ideal position (like Nagios) to just look for things (PVs mainly) which should be there and if not it can flag the bit(s) which are missing/faulty (or make some logical deductions). It can comfortably spot things by omission – no block PVs?, must be a problem with Blocks, no archive PVs, must be an issue with the archiver, no server responses at all, no server running.

PS Liked Thomas’ simple traffic light style system in #1478. If the client showed something simple like this, the purpose of this current ticket would be served well.
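The "spot things by omission" idea in this comment could be sketched roughly as below. The probe PV names are made up, and `pv_exists` is a hypothetical callable that the client would back with its own channel access lookup.

```python
from typing import Callable, Dict, List

# Hypothetical probe PVs, one per component; real names would differ.
PROBES: Dict[str, str] = {
    "blocks": "EXAMPLE:BLOCKS:HEARTBEAT",
    "archiver": "EXAMPLE:ARCHIVER:HEARTBEAT",
}


def diagnose(pv_exists: Callable[[str], bool],
             probes: Dict[str, str] = PROBES) -> List[str]:
    """Deduce faults from which expected PVs are missing."""
    missing = [name for name, pv in probes.items() if not pv_exists(pv)]
    if probes and len(missing) == len(probes):
        # Nothing answered at all, so assume the whole server is down.
        return ["no server responses: server not running"]
    return [f"problem with {name}" for name in missing]
```

With only the blocks PV present, `diagnose(lambda pv: pv == "EXAMPLE:BLOCKS:HEARTBEAT")` returns `["problem with archiver"]`.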

FreddieAkeroyd commented 5 years ago

The following are important components of IBEX and you may want to know if they are not there:

block and inst archiver
block gateway
dae ioc
instetc ioc
procservcontrol ioc
mysql
block server
database server
alarm server
runctrl ioc
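As a rough illustration of combining the list above into the single "things are not well somewhere" indicator discussed earlier, here is a sketch that treats the server as healthy only when every listed component reports healthy. The `statuses` mapping and how it gets filled in are assumptions; a component missing from the report is counted as failing.

```python
from typing import Dict, List, Tuple

# Component names copied from the list above (archivers split into two entries).
KEY_COMPONENTS = (
    "block archiver", "inst archiver", "block gateway", "dae ioc",
    "instetc ioc", "procservcontrol ioc", "mysql", "block server",
    "database server", "alarm server", "runctrl ioc",
)


def summarise(statuses: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (all_healthy, failing_components) for the key components."""
    # A component with no reported status is treated as not healthy.
    failing = [c for c in KEY_COMPONENTS if not statuses.get(c, False)]
    return (not failing, failing)
```

The boolean can drive a single warning for the scientist, while the list of failing components stays available for diagnosis.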

KathrynBaker commented 5 years ago

Given that list, which is a mix of IBEX server and non-server items (i.e. things that aren't directly part of IBEX, such as mysql), what we probably really need is for each server to have a monitoring service which runs independently. This service can then supply the information to ANY client on startup/request, so the GUI could have a set of traffic lights in an OPI, genie_python could check on this when certain commands are run, and the dataweb could have some traffic lights too. Clients would then say "monitoring down" if they can't get to the service, and display the missing items if they can. This avoids the server checking the server, does not put system logic into the client (which might not be running), and can probably easily reuse the Nagios checks.
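A sketch of what the client side of such a monitoring service might look like, under the assumption that the service returns a mapping of component name to healthy/unhealthy. `fetch` is a hypothetical callable for whatever transport the service ends up using; any connection failure is shown as "monitoring down" rather than guessed at.

```python
from enum import Enum
from typing import Callable, Dict


class Light(Enum):
    GREEN = "green"
    RED = "red"
    UNKNOWN = "monitoring down"


def traffic_lights(fetch: Callable[[], Dict[str, bool]]) -> Dict[str, Light]:
    """Turn a monitoring report into per-component traffic lights."""
    try:
        report = fetch()
    except OSError:
        # Covers ConnectionError and timeouts: the client cannot reach the
        # service, so it reports that fact instead of inferring server state.
        return {"monitoring": Light.UNKNOWN}
    return {name: Light.GREEN if ok else Light.RED
            for name, ok in report.items()}
```

The same function could sit behind the GUI, genie_python or the dataweb, since it holds no system logic beyond mapping the report to colours.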

John-Holt-Tessella commented 5 years ago

Idea from @Tom-Willemsen: include this as part of the config checker so we get a warning here. What else would we do?

KathrynBaker commented 3 years ago

Point after the meeting

kjwoodsISIS commented 3 years ago

How many points?

ISISComputingGroup / IBEX

GUI: Add Health check; including Block Server is not running #3752
