cylc / cylc-admin

Project planning for the Cylc Workflow Engine.
https://cylc.github.io/cylc-admin/
GNU General Public License v3.0

cylc server monitor #72

Open oliver-sanders opened 4 years ago

oliver-sanders commented 4 years ago

Writing this up here as it's uncertain where this code would live.

At the Met Office we have a pool of ~12~ 16 servers which suites can run on. To help us keep track of the health of these servers and the usage of Cylc on them we wrote a tool which provides a web dashboard with:

This is important functionality for larger sites, and there are lots of ways in which it can be improved (e.g. daily job counts).

This code is written in Python 2.6 and has to run in a bare environment, so it is kinda ugly and not especially portable. It needs a rewrite!

We should be able to re-implement this functionality within the Cylc UI/UI-Server infrastructure to provide an admin dashboard. That way this functionality would ship with Cylc and be available to all.

This would involve creating a dashboard for Cylc admins (we could make it accessible to all users). It would require an always-running UI Server under a specified account which, depending on site specifics, may need certain privileges to be effective. It would also need to maintain a database; sqlite3 is more than sufficient.

Infrastructure aside, the actual code component is pretty simple:

Infrastructure-wise:

kinow commented 4 years ago

I think for the UI we can actually leverage existing tools.

Graphite, Prometheus, Grafana, and so many other tools are able to digest this sort of information.

Our dashboard could then have dummy components that simply use these other libraries - or we could even simply use the tools in the UI.

These tools are also common in cloud deployments, so if the server side is able to expose the data in a format that Prometheus can ingest (for example), users would be able to choose their own monitoring and even alerting solution.

Just my 0.02 cents, but great idea and should be fun to implement.

oliver-sanders commented 4 years ago

Lots of fun plotting libraries we could use. Interesting point on alerting; the old system does this in the Python backend.

hjoliver commented 4 years ago

Grafana etc. are really nice; we should certainly look at using something like that (in due course) since you say a rewrite is needed anyway.

oliver-sanders commented 4 years ago

We could potentially keep the old frontend, but it wouldn't take long to rewrite, so let's do it properly!

Some screenshots of the old frontend for reference:

[Screenshots: exvcylc01, exvcylc02, exvcylc03]

Some issues hanging over from the old system transcribed from the old issue tracker (sticky notes on my desk):

Some screenshots of a Python3 CLI utility which works with the JSON dump files produced by the old system:

$ suitetool3 --latest
# 732 rows in dataset

0  Add field     Add derived field to the data set.               
1  Filter        Filter by field value.                           
2  View          Print all data                                   
3  Summary       Print the first few rows of data.                
4  Count         Count unique values for a given field.           
5  Debug         Insert pdb breakpoint                            
6  Export Data   Export the current dataset as a CSV file.        
7  Email Users   Send an email to all users present in the dataset
8  Stack Action  (undo, export, import)                           
9  Exit                                                           

Choose an action (int): 0

0  suite_dir       The FS location of the suite directory.               
1  root_dir        The FS mount which the suite is installed on.         
2  shared_account  True if the account is *likely* to be a shared account
3  suite_grep      Grep *.rc files against a pattern.                    
4  diff            Diff suites present at another checkpoint.            
5  cylc_tags       Tuple of taggs for the cylc_version                   
Choose a field (int): 1
[=============================================================================]
[=============================================================================]

# 732 rows in dataset

0  Add field     Add derived field to the data set.               
1  Filter        Filter by field value.                           
2  View          Print all data                                   
3  Summary       Print the first few rows of data.                
4  Count         Count unique values for a given field.           
5  Debug         Insert pdb breakpoint                            
6  Export Data   Export the current dataset as a CSV file.        
7  Email Users   Send an email to all users present in the dataset
8  Stack Action  (undo, export, import)                           
9  Exit                                                           

Choose an action (int): 4
Available fields "server, suite_id, user_name, user_id, cylc_version, memory, cpu, run_days, last_activity, suite_dir, root_dir"
Choose field: root_dir

field: root_dir
unique items: 4

frequency
---------
/net/home           413
/net/data           289
/net/spice/scratch  29 
/net/spice/project  1  

items
-----
/net/data|/net/home|/net/spice/scratch|/net/spice/project
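
The "Count" action shown above can be approximated in a few lines. This is a minimal sketch, not the code of the actual utility: the rows are made-up sample data and `count_field` / `format_counts` are hypothetical helper names.

```python
from collections import Counter

# Made-up rows for illustration; the real dataset has many more fields.
rows = [
    {"user_name": "alice", "root_dir": "/net/home"},
    {"user_name": "bob", "root_dir": "/net/data"},
    {"user_name": "carol", "root_dir": "/net/home"},
]


def count_field(rows, field):
    """Return (value, count) pairs for `field`, most frequent first."""
    return Counter(row[field] for row in rows).most_common()


def format_counts(counts):
    """Left-align the values in a column, followed by their counts."""
    if not counts:
        return ""
    width = max(len(str(value)) for value, _ in counts)
    return "\n".join(
        f"{str(value).ljust(width)}  {count}" for value, count in counts
    )


counts = count_field(rows, "root_dir")
print(f"unique items: {len(counts)}")
print(format_counts(counts))
```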

oliver-sanders commented 1 year ago

The neatest way to implement this is likely as a JupyterHub service.

This will allow us to run the extension with the hub account's privileges if necessary and provide integration with Cylc Hub. The most obvious place for the code to live is the cylc-uiserver repository (we can omit it from the standard installation by making it an optional dependency if desired).
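
For reference, registering a hub-managed service is a small amount of configuration. The sketch below builds the kind of dictionary that JupyterHub accepts in `c.JupyterHub.services`; the service name `cylc-monitor`, the `cylc_monitor` module and the port are assumptions made up for illustration, nothing like them exists yet.

```python
import sys


def monitor_service(port: int = 10101) -> dict:
    """Build a hub-managed service definition for a hypothetical monitor app."""
    return {
        "name": "cylc-monitor",  # hypothetical service name
        "url": f"http://127.0.0.1:{port}",  # where the service would listen
        # Hypothetical module; the hub would start and supervise this process.
        "command": [sys.executable, "-m", "cylc_monitor"],
    }


# In jupyterhub_config.py this would be registered with:
#     c.JupyterHub.services = [monitor_service()]
service = monitor_service()
print(service["name"], service["url"])
```

Because the hub launches the command itself, the service runs under the hub account, which is what would give it the privileges mentioned above.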

The service would scrape Cylc processes from ps listings (e.g. via psutil) and store the results in a housekept sqlite3 DB or in raw data files. It would register endpoints exposing this data for a lightweight web app.
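
A rough sketch of the scrape-and-store part, using only the standard library so it runs anywhere. The table layout, the function names, the "looks like Cylc" heuristic and the sample snapshot (including the `/opt/cylc/bin` path) are all assumptions for illustration; a real service would need to decide what to record and how to identify Cylc processes reliably.

```python
import sqlite3
import time


def is_cylc_process(cmdline):
    """Heuristic (an assumption): the executable or script name starts with 'cylc'."""
    return any(arg.rsplit("/", 1)[-1].startswith("cylc") for arg in cmdline[:2])


def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processes ("
        "time REAL, server TEXT, user_name TEXT, pid INTEGER, cmdline TEXT)"
    )


def record(conn, server, processes, now=None):
    """Store one snapshot; `processes` is an iterable of (pid, user, cmdline)."""
    now = time.time() if now is None else now
    rows = [
        (now, server, user, pid, " ".join(cmdline))
        for pid, user, cmdline in processes
        if is_cylc_process(cmdline)
    ]
    with conn:  # commits on success
        conn.executemany("INSERT INTO processes VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


def housekeep(conn, max_age_days=30, now=None):
    """Delete snapshots older than `max_age_days`; return the number removed."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "DELETE FROM processes WHERE time < ?", (now - max_age_days * 86400,)
        )
    return cur.rowcount


# With psutil installed, the snapshot could come from something like:
#     [(p.info["pid"], p.info["username"], p.info["cmdline"] or [])
#      for p in psutil.process_iter(["pid", "username", "cmdline"])]
# Here a made-up snapshot is used instead.
snapshot = [
    (101, "alice", ["python", "/opt/cylc/bin/cylc-run", "my.suite"]),
    (102, "bob", ["bash", "-l"]),
]
conn = sqlite3.connect(":memory:")
init_db(conn)
stored = record(conn, "exvcylc01", snapshot, now=1_000_000.0)
removed = housekeep(conn, max_age_days=30, now=1_000_000.0 + 31 * 86400)
print(stored, removed)
```

Passing `now` explicitly keeps the housekeeping logic easy to test; in the service itself both calls would just use the current time.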

oliver-sanders commented 1 year ago

This is worth a look: someone worked out how to "proxy" Grafana as a JupyterHub service - https://github.com/rcthomas/jupyterhub-prometheus-grafana

hjoliver commented 1 year ago

(For the record, we're now using the original "exvcylc" monitor at NIWA; it's super helpful.)