NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Saga-sized dashboard #291

Closed lars-t-hansen closed 8 months ago

lars-t-hansen commented 10 months ago

Sonar has been running on Saga for several years, so we have lots of historical data. The data are old-style (untagged, without GPU information, and with a cpu_pct field that is not very useful because it has the wrong semantics), but we have them and we can create a Saga dashboard in Jobanalyzer. This will be a new challenge: Saga has many more nodes (364) than Fox, so a linear list of nodes will not be practical.

EDIT: Most of this is working, see subsequent comments and PR #374. And data are coming in from Saga. Remaining tasks (evolving):

lars-t-hansen commented 8 months ago

372 nodes in these groups according to doc:

Node names appear to be c\d+-\d+, gpu-\d+-\d+, hugemem-\d+-\d+.
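A minimal sketch of how those name patterns could be used to sort nodes into groups. Only the three regexes come from the comment above; the group labels (`cpu`, `gpu`, `hugemem`) and the `classify_node` helper are hypothetical names chosen for illustration, not anything taken from Jobanalyzer.

```python
import re

# The patterns are the ones quoted above; the labels are illustrative assumptions.
NODE_PATTERNS = {
    "cpu": re.compile(r"c\d+-\d+"),
    "gpu": re.compile(r"gpu-\d+-\d+"),
    "hugemem": re.compile(r"hugemem-\d+-\d+"),
}


def classify_node(name):
    """Return the label of the first pattern matching the whole name, or None."""
    for label, pattern in NODE_PATTERNS.items():
        # fullmatch avoids accepting names that merely contain a valid prefix
        if pattern.fullmatch(name):
            return label
    return None
```

Using `fullmatch` rather than `match` means a name with trailing text after the second number is reported as unknown instead of being silently grouped.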

The ml/fox dashboard is a machine-centric view with the extra feature that nodes that are in trouble are sorted to the top. For bigger systems we'll probably need a selection feature, so that it's possible to look at a specific node when information about it is needed. Most of the time, though, the dash needs to have some selection criteria and show only the nodes that fit them. Criteria could be "trouble", "loaded", "idle", or "any", perhaps with a selector for the type of node and optionally a regex to match against the node name. I.e., the dash becomes a live search tool against the current state of the system. Node names would still be hyperlinked.
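The selection idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the `NodeState` fields, the load thresholds, and the `select_nodes` function are hypothetical and do not describe how Jobanalyzer actually stores or filters node state.

```python
import re
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    kind: str         # hypothetical node type label, e.g. "cpu" or "gpu"
    cpu_load: float   # assumed metric: fraction of capacity in use, 0.0 to 1.0
    in_trouble: bool  # assumed flag produced by some health check


# Illustrative thresholds, not values from the issue.
LOADED_THRESHOLD = 0.75
IDLE_THRESHOLD = 0.05


def matches_status(node, status):
    """Check one node against a status criterion."""
    if status == "any":
        return True
    if status == "trouble":
        return node.in_trouble
    if status == "loaded":
        return node.cpu_load >= LOADED_THRESHOLD
    if status == "idle":
        return node.cpu_load <= IDLE_THRESHOLD
    raise ValueError(f"unknown status criterion: {status}")


def select_nodes(nodes, status="any", kind=None, name_regex=None):
    """Filter by status, optional node type, and optional name regex."""
    pattern = re.compile(name_regex) if name_regex else None
    selected = [
        n for n in nodes
        if matches_status(n, status)
        and (kind is None or n.kind == kind)
        and (pattern is None or pattern.search(n.name))
    ]
    # Keep the existing behaviour of sorting troubled nodes to the top.
    return sorted(selected, key=lambda n: (not n.in_trouble, n.name))
```

For example, `select_nodes(nodes, status="loaded", kind="gpu")` would return only the heavily used GPU nodes, while the default arguments return every node with the troubled ones first.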

lars-t-hansen commented 8 months ago

Experimental code for this is live for Saga and Fox (currently Fox is the more impressive demo, as we don't have data for Saga quite yet).

lars-t-hansen commented 8 months ago

This is done. I'll spin off the matter of adding more aggregate plots as a separate low-pri bug.