NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Deploy sonar prototype to all ML nodes #19

Closed by lars-t-hansen 8 months ago

lars-t-hansen commented 1 year ago

There are really a couple of different deployments to discuss, so this is a major metabug with lots of other bugs to be filed.

First, we need sonar to run and log samples. For this we need:

Second, we need to run sonalyze against the logs manually to test that. For this we need:

Third, we want to run the analysis automatically and flag possible problems. For this we need:

Final deployment task list, rough chronological plan:

lars-t-hansen commented 1 year ago

Re "create logic for sonar to distribute the log", there is a shell script to invoke sonar here that we can modify for our use.
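The referenced script isn't reproduced here, but a cron-friendly wrapper for sonar might look like the sketch below. The paths, the per-host-per-day log layout, and the placeholder for the sonar invocation are all illustrative assumptions, not the actual script.

```shell
#!/bin/sh
# Hypothetical cron wrapper for sonar sampling. Paths are assumptions;
# the real script lives in the sonar repository.
set -e

LOGROOT="$HOME/sonar/data"     # root of the sample tree (assumed layout)
HOST=$(hostname)
DATE=$(date +%Y/%m/%d)         # one directory per day, one file per host
LOGDIR="$LOGROOT/$DATE"

mkdir -p "$LOGDIR"
# Append one batch of samples. The echo is a stand-in: on a node where
# sonar is installed, this line would instead run the sonar sampler.
echo "sample from $HOST at $(date +%H:%M)" >> "$LOGDIR/$HOST.csv"
```

Keeping the date in the directory path rather than the file name makes it cheap to archive or expire whole days and months later.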

lars-t-hansen commented 1 year ago

Re "determine a place" and "special uid", for the prototype I'll just run sonar as myself and store data in a tree under my home directory. We can revisit when the bugs are ironed out.

lars-t-hansen commented 1 year ago

Re cron, I'm going to set up a cron job manually on all the systems for now.
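For reference, the manual per-node setup might look something like the crontab fragment below; the five-minute interval and the wrapper path are assumptions, not the actual configuration.

```shell
# Hypothetical crontab entry (installed with "crontab -e" on each node):
# run the sonar wrapper every 5 minutes and keep cron's own output
# in a log file for debugging.
*/5 * * * * $HOME/sonar/bin/run-sonar.sh >> $HOME/sonar/cron.log 2>&1
```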

lars-t-hansen commented 1 year ago

Sonar running on ml[1-4,6-8] under my user; let's let it run over the weekend...
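Fanning the cron setup out to ml[1-4,6-8] by hand can be scripted. The sketch below only prints the command that would run on each host; swapping the echo for a real ssh would perform the installation. The node list comes from the comment above; the crontab file path is a placeholder.

```shell
# Expand the ml[1-4,6-8] host list (ml5 is skipped, as in the comment)
# and show the per-host installation command.
for n in 1 2 3 4 6 7 8; do
  host="ml$n"
  echo "ssh $host 'crontab /path/to/sonar.crontab'"
done
```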

lars-t-hansen commented 1 year ago

Everything's been running fine for ~6 weeks under my user, both monitoring and analysis jobs, with data uploads to the web server and email being pushed to admin (me). It's probably time to move this setup from my user to a separate user. I think I will keep my own setup as a staging area.

Task list above has been updated.

Sabryr commented 1 year ago

Hello @lars-t-hansen , Thank you very much for your professional approach.

  1. We will ask for a no-login user for the ML nodes and for Fox. We need to contact Bart to get the user on the ML nodes and buZh on Fox.
  2. As for where to keep the data: is it possible to push it to GitHub.uio.no or Gitlab.sigma2.no? Maybe keep the past week's data in raw format and then update a monthly summary file. What do you think? If shared storage is better, we can use the EES shared mounts.

/itf-fi-ml/shared

Regards, Sabry

lars-t-hansen commented 1 year ago

@sabryr, I think it's a good idea to archive the data so that we can run long-range analyses, by and by. It's text and it compresses extremely well, so keeping compressed monthly (say) archives is not going to be a problem for anyone. (On the ML nodes I think we generate about 2MB of text every day, it compresses by > 90% IIRC.) It's actually easier to keep the raw data and regenerate all reports than it is to keep the reports or summaries, and not much more expensive - and it'll be much more flexible, as we evolve this system.
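Using the figures in the comment (about 2 MB of text per day, compressing by more than 90%), the storage cost of keeping all raw data can be sketched as back-of-the-envelope shell arithmetic:

```shell
# Storage estimate from the numbers above: ~2 MB raw text per day on
# the ML nodes, compressing by >90% (modeled here as a factor of 10).
RAW_PER_DAY_MB=2
RAW_PER_MONTH_MB=$((RAW_PER_DAY_MB * 30))                 # ~60 MB raw/month
COMPRESSED_PER_MONTH_MB=$((RAW_PER_MONTH_MB / 10))        # ~6 MB compressed/month
COMPRESSED_PER_YEAR_MB=$((COMPRESSED_PER_MONTH_MB * 12))  # ~72 MB compressed/year
echo "$RAW_PER_MONTH_MB $COMPRESSED_PER_MONTH_MB $COMPRESSED_PER_YEAR_MB"
```

So even a full year of compressed raw samples stays well under 100 MB per cluster, which supports the argument for keeping the raw data and regenerating reports.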

That said:

(For the latter point, the analysis code currently needs uncompressed data, but there's no reason why I couldn't fix that; it's been on my radar for a while.)

The discussion definitely ties into where we will run the analyses, and into the shape this monitor will have on fox and (maybe) light-hpc systems.