NAICNO / Jobanalyzer

Message-based transport of sonar data #294

Closed lars-t-hansen closed 8 months ago

lars-t-hansen commented 10 months ago

Without changing sonar we can move from disk-based to message-based transport. This will aid deployment on many systems and I think it's right to do this sooner rather than later.

Currently sonar prints its output to stdout and the shell script writes it to disk. I think that instead the shell script should pipe it into a program I've written (*) called exfiltrate, which pushes it to an agent off-system (by HTTP POST currently, but it could be MQTT, see #285). We'd want to run this agent on the VM that's hosting the website. The agent would store the data locally and run analyses on it locally; this way we'd get rid of the cron jobs running on moneypenny and login-1.fox.
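For illustration only, here is a rough Python sketch of the sending side of that pipeline: read whatever sonar wrote to stdout and push it off-system in one HTTP POST. This is not the actual exfiltrate program. The URL, the `text/csv` content type, and the names `build_post` and `post_stream` are all assumptions made up for the example.

```python
import urllib.request

# Hypothetical endpoint: the real host, port and path are not given in this thread.
INGEST_URL = "http://ingest.example.org:8080/sonar"


def build_post(payload: bytes, url: str = INGEST_URL) -> urllib.request.Request:
    """Wrap one batch of sonar output in an HTTP POST request."""
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        # Assumption: the payload is CSV text; adjust to the real format.
        headers={"Content-Type": "text/csv"},
    )


def post_stream(stream, url: str = INGEST_URL, opener=urllib.request.urlopen) -> bool:
    """Read everything from a binary stream and POST it; True means delivered."""
    payload = stream.read()
    if not payload:
        return True  # nothing to send counts as success
    # urlopen raises on network errors and on 4xx/5xx responses.
    with opener(build_post(payload, url), timeout=30) as resp:
        return 200 <= resp.status < 300
```

In the shell script this would sit at the end of the pipe, with a small wrapper calling `post_stream(sys.stdin.buffer)`. The `opener` parameter is only there so the function can be exercised without a network.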

This system is a little more brittle than the shared disk but in the long run it's what we want anyway, so let's just bite the bullet and work on resilience in the implementation.
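One form that resilience could take is retrying a failed push with exponential backoff. The sketch below is an assumption about the general shape, not what exfiltrate does; `send_with_retry` and its parameters are hypothetical names, and `send` stands for any function that delivers one payload and returns True on success.

```python
import time
from typing import Callable


def send_with_retry(
    send: Callable[[bytes], bool],
    payload: bytes,
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver payload up to `attempts` times, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            if send(payload):
                return True
        except OSError:
            # Network failures surface as OSError (urllib.error.URLError is a subclass).
            pass
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
    return False
```

If this returns False the caller still has the payload, so it could spool it to local disk and resend on the next run rather than dropping the sample.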

I think I'd like to do this before I do anything about moving sonar out of my account on the ML nodes, hence making this block M1.

(*) The program is working but it is missing a couple of optimizations and some resilience; both will be important.

The basic transport is working now. Remaining task list:

On the sending side

On the receiving side

Transfer the ML nodes to the new system:

Once that has baked for a while, we can do the same on Fox, though it'll likely be a little simpler.

lars-t-hansen commented 10 months ago

Bringing the ingestion service up and keeping it up will be some systemd thing; see https://chris-vermeulen.com/using-systemd-to-run-a-simple-process-and-keep-it-up/ and https://wiki.archlinux.org/title/systemd/User for starters, though some thinking is required to find the best form of it. Is it a user service or a system service?

lars-t-hansen commented 10 months ago

Re "Figure out how to get info about violators and hogs into the RT queue, now that MAILTO is not going to work well for that - we want MAILTO for cron failures, not for desired output": the most sensible thing to do would probably be to just add a -mailto option to naicreport, e.g. -mailto itf-ai-support@usit.uio.no. This would email all non-error output to that address with a suitable From: and Subject: line.
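To make the idea concrete, here is a small Python sketch of what such an option would have to do with the report text: build a message with From:, To: and Subject: set, and skip sending when there is no output. It is not naicreport code, and `build_report_mail`, the sender address and the subject string used below are hypothetical.

```python
from email.message import EmailMessage
from typing import Optional


def build_report_mail(
    report: str, to_addr: str, from_addr: str, subject: str
) -> Optional[EmailMessage]:
    """Return a message carrying the report, or None if there is nothing to say."""
    if not report.strip():
        return None  # empty report: no mail, so the queue only sees real findings
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(report)
    return msg
```

The result could be handed to `smtplib.SMTP("localhost").send_message(msg)`, assuming a local mail relay is available. Errors would keep going to stderr so that cron's MAILTO still catches failures.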

lars-t-hansen commented 8 months ago

A complication with the educloud VM is that, though it is "open to the world", the available TCP ports are very, very restricted and the list is controlled by Central Services. In particular, the ports I've been using are verboten. For the time being I can work around that by using the ports that are left open for Oracle use. In the slightly longer term it'll (probably) be necessary to set up nginx proxies to forward requests to the services that are open on the machine but not visible from the outside.

lars-t-hansen commented 8 months ago

See https://github.com/NAICNO/Jobanalyzer/pull/331.

lars-t-hansen commented 8 months ago

I'm spinning off the systemd task and the remove-the-old-files task into separate bugs. Otherwise we're done: sonar on the ML nodes now stores no data on local disk and instead exfiltrates everything to naic-monitor.