OSC / ood-documentation

Documentation for Open OnDemand generated using Sphinx
https://osc.github.io/ood-documentation/latest/
MIT License

updates around monitoring #780

Open johrstrom opened 1 year ago

johrstrom commented 1 year ago

Some suggestions from Alan -

  1. Is it possible, and of value to the community, to add a section under the "Logging" section called "Splunk Analytics" that provides some of the example Splunk queries / reports we use for the monthly meetings?
  2. Can we add a section that cross-links to the XDMoD Integration section, "https://osc.github.io/ood-documentation/develop/customizations.html#xdmod-integration"? I think that section currently focuses only on what shows up inside of OOD. Maybe we need a few screenshots showing what types of info are available within the XDMoD interface, not at the individual user job level but at the system level.
  3. We previously used Nagios instead of Prometheus. I assume there is still some documentation out there about that? Lots of sites still use Nagios, so if we have any docs it would be nice to include them too.


treydock commented 1 year ago

It would also be good to cross-reference the XDMoD module that parses OnDemand usage logs: https://ondemand.xdmod.org/10.0/overview.html. There is some tooling at OSC to make that process easier; most of it is here: https://github.com/treydock/puppet-module-xdmod/tree/master/templates/ondemand

WRT Nagios, that's actually a lot harder to document in a useful way than Prometheus. With Prometheus it's pretty easy to share alerts and configs with other sites, but Nagios is configured in so many different ways, and the way OSC used it was kind of complicated and hard to follow. The most we might have done was monitor that Apache was online, nothing really specific to OnDemand itself.
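
For sites that still use Nagios, an "is Apache online" check of that kind might look roughly like the sketch below. This is not OSC's actual configuration, which isn't shown in this thread. The function names are hypothetical, and the only convention relied on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It treats any HTTP response below 500 as "up", on the assumption that an authentication challenge or redirect still proves the web server is answering.

```python
"""Nagios-style check that a web server answers over HTTP (illustrative sketch)."""
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just with an error status (e.g. 401 or 500).
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or a malformed URL.
        return None


def classify(status):
    """Map an HTTP status (or None) to an (exit_code, message) pair."""
    if status is None:
        return CRITICAL, "CRITICAL - no HTTP response"
    if status >= 500:
        return CRITICAL, f"CRITICAL - HTTP {status}"
    # Any other response, including a 401 auth challenge, means Apache is up.
    return OK, f"OK - HTTP {status}"


def main(argv):
    """Run the check for the URL in argv[1]; return the Nagios exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_apache_online.py URL")
        return UNKNOWN
    code, message = classify(fetch_status(argv[1]))
    print(message)
    return code
```

To use it as a plugin, call `sys.exit(main(sys.argv))` under an `if __name__ == "__main__":` guard and point the Nagios command definition at the script with the portal URL as its argument. As noted above, this only shows that Apache is answering; it says nothing specific about OnDemand itself.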