Documentation on monitoring

johrstrom commented 4 years ago

We have no documentation on how to monitor OOD, not even references to our own Ganglia or Prometheus exporter or to base Apache monitoring.


treydock commented 4 years ago

Once https://github.com/OSC/ondemand/pull/400 is merged, it will add support for Grafana, and we can then document both Grafana and Ganglia. That only covers integrating with monitoring, though.

The existing OnDemand-specific monitoring is mostly around PUNs and Apache connections. The rest isn't specific to OnDemand; it is just checking that filesystems aren't full, ports are open, Apache responds to requests, and certificates aren't expired. With Prometheus we can also more easily monitor memory and CPU levels to keep an eye out for spikes on the OnDemand host. I suppose we can cover what we provide as well as ideas of what else to monitor.
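
To make the generic checks above concrete, here is a minimal Python sketch of three of them (filesystem fullness, port reachability, and certificate expiry). It is only an illustration using the standard library, not something OnDemand ships; the function names are hypothetical, and any thresholds you alert on would be site choices.

```python
import shutil
import socket
import ssl
import time


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time.

    `not_after` is a string in the format used by the 'notAfter' field of
    ssl.SSLSocket.getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


if __name__ == "__main__":
    # Example run against the local host; paths and ports are placeholders.
    print("root filesystem used: {:.0%}".format(disk_used_fraction("/")))
    print("port 443 open on localhost:", port_open("localhost", 443, timeout=1.0))
```

A real deployment would more likely express these as Prometheus exporter metrics or existing monitoring-system checks; the sketch only shows what each check measures.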

ericfranz commented 4 years ago

I'd like to see our approach to this redesigned first. The idea is mentioned to a degree in https://github.com/OSC/ood-documentation/issues/235, but we would change the app so that the AJAX request for the job details returns HTML for the job details pane instead of JSON. Once HTML rendering is done server side, we could use an approach similar to view template partials, with a default that we can override with a custom one in /etc/ood/config. That way we could embed site-specific logic, for example, "if this job is on pitzer and the job's native attribute has something about gpus, let's display graphs for GPUs". That is something the current abstraction in the app doesn't support.

At that point we could move the bulk of our custom logic to these custom views, removing this functionality from ActiveJobs. That is when I would like to document this feature.

We could add this work to the 1.8 release.
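
The override idea described above can be sketched roughly as follows. This is a hypothetical Python illustration of the lookup and the site-specific rule only; the real app is not written in Python, and the function names, the directory arguments, and the job fields (`cluster`, `native`) are assumptions made for the example.

```python
from pathlib import Path


def resolve_partial(name, custom_dir, default_dir):
    """Pick the site's custom template if it exists, else the app default.

    `custom_dir` stands in for a location under /etc/ood/config; the exact
    subdirectory is not specified in this thread, so it is a parameter here.
    """
    custom = Path(custom_dir) / name
    if custom.is_file():
        return custom
    return Path(default_dir) / name


def panels_for(job):
    """Example of site-specific logic a custom view could embed.

    `job` is a hypothetical dict; GPU graphs are shown only for pitzer jobs
    whose native attribute mentions GPUs.
    """
    panels = ["cpu", "memory"]
    native = str(job.get("native", "")).lower()
    if job.get("cluster") == "pitzer" and "gpu" in native:
        panels.append("gpu")
    return panels


if __name__ == "__main__":
    job = {"cluster": "pitzer", "native": "--gpus-per-node=2"}
    print(panels_for(job))
```

The point of the design is that the default view stays in the app while a site drops in its own file to change what is rendered, without patching the app itself.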
