centerforaisafety / cerberus-cluster

HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0

Develop OCI metrics live scraper. #175

Closed: ghost closed this issue 4 months ago

ghost commented 1 year ago

To keep the OCI metrics we have in Grafana up to date, write a Kubernetes service/app that wraps the existing OCI-metrics-to-Grafana export code in a scraping app. The base code can scrape any OCI metric, not just File Storage metrics.

Dashboard in mind: https://graphs.cais.tools/d/c5e1cefd-2b43-4302-841d-3cc03dd34c5e/import-oci-metrics

ghost commented 11 months ago

Revamping the code to run as a service, but I first need to backfill from Aug 25 to now, then start the service.
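A minimal sketch of how a backfill like this could be split into fixed-size query windows so each request stays small. The function name `backfill_windows`, the 7-day step, and the example dates (including the year) are assumptions for illustration, not taken from the exporter code.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def backfill_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        # Clamp the final window so it never runs past `end`.
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


if __name__ == "__main__":
    # Hypothetical range: the year is arbitrary, only "Aug 25" comes from the thread.
    start = datetime(2023, 8, 25, tzinfo=timezone.utc)
    end = datetime(2023, 9, 10, tzinfo=timezone.utc)
    for lo, hi in backfill_windows(start, end, timedelta(days=7)):
        print(lo.isoformat(), "->", hi.isoformat())
```

Each window would then be passed to whatever function queries OCI for that time range; adjacent windows share a boundary, so no interval is skipped or queried twice.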

ghost commented 11 months ago

Doing a new run with the updated code: pull the last 90 days of metrics from OCI, convert them to InfluxDB line protocol (LP), and push them to InfluxDB. https://cloud.oracle.com/object-storage/buckets/axvscsfozusv/metrics-exporter/objects?region=us-sanjose-1
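For anyone unfamiliar with the LP step, here is a rough sketch of turning one metric datapoint into an InfluxDB line protocol record. It is not the exporter's actual code: the function name `to_line_protocol`, the single `value` field, and the example metric and tag names are assumptions.

```python
from datetime import datetime, timezone
from typing import Mapping


def _escape(text: str, chars: str) -> str:
    """Backslash-escape the characters line protocol treats as separators."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(
    measurement: str, tags: Mapping[str, str], value: float, timestamp: datetime
) -> str:
    """Render one datapoint as `measurement,tag=val value=<float> <ns timestamp>`."""
    name = _escape(measurement, ", ")
    # Tags are sorted by key, which InfluxDB recommends for write performance.
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    # Line protocol timestamps default to nanoseconds since the Unix epoch.
    ns = int(timestamp.timestamp()) * 1_000_000_000
    return f"{name}{tag_part} value={float(value)} {ns}"


if __name__ == "__main__":
    # Hypothetical datapoint; the metric and tag names are illustrative only.
    point_time = datetime(2023, 8, 25, tzinfo=timezone.utc)
    print(to_line_protocol("FileSystemUsage", {"resourceId": "fs 1"}, 1024, point_time))
```

The sketch truncates timestamps to whole seconds, which should be fine for metrics aggregated at one-minute or coarser intervals.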

ghost commented 11 months ago

The code refresh is done and the graph is updated. The k8s service is written; I just need to iterate on it and put it in prod.

Here's the last 6 months of FSS usage from our pulled metrics. It may look a bit odd in some spots because the last 90 days were pulled again, and the only resolution available that far back is fairly coarse. It's impressive to see FSS usage drop to almost nothing with the WEKA install.

ghost commented 11 months ago

The Kubernetes service is updated. I'm testing it and making sure it's secure.

ghost commented 11 months ago

I'm going to ditch the service in favor of running the Python script that pulls the metrics once a week via a Kubernetes CronJob. There's no need to write and keep a service running all the time.
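A rough sketch of what the weekly CronJob could look like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The job name, image, script name `export_metrics.py`, and the Sunday 03:00 schedule are all placeholders, not the values used in this repo.

```python
import json
from typing import Any, Dict


def weekly_cronjob_manifest(
    name: str, image: str, schedule: str = "0 3 * * 0"
) -> Dict[str, Any]:
    """Build a batch/v1 CronJob manifest that runs the export script on a schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            # Standard cron syntax; the default here means 03:00 every Sunday.
            "schedule": schedule,
            # Skip a run if the previous one is still in progress.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {
                                    "name": name,
                                    "image": image,
                                    # Hypothetical entry point for the export script.
                                    "command": ["python", "export_metrics.py"],
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = weekly_cronjob_manifest(
        "oci-metrics-export", "registry.example.com/oci-metrics-export:latest"
    )
    print(json.dumps(manifest, indent=2))
```

Credentials for OCI and InfluxDB are left out on purpose; in practice they would likely be mounted from a Kubernetes Secret rather than baked into the image.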