facebookincubator / dynolog

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
MIT License

Prometheus Support for Metrics logging. #148

Open briancoutinho opened 1 year ago

TLDR

Dynolog provides system telemetry at Meta as well as in open source environments. This issue proposes adding metric logging through Prometheus, an industry-standard framework for exporting metrics. Prometheus support could also be leveraged by the Meta AI Research SuperCluster and other clusters built on open source infrastructure.

Prometheus

Prometheus is an open source tool for collecting and publishing metrics. It can be used to monitor metrics remotely, graph them, and integrate with Grafana for visualization.
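For context, a Prometheus server scrapes metrics over HTTP in a plain-text exposition format. The sketch below shows roughly what a single gauge looks like in that format, using only the C++ standard library. The function `renderGauge` and the metric and label names used with it are hypothetical and are not part of Dynolog or of any client library. In practice a client library would handle this serialization, including label escaping, which this sketch skips.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: renders one gauge sample in the Prometheus
// text exposition format. Assumes label values need no escaping.
std::string renderGauge(const std::string& name,
                        const std::string& help,
                        const std::map<std::string, std::string>& labels,
                        double value) {
  std::ostringstream out;
  // Metadata lines describing the metric.
  out << "# HELP " << name << ' ' << help << '\n';
  out << "# TYPE " << name << " gauge\n";
  // Sample line: name, optional {label="value",...} set, then the value.
  out << name;
  if (!labels.empty()) {
    out << '{';
    bool first = true;
    for (const auto& kv : labels) {
      if (!first) {
        out << ',';
      }
      out << kv.first << "=\"" << kv.second << '"';
      first = false;
    }
    out << '}';
  }
  out << ' ' << value << '\n';
  return out.str();
}
```

Calling `renderGauge("cpu_util", "CPU utilization percent", {{"host", "node0"}}, 42.5)` yields the two `# HELP` and `# TYPE` comment lines followed by the sample line `cpu_util{host="node0"} 42.5`.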

Implementation

We can leverage the prometheus-cpp library (https://github.com/jupp0r/prometheus-cpp/), which is straightforward to use.