Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
Dynolog provides system telemetry at Meta as well as in open source environments. It supports metric logging through Prometheus, an industry-standard framework for logging and exporting metrics. This can also be leveraged by Meta's AI Research SuperCluster and by other clusters built on open source infrastructure.
Prometheus
Prometheus is an open source tool for metrics collection and publishing. One can use it to monitor metrics remotely, graph them, and integrate with Grafana for visualization.
A core concept in Prometheus is its data model. It consists of labels, a list of attributes of the entity to associate with the metric (e.g. {nodename, gpu id}), and metrics, numerical values that represent points in a time series.
The Prometheus server runs on the box or node. Typically, it uses a pull model: the server periodically fetches the latest values of the metrics and their labels. (Visualized in the diagram above.)
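To make the data model concrete, here is a minimal sketch in standard C++ of one labeled sample and how it could be rendered in the style of the Prometheus text format. The `Labels` and `Sample` types and the `render` function are hypothetical names chosen for illustration; they are not part of Dynolog or prometheus-cpp. The label is written `gpu_id` because Prometheus label names cannot contain spaces.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical types for illustration only.
using Labels = std::map<std::string, std::string>;

struct Sample {
  std::string name;  // metric name, e.g. "gpu_utilization"
  Labels labels;     // attributes identifying the entity being measured
  double value;      // numerical value at one point in the time series
};

// Render one sample as: name{key1="v1",key2="v2"} value
// Simplification: label values are not escaped here.
std::string render(const Sample& s) {
  assert(!s.name.empty());
  std::ostringstream out;
  out << s.name;
  if (!s.labels.empty()) {
    out << '{';
    bool first = true;
    for (const auto& kv : s.labels) {  // std::map iterates in key order
      if (!first) out << ',';
      out << kv.first << "=\"" << kv.second << '"';
      first = false;
    }
    out << '}';
  }
  out << ' ' << s.value;
  return out.str();
}
```

Each distinct combination of metric name and label values identifies a separate time series, so the same metric reported for two GPUs on one node yields two series.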
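The pull model can be sketched as follows, assuming a simple in-process store: the monitored process only overwrites the latest value of each series, and whoever scrapes receives a snapshot of the current state. `MetricStore`, `set`, and `scrape` are hypothetical names for this sketch and do not correspond to an API in Dynolog or prometheus-cpp.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical in-process store illustrating the pull model.
class MetricStore {
 public:
  // Called by the monitored process whenever a new reading is available.
  // Only the most recent value per series is retained.
  void set(const std::string& series, double value) {
    assert(!series.empty());
    std::lock_guard<std::mutex> lock(mu_);
    latest_[series] = value;
  }

  // Called on each scrape: returns a copy of the latest values.
  std::map<std::string, double> scrape() const {
    std::lock_guard<std::mutex> lock(mu_);
    return latest_;
  }

 private:
  mutable std::mutex mu_;
  std::map<std::string, double> latest_;
};
```

One consequence of this design is that values written between two scrapes are not observed individually; the scraper sees only whatever value is current at scrape time.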
Implementation
We can leverage the prometheus-cpp library (https://github.com/jupp0r/prometheus-cpp/), which is straightforward to use.