NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0

Add capability to publish metrics to prometheus #2684

Open chesterxgchen opened 1 month ago

chesterxgchen commented 1 month ago

Description

One of the feature requests is to add system metrics so that FLARE runtime metrics can be monitored via Prometheus + Grafana or other monitoring systems.

In this PR, we propose components that listen to the ReservedTopic.APP_METRICS topic and publish the metrics to a metrics server for Prometheus scraping. The plugin is optional, so the system can run with or without it. This PR doesn't define the metrics themselves; it adds the capability to easily publish metrics. To illustrate this capability, we define a few metrics such as get_task and submit_update to make sure it works as expected.


Here are the pieces that make this work:

1) MetricsCollector: this component subscribes a callback for the ReservedTopic.APP_METRICS topic on the DataBus and is called back whenever that topic is published.

In the callback, the MetricsCollector simply posts the received metrics from the DataBus to the Prometheus HTTP metrics server's /update_metrics endpoint.
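
A minimal sketch of such a collector, assuming the DataBus subscribe/publish API used elsewhere in this PR; the import paths, callback signature, class name and /update_metrics URL below are assumptions for illustration, not the PR's actual code:

    import json
    import urllib.request

    from nvflare.apis.fl_constant import ReservedTopic    # APP_METRICS topic introduced by this PR
    from nvflare.fuel.data_event.data_bus import DataBus  # assumed import path


    class MetricsCollectorSketch:
        def __init__(self, metrics_server_url: str = "http://localhost:8001/update_metrics"):
            self.metrics_server_url = metrics_server_url
            self.data_bus = DataBus()
            # Register a callback that fires whenever APP_METRICS is published on the DataBus.
            self.data_bus.subscribe([ReservedTopic.APP_METRICS], self.on_metrics)

        def on_metrics(self, topic, metrics_data, data_bus):
            # Forward the received metrics dict to the metrics server's /update_metrics endpoint.
            req = urllib.request.Request(
                self.metrics_server_url,
                data=json.dumps(metrics_data).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            urllib.request.urlopen(req)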

2) Prometheus HTTP metrics server: we developed a custom handler that takes each newly updated metric, dynamically defines it if it does not exist yet, and updates its value. The Prometheus client library keeps every Prometheus metric in a REGISTRY, and start_http_server (which comes with the Prometheus client library) automatically publishes the REGISTRY at the /metrics endpoint for the Prometheus server to scrape.
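
The sketch below illustrates this idea with the prometheus_client library. For simplicity it serves /metrics and /update_metrics on two separate ports and uses a Gauge for every metric, whereas the PR describes a single custom handler, so treat it as an outline rather than the actual implementation:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    from prometheus_client import Gauge, start_http_server

    _gauges = {}  # dynamically defined metrics, keyed by metric name


    def update_metric(name: str, value: float):
        # Define the metric on first use, then update its value.
        # prometheus_client keeps every metric in its default REGISTRY, which
        # start_http_server() exposes at /metrics automatically.
        if name not in _gauges:
            _gauges[name] = Gauge(name, f"auto-defined metric {name}")
        _gauges[name].set(value)


    class UpdateMetricsHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/update_metrics":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            metrics = json.loads(self.rfile.read(length) or b"{}")
            for name, value in metrics.items():
                update_metric(name, value)
            self.send_response(200)
            self.end_headers()


    if __name__ == "__main__":
        start_http_server(8000)  # exposes /metrics for Prometheus scraping
        HTTPServer(("", 8001), UpdateMetricsHandler).serve_forever()  # receives metric updates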

With these two parts, any metrics record can be made known to Prometheus simply by publishing it:

    self.data_bus.publish([ReservedTopic.APP_METRICS], metrics_data)

Note: once the metrics are published to the /metrics endpoint, the Prometheus server retrieves them (scrapes, i.e. HTTP GET) from /metrics and displays them in the Prometheus UI (default port 9090). Prometheus can in turn be used as a Grafana data source for visualization. All we need to do is start the Prometheus server (./prometheus) and start Grafana with some configuration, and we can visualize the metrics in Grafana.
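
For reference, a minimal Prometheus scrape configuration for such a setup could look like the following; the job name, scrape interval and target port are assumptions and should match however the metrics server is actually deployed:

    # prometheus.yml (illustrative)
    scrape_configs:
      - job_name: "nvflare_metrics"
        scrape_interval: 15s
        static_configs:
          - targets: ["localhost:8000"]   # the metrics server's /metrics port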

3) To help collect count, error count and time taken, we defined a time-collection context, CollectTimeContext. This is independent of Prometheus. We can use it like this:

    try:
        with CollectTimeContext() as context:
            ...  # your normal code goes here
    finally:
        self.publish_app_metrics(context.metrics, metrics_group)

    def publish_app_metrics(self, metrics: dict, metric_group: str):
        # Flatten the metrics into "<group>_<name>: value" pairs before publishing.
        metrics_data = {}
        for metric_name in metrics:
            label = f"{metric_group}_{metric_name}"
            metrics_value = metrics.get(metric_name)
            metrics_data.update({label: metrics_value})

        # Hand the flattened metrics to the DataBus; the MetricsCollector picks them up.
        self.data_bus.publish([ReservedTopic.APP_METRICS], metrics_data)

CollectTimeContext collects the count, error_count and time_taken metric values for each action. Before publishing, we flatten them into label/value pairs, as publish_app_metrics above does.
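
For context, a context manager along these lines could be sketched as follows; this is an illustrative sketch based on the description above, not the PR's actual CollectTimeContext implementation:

    import time


    class CollectTimeContextSketch:
        def __init__(self):
            # One record per use: call count, error count and elapsed time in seconds.
            self.metrics = {"count": 0, "error_count": 0, "time_taken": 0.0}

        def __enter__(self):
            self._start = time.time()
            self.metrics["count"] += 1
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.metrics["time_taken"] = time.time() - self._start
            if exc_type is not None:
                self.metrics["error_count"] += 1
            return False  # do not suppress exceptions from the wrapped code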


Types of changes