aws-deepracer-community / deepracer-for-cloud

Creates an AWS DeepRacer training environment which can be deployed in the cloud, or locally on Ubuntu Linux, Windows or Mac.
MIT No Attribution
325 stars, 176 forks

Add Telegraf/InfluxDB/Grafana compose stack for recording InfluxDB metrics #159

Closed mattcamp closed 2 months ago

mattcamp commented 3 months ago

This PR adds a docker-compose stack which launches three additional services: Telegraf, InfluxDB and Grafana.

The feature is enabled by uncommenting DR_TELEGRAF_HOST and DR_TELEGRAF_PORT in system.env, which will be passed to Robomaker.
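As a rough illustration of what "enabled" means here, the sketch below parses the text of an env file and reports whether both variables are active (uncommented). The parsing rules, the function names and the sample port value are assumptions made for illustration; the scripts in this repo may read system.env differently.

```python
def active_vars(env_text):
    """Return KEY -> value for every uncommented KEY=value line."""
    result = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def telegraf_enabled(env_text):
    """True only when both Telegraf variables are uncommented and non-empty."""
    env = active_vars(env_text)
    return bool(env.get("DR_TELEGRAF_HOST")) and bool(env.get("DR_TELEGRAF_PORT"))


# Hypothetical excerpts; the port value is a placeholder, not taken from the PR.
DISABLED = "# DR_TELEGRAF_HOST=telegraf\n# DR_TELEGRAF_PORT=8092\n"
ENABLED = "DR_TELEGRAF_HOST=telegraf\nDR_TELEGRAF_PORT=8092\n"
```

With these samples, `telegraf_enabled(DISABLED)` is false and `telegraf_enabled(ENABLED)` is true.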

The Telegraf/InfluxDB/Grafana stack can be started using dr-start-influxdb, after which the Grafana web UI can be accessed on port 3000.

On its own, this PR won't enable any additional metrics, but it is a prerequisite for receiving metrics from the updated Robomaker, which is covered by a separate PR.


mattcamp commented 2 months ago

> Looks overall good. What happens if Grafana or Influx-DB is not running if it is enabled? Would it be good to add some kind of pre-requisite of Grafana to the training? Or ensure it is started?

The first hop, from Robomaker to Telegraf, is UDP, so each metric is effectively just fired blindly at a UDP port. If the telegraf container isn't running there won't be any errors; the metrics will simply never arrive. Such are the joys of UDP. This can make it a pain to debug if things aren't set up correctly, but it also means near-zero risk of breaking Robomaker, even with a totally misconfigured setup. The worst you should get is a DNS lookup error if you put something strange in for DR_TELEGRAF_HOST. In nearly all cases, just setting it to telegraf should work fine, as long as the telegraf container is in the same Docker network as Robomaker.
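The fire-and-forget behaviour can be sketched in a few lines of Python. This is not the Robomaker code: the function names, the measurement name and the use of InfluxDB line protocol as the payload are assumptions for illustration. The point is only that sending a datagram succeeds whether or not anything is listening, and that a bad hostname is the one failure the sender can actually see.

```python
import socket
import time


def format_line(measurement, tags, fields, ts_ns=None):
    """Build one line-protocol style record (numeric fields only, no escaping)."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {ts_ns}"


def send_metric(line, host, port):
    """Fire one datagram at host:port.

    Returns False only if the hostname cannot be resolved. A missing
    listener is not detected: UDP gives the sender no delivery feedback.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(line.encode("utf-8"), (host, port))
        except socket.gaierror:
            # The DNS lookup error mentioned above, e.g. a mistyped host name.
            return False
    return True
```

A call such as `send_metric(format_line("dr_demo", {"worker": "0"}, {"reward": 12.5}), "telegraf", port)` would return True even with the telegraf container stopped, provided the name resolves; `dr_demo` and the tag and field names here are made up.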

If telegraf is running but Influx isn't then telegraf will error on container start as it verifies the Influx connection.

Grafana is just a presentation layer above Influx for dashboards. If Influx isn't running then Grafana will report a datasource error.

Only telegraf+influx are required to collect and store metrics.
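If a pre-flight check before training is wanted, something along these lines could work for the TCP-based services. It is a sketch only: the helper names are hypothetical, 8086 is InfluxDB's default HTTP port and 3000 is the Grafana port mentioned above, and whether either is reachable from where the check runs depends on how the compose stack publishes them. Telegraf's UDP listener cannot be probed this way.

```python
import socket


def tcp_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def preflight(services, timeout=1.0):
    """Map each service name to whether its TCP port accepts connections."""
    return {
        name: tcp_port_open(host, port, timeout)
        for name, (host, port) in services.items()
    }
```

For example, `preflight({"influxdb": ("localhost", 8086), "grafana": ("localhost", 3000)})` returns a dict of booleans that a start script could inspect before launching training.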