aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
205 stars 86 forks

pin prometheus to version 2.55.1 #504

Closed nghtm closed 1 week ago

nghtm commented 1 week ago

Issue #, if available: Prometheus 3.0, released 11/14, removes the ability to use the --enable-feature=agent --storage.agent.path="/opt/prometheus/data-agent" flags, which are set in the prometheus.service file written by install-prometheus.sh (executed on the HP controller node).

This leads to an error when the latest Prometheus version (3.0.0) is pulled: the --enable-feature=agent flag is not recognized.

Two alternatives were explored; going with the lower-risk option 2 until option 1 is better understood.

1/ Continue pulling the latest version and replace the agent flags with --storage.tsdb.path="/opt/prometheus/data". This was confirmed to work and installs Prometheus successfully. However, removing agent mode means Prometheus keeps a full local TSDB on the controller, which risks filling up its disk. Until the risks of this new behavior are better understood, this option is too risky.

2/ Pin the Prometheus version to 2.55.1, which is known to work at 1000-node cluster scale.

Both options have been validated, but option 2 is less risky.
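
For reference, the flag difference between the two options can be sketched as a small shell helper. This is only an illustration: the function name storage_flags_for_version is hypothetical and does not exist in install-prometheus.sh, and the flag strings are simply the ones quoted in this description.

```shell
# Hypothetical helper (not part of install-prometheus.sh): pick the storage
# flags for prometheus.service from a bare version string such as "2.55.1".
storage_flags_for_version() {
  local major="${1%%.*}"   # strip everything after the first dot: "2.55.1" -> "2"
  case "$major" in
    # 2.x: agent mode flags currently used by the service file (option 2).
    2) echo '--enable-feature=agent --storage.agent.path="/opt/prometheus/data-agent"' ;;
    # 3.x: local TSDB path, as described in option 1.
    3) echo '--storage.tsdb.path="/opt/prometheus/data"' ;;
    *) echo "unhandled Prometheus version: $1" >&2; return 1 ;;
  esac
}
```

Calling storage_flags_for_version "2.55.1" prints the agent flags, while a 3.x version prints the TSDB path flag from option 1. The helper expects a version without a leading "v".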

Description of changes: pin the Prometheus version to 2.55.1 instead of pulling the latest release.
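
A minimal sketch of what pinning can look like in a shell install script follows. The variable and function names (PROMETHEUS_VERSION, prometheus_arch, prometheus_url, download_prometheus) are assumptions made for illustration and are not taken from install-prometheus.sh. The URL follows the naming pattern of Prometheus GitHub release tarballs.

```shell
# Sketch only: names below are hypothetical, not copied from install-prometheus.sh.

# Pinned release; 2.55.1 is the version named in this PR.
PROMETHEUS_VERSION="2.55.1"

# Map `uname -m` output to the architecture suffix used in release tarball names.
prometheus_arch() {
  case "$1" in
    x86_64)        echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the GitHub release URL for a given version and architecture.
prometheus_url() {
  local version="$1" arch="$2"
  echo "https://github.com/prometheus/prometheus/releases/download/v${version}/prometheus-${version}.linux-${arch}.tar.gz"
}

# Download the pinned tarball into a directory. Defined here but never called
# automatically, so loading this file performs no network access.
download_prometheus() {
  local dest="$1" arch
  arch="$(prometheus_arch "$(uname -m)")" || return 1
  curl -fsSL -o "${dest}/prometheus.tar.gz" \
    "$(prometheus_url "${PROMETHEUS_VERSION}" "${arch}")"
}
```

With this shape, upgrading later is a one-line change to PROMETHEUS_VERSION, and download_prometheus /tmp would fetch the pinned tarball rather than whatever release is newest.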

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

gmgtamz commented 1 week ago

tested the download of the specific version, LGTM