Open philwo opened 5 years ago
Spotted today in our logs:
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.
We can work around this by restarting the daemon via systemd on failure. Note to others: it's important to set
RestartSec=10
(or higher) in the systemd unit file, otherwise the daemon will immediately crash again, because of Stackdriver rate limiting the updates:
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.
(The daemon should probably also handle these rate limiting errors better.)
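For anyone applying the same workaround, here is a minimal sketch of what the restart settings could look like as a systemd drop-in. The drop-in path and file name are assumptions (the unit name is inferred from the `buildkite-agent-metrics@testing.service` instance in the logs above), and `Restart=on-failure` is my reading of "restarting the daemon via systemd on failure"; only `RestartSec=10` comes directly from the report.

```ini
# Hypothetical drop-in path, adjust to your setup:
# /etc/systemd/system/buildkite-agent-metrics@.service.d/restart.conf
[Service]
# Restart the daemon when it exits with a non-zero status, as in the logs above.
Restart=on-failure
# Wait at least 10 seconds before restarting, so the next metric write
# does not land within the sampling period that Stackdriver rejected.
RestartSec=10
```

After adding the file, `systemctl daemon-reload` followed by a restart of the service instance should pick up the change.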
Hi @philwo, could you please share how you got this to work in GCP? I am also using GCP, but I can't get my Horizontal Pod Autoscaler to work using the metrics from buildkite-agent-metrics.
Hi @dmoxyeze,
I'm sorry, I don't remember if I ever got this to work reliably. Whatever I tried at the time definitely didn't work well for auto-scaling. I think that was also because auto-scaling on GCP at the time didn't support a good way to signal which machines are "safe to shut down", so it often picked the ones that were still running jobs.
Sorry that I can't be of more help here, hope you can figure it out!
Philipp
Hi Philipp,
Thanks for the help all the same.
Best regards, Success.