buildkite / buildkite-agent-metrics

A command-line tool (and Lambda) for collecting Buildkite agent metrics
MIT License

stackdriver: Crash when transient error or rate limiting happens. #89

Open philwo opened 5 years ago

philwo commented 5 years ago

Spotted today in our logs:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.

We can work around this by having systemd restart the daemon on failure. Note to others: it's important to set RestartSec=10 (or higher) in the systemd unit file; otherwise the daemon will immediately crash again, because Stackdriver is still rate limiting the updates:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: buildkite-agent-metrics@testing.service: Failed with result 'exit-code'.
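For reference, the restart workaround described above could look roughly like the following drop-in. This is a minimal sketch, not the unit file actually used here: the drop-in path and file name are assumptions derived from the `buildkite-agent-metrics@testing.service` unit name in the logs, and `Restart=on-failure` is one plausible way to get "restart on failure". Only `RestartSec=10` comes from the report itself.

```ini
# Hypothetical drop-in: /etc/systemd/system/buildkite-agent-metrics@.service.d/restart.conf
[Service]
# Restart the daemon when it exits with a non-zero status (assumed setting).
Restart=on-failure
# Wait at least 10 seconds before restarting, so the next write does not
# land inside the same sampling period and fail again.
RestartSec=10
```

After adding a drop-in like this, `systemctl daemon-reload` is needed for systemd to pick it up.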

(The daemon should probably also handle these rate limiting errors better.)
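One way the daemon could handle these errors better is to treat a failed write as non-fatal and to avoid writing the same metric more often than some minimum interval. The sketch below shows that logic in Python for illustration only; it is not the project's code (the daemon itself is written in Go), and `ThrottledWriter`, `submit`, and the 10-second default (borrowed from the `RestartSec=10` workaround above, not from a documented Stackdriver limit) are all hypothetical.

```python
import logging
import time
from typing import Callable, Dict

log = logging.getLogger("collect")


class ThrottledWriter:
    """Wraps a metric-writing callable so that failures are logged, not fatal.

    Hypothetical helper: `write` stands in for whatever function sends a
    single point to the metrics backend.
    """

    def __init__(
        self,
        write: Callable[[str, int], None],
        min_interval: float = 10.0,  # assumed value, see lead-in
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._write = write
        self._min_interval = min_interval
        self._clock = clock
        self._last_write: Dict[str, float] = {}

    def submit(self, metric: str, value: int) -> bool:
        """Try to write one point; return True only if it was written."""
        now = self._clock()
        last = self._last_write.get(metric)
        if last is not None and now - last < self._min_interval:
            # Too soon after the previous successful write: skip this point
            # instead of sending a request that is likely to be rejected.
            log.debug("skipping metric [%s]: written %.1fs ago", metric, now - last)
            return False
        try:
            self._write(metric, value)
        except Exception as err:
            # Log and carry on; the next collection cycle can try again.
            log.warning("could not write metric [%s] value [%s]: %s", metric, value, err)
            return False
        self._last_write[metric] = now
        return True
```

A collection loop would call `submit` once per metric per cycle and simply continue when it returns `False`, so a rejected point costs one missed sample rather than a crashed process.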

dmoxyeze commented 7 months ago

Hi @philwo, could you please share how you got this to work in GCP? I am also using GCP, but I can't get my Horizontal Pod Autoscaler to work using the metrics from buildkite-agent-metrics.

philwo commented 7 months ago

Hi @dmoxyeze,

I'm sorry, I don't remember if I ever got this to work reliably. Whatever I tried at the time definitely didn't work well for auto-scaling. I think that was also because auto-scaling on GCP at the time didn't offer a good way to signal which machines were "safe to shut down", so it often picked ones that were still running jobs.

Sorry that I can't be of more help here, hope you can figure it out!

Philipp

dmoxyeze commented 7 months ago

Hi Philipp,

Thanks for the help all the same.

Best regards, Success.
