DeterminateSystems / hydra-scale-equinix-metal

Scale Equinix Metal builders based on Hydra usage.
MIT License

Equinix Metal API errors aren't properly handled, causing panics #7

Open delroth opened 5 months ago

delroth commented 5 months ago
Feb 13 20:00:00 rhea scale[622512]: Work summary:
Feb 13 20:00:00 rhea scale[622512]: System("aarch64-linux") Small = 34642
Feb 13 20:00:00 rhea scale[622512]: System("aarch64-linux") BigParallel = 61
Feb 13 20:00:00 rhea scale[622512]: System("x86_64-linux") Small = 2934
Feb 13 20:00:00 rhea scale[622512]: System("x86_64-linux") BigParallel = 1
Feb 13 20:00:01 rhea scale[622512]: Creating: HardwarePlan {
Feb 13 20:00:01 rhea scale[622512]:     bid: 2.0,
Feb 13 20:00:01 rhea scale[622512]:     plan: "c3.large.arm64",
Feb 13 20:00:01 rhea scale[622512]:     netboot_url: "https://netboot.nixos.org/dispatch/hydra/hydra.nixos.org/equinix-metal-builders/main/c3-large-arm",
Feb 13 20:00:01 rhea scale[622512]: }
Feb 13 20:00:01 rhea scale[622512]: Creating: HardwarePlan {
Feb 13 20:00:01 rhea scale[622512]:     bid: 2.0,
Feb 13 20:00:01 rhea scale[622512]:     plan: "c3.large.arm64",
Feb 13 20:00:01 rhea scale[622512]:     netboot_url: "https://netboot.nixos.org/dispatch/hydra/hydra.nixos.org/equinix-metal-builders/main/c3-large-arm",
Feb 13 20:00:01 rhea scale[622512]: }
Feb 13 20:00:02 rhea scale[622512]: Creating: HardwarePlan {
Feb 13 20:00:02 rhea scale[622512]:     bid: 2.0,
Feb 13 20:00:02 rhea scale[622512]:     plan: "c3.large.arm64",
Feb 13 20:00:02 rhea scale[622512]:     netboot_url: "https://netboot.nixos.org/dispatch/hydra/hydra.nixos.org/equinix-metal-builders/main/c3-large-arm",
Feb 13 20:00:02 rhea scale[622512]: }
Feb 13 20:00:03 rhea scale[622512]: Error: failed to parse json, here's the raw content: Object {
Feb 13 20:00:03 rhea scale[622512]:     "errors": Array [
Feb 13 20:00:03 rhea scale[622512]:         String("The facility da11 has no provisionable c3.large.arm64 servers at the requested price"),
Feb 13 20:00:03 rhea scale[622512]:     ],
Feb 13 20:00:03 rhea scale[622512]: }
Feb 13 20:00:03 rhea scale[622512]: Caused by:
Feb 13 20:00:03 rhea scale[622512]:     missing field `hostname` at line 1 column 99
Feb 13 20:00:03 rhea scale[622512]: Location:
Feb 13 20:00:03 rhea scale[622512]:     src/device.rs:92:10

This causes the service to exit with a status that can't be distinguished from a "real" error, when a lack of available capacity is really more of an expected condition. The error response from the Equinix Metal API should be recognized as such (instead of surfacing as a JSON parse failure on the missing `hostname` field) and mapped to either a success code or a dedicated exit code that we can filter in our monitoring as a non-critical failure.
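To make the request concrete, here is a rough sketch of the exit-code side, not the project's actual code. It assumes the response has already been split into "device created" vs. "API returned an `errors` array" before this point. `CreateOutcome`, `is_capacity_error`, `exit_status`, and the choice of 75 (the value of `EX_TEMPFAIL` in `sysexits.h`) are all hypothetical names and values picked for illustration. The substring matching is based only on the two messages in the logs in this issue.

```rust
/// Hypothetical result of a create-device call, after the response body has
/// been classified as either a device or an `errors` array.
#[allow(dead_code)]
#[derive(Debug)]
enum CreateOutcome {
    Created { hostname: String },
    ApiErrors(Vec<String>),
}

/// Assumed exit status for "no capacity right now" (EX_TEMPFAIL in sysexits.h).
const EXIT_NO_CAPACITY: u8 = 75;

/// Heuristic based only on the two messages seen in the logs of this issue.
fn is_capacity_error(msg: &str) -> bool {
    msg.contains("no provisionable") || msg.contains("aren't available servers")
}

/// Map an outcome to a process exit status:
/// 0 on success, EXIT_NO_CAPACITY when every reported error is a capacity
/// error, and 1 for anything else (including an empty error list).
fn exit_status(outcome: &CreateOutcome) -> u8 {
    match outcome {
        CreateOutcome::Created { .. } => 0,
        CreateOutcome::ApiErrors(errors)
            if !errors.is_empty() && errors.iter().all(|e| is_capacity_error(e)) =>
        {
            EXIT_NO_CAPACITY
        }
        CreateOutcome::ApiErrors(_) => 1,
    }
}

fn main() {
    let outcomes = vec![
        CreateOutcome::Created { hostname: "example-builder".to_string() },
        CreateOutcome::ApiErrors(vec![
            "There aren't available servers at any facility".to_string(),
        ]),
        CreateOutcome::ApiErrors(vec!["some other API error".to_string()]),
    ];
    // A real binary would return std::process::ExitCode::from(status) from
    // main; this demo only prints the mapping.
    for outcome in &outcomes {
        println!("{:?} -> exit status {}", outcome, exit_status(outcome));
    }
}
```

With a dedicated status like this, monitoring can filter on the exit code, or the unit could list it under systemd's `SuccessExitStatus=` so that capacity shortages don't mark the service as failed.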

delroth commented 5 months ago

Another one:

Feb 13 19:40:00 rhea scale[615235]: Work summary:
Feb 13 19:40:00 rhea scale[615235]: System("aarch64-linux") BigParallel = 61
Feb 13 19:40:00 rhea scale[615235]: System("aarch64-linux") Small = 34677
Feb 13 19:40:00 rhea scale[615235]: System("x86_64-linux") BigParallel = 2
Feb 13 19:40:00 rhea scale[615235]: System("x86_64-linux") Small = 1846
Feb 13 19:40:01 rhea scale[615235]: Creating: HardwarePlan {
Feb 13 19:40:01 rhea scale[615235]:     bid: 2.0,
Feb 13 19:40:01 rhea scale[615235]:     plan: "c3.large.arm64",
Feb 13 19:40:01 rhea scale[615235]:     netboot_url: "https://netboot.nixos.org/dispatch/hydra/hydra.nixos.org/equinix-metal-builders/main/c3-large-arm--big-parallel",
Feb 13 19:40:01 rhea scale[615235]: }
Feb 13 19:40:01 rhea scale[615235]: Error: failed to parse json, here's the raw content: Object {
Feb 13 19:40:01 rhea scale[615235]:     "errors": Array [
Feb 13 19:40:01 rhea scale[615235]:         String("There aren't available servers at any facility"),
Feb 13 19:40:01 rhea scale[615235]:     ],
Feb 13 19:40:01 rhea scale[615235]: }
Feb 13 19:40:01 rhea scale[615235]: Caused by:
Feb 13 19:40:01 rhea scale[615235]:     missing field `hostname` at line 1 column 61
Feb 13 19:40:01 rhea scale[615235]: Location:
Feb 13 19:40:01 rhea scale[615235]:     src/device.rs:92:10
Feb 13 19:40:01 rhea systemd[1]: hydra-scale-equinix-metal.service: Main process exited, code=exited, status=1/FAILURE