gardener / gardener-extension-provider-gcp

Gardener extension controller for the GCP cloud provider (https://cloud.google.com).
https://gardener.cloud
Apache License 2.0

Machine with error `ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS` is not flagged as user error #766

Open AleksandarSavchev opened 1 month ago

AleksandarSavchev commented 1 month ago

How to categorize this issue?

/area ops-productivity /kind bug /platform gcp

What happened: With @ialidzhikov we found that a machine error such as

status:
  currentStatus:
    lastUpdateTime: "2024-05-21T12:57:28Z"
    phase: CrashLoopBackOff
  lastOperation:
    description: 'Cloud provider message - machine codes error: code = [ResourceExhausted]
      message = [Create machine "machine"
      failed: The zone ''zone'' does not have
      enough resources available to fulfill the request.  ''(resource type:compute)''.]'
    errorCode: ResourceExhausted
    lastUpdateTime: "2024-05-21T12:57:28Z"
    state: Failed
    type: Create

is not properly categorised as a user error, although it should be matched by the pattern at https://github.com/gardener/gardener-extension-provider-gcp/blob/c32ba9eec28e676d15de9c55b08b78ff0235c0b5/pkg/apis/gcp/helper/error_codes.go#L16

However, ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS is replaced by ResourceExhausted here: https://github.com/gardener/machine-controller-manager-provider-gcp/blob/295ac09467c51746f87762130a25934be202df68/pkg/gcp/machine_controller_util.go#L440-L446
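To illustrate the mismatch, here is a minimal sketch in Go. It assumes the extension's pattern boils down to a case-insensitive match on the provider token; the actual expression in error_codes.go may contain more alternatives. The names `resourceExhaustedRegexp` and `matchesUserError`, as well as the `raw` message, are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumption: a simplified stand-in for the extension's pattern.
// Only the token named in this issue is included.
var resourceExhaustedRegexp = regexp.MustCompile(`(?i)ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS`)

// matchesUserError reports whether the message still contains the provider token.
func matchesUserError(msg string) bool {
	return resourceExhaustedRegexp.MatchString(msg)
}

func main() {
	// Message as it appears in the Machine status quoted above.
	reported := `Cloud provider message - machine codes error: code = [ResourceExhausted] ` +
		`message = [Create machine "machine" failed: The zone 'zone' does not have ` +
		`enough resources available to fulfill the request.  '(resource type:compute)'.]`

	// Hypothetical message that still carries the original provider token.
	raw := `ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: The zone 'zone' does not have ` +
		`enough resources available to fulfill the request.`

	fmt.Println(matchesUserError(reported)) // false: the token was replaced upstream
	fmt.Println(matchesUserError(raw))      // true
}
```

Because the token is rewritten before the message reaches the Machine status, a pattern keyed on the token alone cannot match the reported description.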

What you expected to happen: Error to be properly flagged as user error.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

kon-angelo commented 1 month ago

Do you think it makes sense if, instead of fixing this occurrence, we make the extension worker library aware of MCM error codes and map them to Gardener error codes? WDYT @ialidzhikov?

ialidzhikov commented 1 month ago

Sounds reasonable, at least for the machine-controller-manager error codes that we can map unambiguously - Unauthenticated, PermissionDenied, ResourceExhausted. We can even do it without regex, as the corresponding error code is present in the Machine status (.status.errorCode field). I didn't check whether it is propagated to the MachineDeployment status.
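A rough sketch of what such a regex-free mapping could look like, keyed on the value of .status.errorCode. Everything here is an assumption for illustration: the `ErrorCode` type, the constant values, the `mcmToGardener` table and the `gardenerCodeFor` helper are hypothetical, and which Gardener code each MCM code should map to is still to be decided in this thread.

```go
package main

import "fmt"

// ErrorCode is a hypothetical stand-in for a Gardener error code type.
type ErrorCode string

// Assumed target codes; the real constants and their names may differ.
const (
	ErrUnauthenticated   ErrorCode = "ERR_INFRA_UNAUTHENTICATED"
	ErrUnauthorized      ErrorCode = "ERR_INFRA_UNAUTHORIZED"
	ErrResourcesDepleted ErrorCode = "ERR_INFRA_RESOURCES_DEPLETED"
)

// mcmToGardener covers only the MCM codes named above as unambiguous.
var mcmToGardener = map[string]ErrorCode{
	"Unauthenticated":   ErrUnauthenticated,
	"PermissionDenied":  ErrUnauthorized,
	"ResourceExhausted": ErrResourcesDepleted,
}

// gardenerCodeFor looks up the Machine's errorCode value directly.
// The second return value is false for codes that have no unambiguous mapping.
func gardenerCodeFor(mcmCode string) (ErrorCode, bool) {
	code, ok := mcmToGardener[mcmCode]
	return code, ok
}

func main() {
	// errorCode value taken from the Machine status quoted in this issue.
	if code, ok := gardenerCodeFor("ResourceExhausted"); ok {
		fmt.Println(code)
	}

	// Codes outside the table are left to the existing message-based handling.
	_, ok := gardenerCodeFor("Internal")
	fmt.Println(ok)
}
```

A plain lookup table keeps the mapping explicit and avoids depending on the wording of provider messages; codes that are not listed simply fall through.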