aws-controllers-k8s / community

AWS Controllers for Kubernetes (ACK) is a project enabling you to manage AWS services from Kubernetes
https://aws-controllers-k8s.github.io/community/
Apache License 2.0

Leftover AWS resources not being deleted #1707

Open brandonphan opened 1 year ago

brandonphan commented 1 year ago

Describe the bug: We are investigating an issue in our deployment system where some AWS resources are left over even though the k8s resources have been deleted. Is there any scenario where this could happen? Our deletion process is simply kubectl delete ns "$NAMESPACE" --wait=false or helm uninstall $CHART_NAME --namespace $NAMESPACE, run through a deployment action on GitHub.

Steps to reproduce: I've managed to catch it once, and the SQS controller logs had no "deleted resource" entry for the queue even though the k8s resource was deleted, which suggests the finalizers were removed somehow. Other than that I haven't been able to reproduce it, only observe it.

Environment

Any insight helps, thanks!

a-hilaly commented 1 year ago

Hi @brandonphan, this might be related to https://github.com/aws-controllers-k8s/community/issues/1696. Do you leave some time for the controller to handle the delete event?

brandonphan commented 1 year ago

@A-Hilaly Excuse my ignorance, I was under the impression that since the finalizer exists on the k8s.aws resources, the controller would have to handle the delete event. If this isn't the case, what is the suggested method to ensure the controller handles the event? Our controllers deal with fairly high volumes of creation/deletion, so this would explain why some resources are handled and others aren't.

a-hilaly commented 1 year ago

Hi @brandonphan! As long as the controller is running and consuming events it will handle them :). Once a resource is scheduled for deletion (metadata.DeletionTimestamp is set) the controller will eventually receive and handle a deletion event. If the resource is deleted successfully, the controller will remove the finalizer (the finalizer is set on the resource once it's created successfully) and the resource will eventually be removed from the api-server records. If I understand correctly, the controller is removing the finalizer from a resource (causing its deletion from the api-server) without deleting it in AWS? Could you please provide the controller version and maybe a few logs showing the issue? I'd like to help you investigate this.
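The deletion flow described above can be sketched in a few lines of Go. This is a simplified model under stated assumptions, not the ACK runtime code: `Object`, `reconcileDelete`, `errThrottled` and the finalizer string `example.com/cleanup` are all hypothetical names, and the `Deleting` flag stands in for a non-nil metadata.DeletionTimestamp.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizerName is a placeholder, not the real ACK finalizer string.
const finalizerName = "example.com/cleanup"

// errThrottled is a made-up transient failure used by the demo in main.
var errThrottled = errors.New("throttled")

// Object is a stripped-down stand-in for a Kubernetes custom resource.
type Object struct {
	Name       string
	Deleting   bool // stands in for metadata.deletionTimestamp being set
	Finalizers []string
}

// removeFinalizer returns a copy of finalizers without name.
func removeFinalizer(finalizers []string, name string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept
}

// reconcileDelete runs one reconcile pass for obj. deleteExternal deletes the
// backing cloud resource. The result is true once no finalizers remain, which
// is the point at which the API server may drop the object.
func reconcileDelete(obj *Object, deleteExternal func(name string) error) (bool, error) {
	if !obj.Deleting {
		return false, nil // not scheduled for deletion, nothing to do
	}
	if err := deleteExternal(obj.Name); err != nil {
		// Keep the finalizer so the object stays around and the pass is retried.
		return false, err
	}
	obj.Finalizers = removeFinalizer(obj.Finalizers, finalizerName)
	return len(obj.Finalizers) == 0, nil
}

func main() {
	obj := &Object{Name: "ackqueue-0", Deleting: true, Finalizers: []string{finalizerName}}

	// First pass: the external delete fails, so the finalizer is kept.
	done, err := reconcileDelete(obj, func(string) error { return errThrottled })
	fmt.Println(done, err, obj.Finalizers)

	// Second pass: the external delete succeeds, so the finalizer is removed.
	done, err = reconcileDelete(obj, func(string) error { return nil })
	fmt.Println(done, err, obj.Finalizers)
}
```

In this model the finalizer is only ever dropped after a successful external delete, which is why a leftover AWS resource with no matching k8s object is surprising.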

jessebye commented 1 year ago

Hey @A-Hilaly ,

I can confirm what @brandonphan is saying. We are currently seeing this with sqs-controller v0.0.3.

Logs are below. You can see at first it deletes and finalizes many of the resources successfully, but then a queue extra-vicuna-api-send-security-price-update-to-domo-dlq is deleted from AWS, but the finalizers are left by sqs-controller and it starts to throw an error.

2023-03-15T21:50:24.075Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-update-account-discretionary-type-command-dlq", "generation": 2}
2023-03-15T21:50:24.179Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-sync-account-command-dlq.fifo", "generation": 2}
2023-03-15T21:50:24.271Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-security-monthly-returns-updated-event-dlq", "generation": 2}
2023-03-15T21:50:24.349Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-risk-number-details-report-dlq", "generation": 2}
2023-03-15T21:50:24.467Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-provision-enterprise-user-command-dlq", "generation": 2}
2023-03-15T21:50:24.555Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-update-account-discretionary-type-command", "generation": 2}
2023-03-15T21:50:24.754Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-trading-activation-dlq", "generation": 2}
2023-03-15T21:50:24.852Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-sso-bulk-add-user-command", "generation": 2}
2023-03-15T21:50:25.066Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-ra-compare-command-dlq", "generation": 2}
2023-03-15T21:50:25.204Z    INFO    ackrt   deleted resource    {"account": "REDACTED", "role": "", "region": "us-east-2", "kind": "Queue", "namespace": "extra-vicuna", "name": "extra-vicuna-api-finish-account-creation-command-dlq", "generation": 2}
2023-03-15T21:50:25.398Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 88b3f7c3-82cd-55fe-b003-ab4fbd2caa07"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:27.000Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 8dbe045c-9179-58a1-95e2-58d0ae9bf758"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:28.245Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 857f827f-d9e5-52ea-9d05-def3e601dd13"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:28.639Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: befa0fdd-cccf-5fb4-8d73-5351f32b504d"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:28.817Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 8ee9c651-214d-5f38-95ca-9add1ea970c1"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:29.159Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 0fb7d15c-f26b-5efc-a263-8dd0cd8ea105"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:29.820Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 96660f8d-d296-513b-b747-5301ed737ec2"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:31.125Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 8054e576-f873-55d5-b66a-b9fac7946a2c"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:33.706Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 6f6e1139-85df-52b1-82df-4db5ba36cd48"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:38.849Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 2d7a6c86-c475-57ee-80b3-300afb9b6817"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:50:49.117Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: d59f5771-d90c-5d45-8836-0bfde6f512bc"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:51:09.628Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 251cef6e-e5c5-5ab0-a56b-b3e2a5c83549"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:51:50.616Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 8188e51b-ed42-52ea-9564-b997ad748d13"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:53:12.584Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 1f571d8a-7537-505d-801e-410635bff4b2"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:54:48.887Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "senior-boa-api-refresh-halo-cache-result-command-dlq", "namespace": "senior-boa", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 155cc636-e5e6-5755-b2a9-9abaa50e3865"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T21:55:56.469Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 12d34d8f-9172-5fa6-b127-1103d05e5e10"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T22:01:24.219Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "extra-vicuna-api-send-security-price-update-to-domo-dlq", "namespace": "extra-vicuna", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 3bd46d5a-73ec-568f-b9f9-31d63198d183"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227
2023-03-15T22:05:44.290Z    ERROR   controller.queue    Reconciler error    {"reconciler group": "sqs.services.k8s.aws", "reconciler kind": "Queue", "name": "senior-boa-api-refresh-halo-cache-result-command-dlq", "namespace": "senior-boa", "error": "AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist for this wsdl version.\n\tstatus code: 400, request id: 661a9ae2-c926-5211-b4bb-c66e8a5e217d"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227

jaypipes commented 1 year ago

@brandonphan @jessebye are you seeing this only with the SQS controller? or have you seen this with other controllers?

jessebye commented 1 year ago

I've consistently observed this with SQS controller. I have a suspicion it's happened with other controllers too, but don't have enough data to confirm that. @brandonphan probably has a better idea of that.

jessebye commented 1 year ago

@jaypipes one additional thing to note, we often tear down dozens of queues all at once. That's typically when I observe this problem - so maybe some kind of race condition?

jaypipes commented 1 year ago

@jessebye ok, thanks for the info. We'll try to set up a test that tears down many resources at once and try to reproduce the issue!

a-hilaly commented 1 year ago

@jessebye I just tried to create and delete 100 queues using one command line and I didn't see any similar issues. All the queues were also successfully deleted from both AWS and the API server. Can you provide us a manifest of the queues you're creating? Maybe I'm missing something here.

Details of the experiment: generate 100 queues, using go run main.go > queues.yaml (code below)

package main

import (
    "fmt"
    "strconv"
    "strings"
)

const queueTemplate = `apiVersion: sqs.services.k8s.aws/v1alpha1
kind: Queue
metadata:
  name: $NAME
spec:
  queueName: $NAME
  delaySeconds: "0"
`

const delimiter = "---\n"

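func main() {
    // Build one multi-document YAML stream containing 100 Queue manifests.
    var sb strings.Builder
    for i := 0; i < 100; i++ {
        name := "ackqueue-" + strconv.Itoa(i)
        sb.WriteString(strings.ReplaceAll(queueTemplate, "$NAME", name))
        sb.WriteString(delimiter)
    }
    fmt.Println(sb.String())
}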

Apply and delete everything from a K8s cluster using:

k apply -f queues.yaml
# wait for the creation of all the queues
k delete queues --all

brandonphan commented 1 year ago

@A-Hilaly these are our typical queue configs. I ran your script and it worked perfectly, so maybe it has something to do with our configuration.

apiVersion: sqs.services.k8s.aws/v1alpha1
kind: Queue
metadata:
  name: test-api-provision-enterprise-user-command
  namespace: test
  labels:
    app: api
    owner: shared
    version: "0.0.0"
    helm.sh/chart: api-0.0.0
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "0.0.0"
    app.kubernetes.io/part-of: api
    app.kubernetes.io/name: api
    app.kubernetes.io/component: service
spec:
  queueName: test-api-provision-enterprise-user-command
  delaySeconds: "0"
  tags:
    Cluster: test-dev
    Environment: dev
    Managed By: sqs-controller
    Managed By Repo: api
    Namespace: test
    Owner: shared
  visibilityTimeout: "30"
  redrivePolicy: |
    {
      "deadLetterTargetArn": "arn:aws:sqs:us-east-2:REDACTED:test-api-provision-enterprise-user-command-dlq",
      "maxReceiveCount": "3"
    }
---
apiVersion: sqs.services.k8s.aws/v1alpha1
kind: Queue
metadata:
  name: test-api-provision-enterprise-user-command-dlq
  namespace: test
  labels:
    app: api
    owner: shared
    version: "0.0.0"
    helm.sh/chart: api-0.0.0
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "0.0.0"
    app.kubernetes.io/part-of: api
    app.kubernetes.io/name: api
    app.kubernetes.io/component: service
spec:
  queueName: test-api-provision-enterprise-user-command-dlq
  delaySeconds: "0"
  tags:
    Cluster: test-dev
    Environment: dev
    Managed By: sqs-controller
    Managed By Repo: api
    Namespace: test
    Owner: shared

@jaypipes I've definitely noticed other leftover resources (roles, policies, etc.) but I've only seen abnormal logging with queues since our workloads deal with so many of them I assume.

a-hilaly commented 1 year ago

@jessebye @jaypipes I just tried again with 200 complex queues with a lot of attributes set, and couldn't reproduce. However, I tried to run a simple aws sqs delete-queue twice on the same resource and I got a similar error: when calling the DeleteQueue operation: The specified queue does not exist for this wsdl version.

I also tried to run an aws sqs get-queue-attributes on a nonexistent queue and I got the exact same error you have in your logs: (AWS.SimpleQueueService.NonExistentQueue) when calling the GetQueueAttributes operation: The specified queue does not exist for this wsdl version.

Anyway, the above gives me the impression that something else (probably another controller or system) is making the deletion. Can you please double check (and if possible share with us) the helm charts, the controller deployment.spec.replicas, or any details about another system that might be messing with SQS queues (CFN, Terraform, etc.)?

jessebye commented 1 year ago

@A-Hilaly I checked but we only have one replica for each AWS controller we are using. We are using the Helm charts provided in the repos with no overrides.

a-hilaly commented 1 year ago

Hi @jessebye, in this case I think the only way to know what's really happening is to check the CloudTrail logs and verify who is making delete calls at the same time as the ACK controller. Normally ACK controllers set their own user agent, see https://github.com/aws-controllers-k8s/runtime/blob/main/pkg/runtime/session.go#L85-L94.
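One way to run that check, sketched in Go under a few assumptions: the CloudTrail log file has been downloaded as JSON with a top-level `Records` array, each record carries `eventName` and `userAgent` fields, and the ACK user agent contains the marker `aws-controllers-k8s` (confirm the exact string against the session.go link above). The `foreignDeletes` helper and the sample records are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// record holds the few CloudTrail event fields this sketch looks at.
type record struct {
	EventTime string `json:"eventTime"`
	EventName string `json:"eventName"`
	UserAgent string `json:"userAgent"`
}

// trail mirrors the top-level shape of a CloudTrail log file.
type trail struct {
	Records []record `json:"Records"`
}

// foreignDeletes returns the DeleteQueue events whose user agent does not
// contain marker, i.e. deletes that did not come from the controller.
func foreignDeletes(raw []byte, marker string) ([]record, error) {
	var t trail
	if err := json.Unmarshal(raw, &t); err != nil {
		return nil, fmt.Errorf("parsing CloudTrail JSON: %w", err)
	}
	var out []record
	for _, r := range t.Records {
		if r.EventName == "DeleteQueue" && !strings.Contains(r.UserAgent, marker) {
			out = append(out, r)
		}
	}
	return out, nil
}

// sample is fabricated demo data, not taken from the logs in this thread.
const sample = `{"Records":[
 {"eventTime":"2023-03-15T21:50:25Z","eventName":"DeleteQueue","userAgent":"aws-controllers-k8s/demo"},
 {"eventTime":"2023-03-15T21:50:25Z","eventName":"DeleteQueue","userAgent":"aws-cli/2.0"},
 {"eventTime":"2023-03-15T21:50:26Z","eventName":"GetQueueAttributes","userAgent":"aws-cli/2.0"}
]}`

func main() {
	hits, err := foreignDeletes([]byte(sample), "aws-controllers-k8s")
	if err != nil {
		panic(err)
	}
	for _, r := range hits {
		fmt.Println(r.EventTime, r.EventName, r.UserAgent)
	}
}
```

Any event this prints is a DeleteQueue call made by something other than the controller, which is the case the CloudTrail check is meant to rule in or out.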

jessebye commented 1 year ago

@A-Hilaly I was wondering if there might be a simple solution to this. In the resource manager Delete function, could it check the error and, only if the error is NonExistentQueue, ignore it and continue finalizing the resource?

I can't think of any good reason why the K8S resource should stick around in an errored terminating state if the AWS resource is gone?
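A minimal sketch of that idea in Go, assuming the SDK error exposes a service error code. The `apiError` type, `isNotFound` and `deleteQueue` are hypothetical helpers rather than ACK or AWS SDK APIs; the error code string is the one that appears in the logs earlier in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// codeQueueNotFound is the error code seen in the controller logs above.
const codeQueueNotFound = "AWS.SimpleQueueService.NonExistentQueue"

// apiError is a hypothetical stand-in for an SDK error with a service code.
type apiError struct {
	Code    string
	Message string
}

func (e *apiError) Error() string { return e.Code + ": " + e.Message }

// isNotFound reports whether err, or any error it wraps, says the queue is gone.
func isNotFound(err error) bool {
	var ae *apiError
	return errors.As(err, &ae) && ae.Code == codeQueueNotFound
}

// deleteQueue calls del and treats "already gone" as success, so the caller
// can go on to remove the finalizer instead of retrying forever.
func deleteQueue(name string, del func(string) error) error {
	err := del(name)
	if err == nil || isNotFound(err) {
		return nil
	}
	return fmt.Errorf("deleting queue %q: %w", name, err)
}

func main() {
	gone := func(string) error {
		return &apiError{Code: codeQueueNotFound, Message: "The specified queue does not exist"}
	}
	denied := func(string) error {
		return &apiError{Code: "AccessDenied", Message: "not allowed"}
	}
	fmt.Println(deleteQueue("demo-dlq", gone))   // not-found is swallowed
	fmt.Println(deleteQueue("demo-dlq", denied)) // other errors still surface
}
```

Only the not-found code is swallowed here; every other error is still returned, so the finalizer stays in place and the delete is retried.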

a-hilaly commented 1 year ago

could it check the error and, only if the error is NonExistentQueue, ignore it and continue finalizing the resource?

Technically that is already handled behind the scenes in the delete code path: https://github.com/aws-controllers-k8s/runtime/blob/main/pkg/runtime/reconciler.go#L738-L746

I can't think of any good reason why the K8S resource should stick around in an errored terminating state if the AWS resource is gone?

I agree with you... happy to go modify the internals of runtime or code-generator if needed. I just want to find out whether any races are causing this issue, since I'm not able to reproduce it.

a-hilaly commented 1 year ago

Now looking at the SQS code, I'm not sure how correct this block is ... https://github.com/aws-controllers-k8s/sqs-controller/blob/main/pkg/resource/queue/sdk.go#L79-L81 ... Also, I'm not able to find NonExistentQueue in the SQS errors documentation.

/cc @jaypipes

jessebye commented 1 year ago

Yeah, it's been very difficult to reproduce. If I create/delete ~100 queues in sequence, it won't happen. However, if I create the queues from a Helm chart, then uninstall the chart, the problem happens probably 10% of the time. :thinking:

jessebye commented 1 year ago

@A-Hilaly looks like the correct code is sqs.ErrCodeQueueDoesNotExist

a-hilaly commented 1 year ago

And that is not even mentioned in the SQS docs.. SMH.. Alright, here's what I'll do: I will create a Helm chart with 200 queues, deploy it, instantly uninstall it, and repeat that 20 times to see what happens. I really wanna reproduce this lol. Worst case I'll try with some multi-threaded Go program, maybe that will trigger something.

jessebye commented 1 year ago

I'll give it my best shot too and will let you know if I find a way to reproduce it reliably.

jessebye commented 1 year ago

Something just occurred to me. We use DLQs with many of our queues. When we create queues, we create them all simultaneously; queue and associated DLQ. The queue fails to apply because the DLQ doesn't exist yet, but then the sqs-controller re-attempts a reconcile and the queue gets applied correctly.

I have no idea if that might affect teardown, but it's something to consider.

jessebye commented 1 year ago

Ok, I've been creating sets of 1,500 queues and tearing them down, all in parallel, using Helm... still no errors. THEN I created 15 ArgoCD applications with the same Helm chart, and the first one to tear down had the error. Really baffled how ArgoCD uninstalling the Helm chart would make any difference :confused:

jessebye commented 1 year ago

Bingo! I think I've found a way to reproduce it:

  1. Create a bunch of queues.
  2. Use kubectl delete queue --cascade='foreground' to delete the queues.
  3. Observe some queues don't get deleted, and sqs-controller starts logging the NonExistentQueue error.

As it turns out, Argo defaults to foreground cascade. That's why the problem happens there, but not when we were doing other operations with kubectl and helm.

a-hilaly commented 1 year ago

That was the trick we needed! thank you @jessebye ! Here are some extra logs I could catch during the deletion process:

2023-04-03T16:05:19.985+0200    DEBUG   ackrt   <<<< kc.Patch (metadata + spec) {"account": "771174509839", "role": "", "region": "us-west-2", "kind": "Queue", "namespace": "default", "name": "ackqueue-0", "generation": 2, "error": "Queue.sqs.services.k8s.aws \"ackqueue-0\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"foregroundDeletion\"}"}
2023-04-03T16:05:19.985+0200    DEBUG   ackrt   <<< r.patchResourceMetadataAndSpec  {"account": "771174509839", "role": "", "region": "us-west-2", "kind": "Queue", "namespace": "default", "name": "ackqueue-0", "generation": 2, "error": "Queue.sqs.services.k8s.aws \"ackqueue-0\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"foregroundDeletion\"}"}
2023-04-03T16:05:19.985+0200    DEBUG   ackrt   << r.setResourceUnmanaged   {"account": "771174509839", "role": "", "region": "us-west-2", "kind": "Queue", "namespace": "default", "name": "ackqueue-0", "generation": 2, "error": "Queue.sqs.services.k8s.aws \"ackqueue-0\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"foregroundDeletion\"}"}
2023-04-03T16:05:19.985+0200    DEBUG   ackrt   < r.deleteResource  {"account": "771174509839", "role": "", "region": "us-west-2", "kind": "Queue", "namespace": "default", "name": "ackqueue-0", "generation": 2, "error": "Queue.sqs.services.k8s.aws \"ackqueue-0\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"foregroundDeletion\"}"}
2023-04-03T16:05:19.985+0200    ERROR   Reconciler error    {"controller": "queue", "controllerGroup": "sqs.services.k8s.aws", "controllerKind": "Queue", "Queue": {"name":"ackqueue-0","namespace":"default"}, "namespace": "default", "name": "ackqueue-0", "reconcileID": "ed30c91e-4bd2-456d-84ab-f0843b31902f", "error": "Queue.sqs.services.k8s.aws \"ackqueue-0\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"foregroundDeletion\"}"}

I'm not sure exactly what's happening, but my initial read is that the controller is trying to add foregroundDeletion when it shouldn't. This is very likely a runtime issue and it's impacting all the controllers as well. I will cut a special ticket for it :) (and it should be a high priority one). Thank you folks for reporting and helping to reproduce this issue!
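To make the API server message in those logs concrete, here is a small Go model of the rule it quotes: once an object is being deleted, an update may remove finalizers but may not add any. The `validateFinalizers` and `without` helpers, the finalizer `example.com/cleanup`, and the "stale copy" scenario in `main` are assumptions for illustration only; this is one way the error could arise, not a confirmed root cause.

```go
package main

import "fmt"

// validateFinalizers mimics the rule quoted in the logs above: while an object
// is being deleted, the proposed finalizers must be a subset of the stored ones.
func validateFinalizers(deleting bool, stored, proposed []string) error {
	if !deleting {
		return nil
	}
	have := make(map[string]bool, len(stored))
	for _, f := range stored {
		have[f] = true
	}
	var added []string
	for _, f := range proposed {
		if !have[f] {
			added = append(added, f)
		}
	}
	if len(added) > 0 {
		return fmt.Errorf("no new finalizers can be added if the object is being deleted, found new finalizers %q", added)
	}
	return nil
}

// without returns a copy of finalizers with name removed.
func without(finalizers []string, name string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	const own = "example.com/cleanup" // placeholder controller finalizer

	// Hypothetical scenario: the controller works from an older copy that still
	// lists foregroundDeletion, while the stored object no longer has it.
	stale := []string{own, "foregroundDeletion"}
	stored := []string{own}

	// Writing back the stale list minus our finalizer re-adds foregroundDeletion.
	fmt.Println(validateFinalizers(true, stored, without(stale, own)))

	// Removing only our finalizer from the current list is accepted.
	fmt.Println(validateFinalizers(true, stored, without(stored, own)))
}
```

The first call is rejected and the second is accepted, which matches the shape of the error in the logs: the patch is refused because of a finalizer the controller does not own.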

jessebye commented 1 year ago

@A-Hilaly Amazing, glad that we could help track it down!

ack-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale

jessebye commented 1 year ago

/remove-lifecycle stale

ack-bot commented 8 months ago

Issues go stale after 180d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 60d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale