aws / aws-node-termination-handler

Gracefully handle EC2 instance shutdown within Kubernetes
https://aws.amazon.com/ec2
Apache License 2.0

SQS Monitor should handle all kinds of InvalidInstanceID errors #1034

Closed: guessi closed this issue 2 months ago

guessi commented 2 months ago

Describe the bug

If the queue contains an event with an invalid format, NTH fails.

Steps to reproduce

Send a message containing an invalid instance ID to the queue; NTH emits errors like the following:

{"level":"error","error":"InvalidInstanceID.Malformed: Invalid id: \"arn:aws:ec2:REGION:ACCOUNT_ID:instance/INSTANCE_ID\"\n\tstatus code: 400, request id: ...","time":"...","message":"ignoring interruption event due to error"}
...
{"level":"error","error":"some interruption events for message Id ... could not be processed","time":"...","message":"error processing interruption events"}
...
{"level":"warn","event_type":"SQS_MONITOR","error":"none of the waiting queue events could be processed","time":"...","message":"There was a problem monitoring for events"}
...

// ... a few rounds later, NTH stops working
{"level":"warn","time":"...","message":"Stopping NTH - Duplicate Error Threshold hit."}

Expected outcome

NTH should ignore the error and continue its work.
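
One way to read "all kinds of InvalidInstanceID errors" is to match on the error-code prefix rather than on a single code, so that `InvalidInstanceID.Malformed`, `InvalidInstanceID.NotFound`, and related codes are all treated as skippable. The sketch below is illustrative only and is not taken from the NTH source: `codedError`, `fakeAWSError`, and `isInvalidInstanceID` are hypothetical names, and `codedError` merely mirrors the `Code()` method that `awserr.Error` exposes in aws-sdk-go v1 so the example compiles without the SDK.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// codedError is a hypothetical interface that mirrors the Code() method of
// awserr.Error (aws-sdk-go v1), so this sketch builds without the SDK.
type codedError interface {
	error
	Code() string
}

// fakeAWSError is a stand-in for an SDK error, used only for demonstration.
type fakeAWSError struct{ code, msg string }

func (e fakeAWSError) Error() string { return e.code + ": " + e.msg }
func (e fakeAWSError) Code() string  { return e.code }

// isInvalidInstanceID reports whether err, or any error it wraps, carries an
// error code in the InvalidInstanceID family (for example
// InvalidInstanceID.Malformed or InvalidInstanceID.NotFound).
func isInvalidInstanceID(err error) bool {
	var ce codedError
	if errors.As(err, &ce) {
		return strings.HasPrefix(ce.Code(), "InvalidInstanceID")
	}
	return false
}

func main() {
	err := fmt.Errorf("describing instance: %w",
		fakeAWSError{"InvalidInstanceID.Malformed", "Invalid id"})
	if isInvalidInstanceID(err) {
		// Assumed handling: log and skip the event instead of counting it
		// toward the duplicate-error threshold.
		fmt.Println("skipping event with invalid instance ID")
	}
}
```

Matching on the prefix keeps the check independent of which specific variant EC2 returns, at the cost of also skipping any future code that happens to share the prefix.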

Application Logs

See above description

Environment

LikithaVemulapalli commented 2 months ago

Hello @guessi, thanks for the PR to skip events that fail with invalid InstanceID errors. NTH is widely used by customers who have different setups and adapt it to their workloads. We follow a generic approach that can be reused; we made a lot of configuration changes earlier and are now limiting additional configuration. We are okay with merging the open PR, but we prefer not to modify the InstanceID handling to pick up an ARN instead of an ID, as the field expects an ID. I hope this explains why we cannot allow ARNs alongside InstanceIDs in future PRs. Could you please let us know why you cannot modify your workloads to accept ARNs in your fork? We appreciate the effort; let us know if we can do anything from our end. Thanks :)

A1tairai commented 2 months ago
{
  "version": "0",
  "id": "xxxxx-xxxx-xxxx-xxx-xxxxxxxxxx",
  "detail-type": "AWS Health Event",
  "source": "aws.health",
  "account": "008971659696",
  "time": "2024-07-25T12:30:50Z",
  "region": "ap-southeast-1",
  "resources": ["i-xxxxx"],
  "detail": {
    "eventArn": "arn:aws:health:us-east-1::event/EC2/AWS_EC2_PLANNED_LIFECYCLE_EVENT/AWS_EC2_PLANNED_LIFECYCLE_EVENT_xxxxxxxx",
    "service": "EC2",
    "eventTypeCode": "AWS_EC2_PLANNED_LIFECYCLE_EVENT",
    "eventTypeCategory": "scheduledChange",
    "eventScopeCode": "ACCOUNT_SPECIFIC",
    "communicationId": "xxxxxxxxx-1",
    "startTime": "Thu, 25 Jul 2024 10:59:00 GMT",
    "endTime": "Thu, 25 Jul 2024 20:00:00 GMT",
    "lastUpdatedTime": "Thu, 25 Jul 2024 07:00:09 GMT",
    "statusCode": "upcoming",
    "eventRegion": "ap-southeast-1",
    "eventDescription": [
      {
        "language": "en_US",
        "latestDescription": "AWS_EC2_PLANNED_LIFECYCLE_EVENT"
      }
    ],
    "affectedEntities": [
      {
        "entityValue": "arn:aws:ec2:ap-southeast-1:xxxxx:instance/i-xxxxxxxxxxx",
        "status": "PENDING",
        "lastUpdatedTime": "Thu, 25 Jul 2024 07:00:09 GMT"
      }
    ],
    "affectedAccount": "xxxxxxxx",
    "page": "1",
    "totalPages": "1"
  }
}

For AWS_EC2_PLANNED_LIFECYCLE_EVENT, the entity value from AWS Health is an ARN. Would it be possible to add logic that extracts the instance ID from the ARN?
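
For readers who want to do this outside NTH (for example in a fork or in a step that rewrites events before they reach the queue), the extraction itself is small. The following is a minimal sketch under the assumption that the entity value is either a bare instance ID or an ARN of the form `arn:aws:ec2:REGION:ACCOUNT_ID:instance/INSTANCE_ID`, as in the event above. `instanceIDFromEntity` is a hypothetical helper name, not an NTH function.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceIDFromEntity returns the EC2 instance ID for an entity value that is
// either a bare ID ("i-...") or an EC2 instance ARN. The boolean is false when
// the value matches neither shape.
func instanceIDFromEntity(value string) (string, bool) {
	if strings.HasPrefix(value, "i-") {
		return value, true
	}
	// An ARN has six colon-separated fields:
	// arn:partition:service:region:account:resource
	parts := strings.SplitN(value, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" || parts[2] != "ec2" {
		return "", false
	}
	const prefix = "instance/"
	if !strings.HasPrefix(parts[5], prefix) {
		return "", false
	}
	id := strings.TrimPrefix(parts[5], prefix)
	if !strings.HasPrefix(id, "i-") {
		return "", false
	}
	return id, true
}

func main() {
	// Placeholder values; the account and instance IDs are made up.
	for _, v := range []string{
		"i-0123456789abcdef0",
		"arn:aws:ec2:ap-southeast-1:123456789012:instance/i-0123456789abcdef0",
		"arn:aws:health:us-east-1::event/EC2/EXAMPLE",
	} {
		id, ok := instanceIDFromEntity(v)
		fmt.Printf("%q -> %q (ok=%v)\n", v, id, ok)
	}
}
```

Returning a boolean instead of an error keeps the caller free to decide whether an unrecognized value should be skipped or reported.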

LikithaVemulapalli commented 2 months ago

Hello @A1tairai, thanks for reaching out. As mentioned above, we do not support passing ARNs for InstanceID. NTH has already had a lot of configuration changes, which increased the complexity of the project, and the team decided not to include additional configuration changes. We are not inclined to support ARNs in NTH, as this might lead to other issues for existing customers. I hope this gives you clarity on this use case. Thanks, and please let us know if you need any help by creating another issue.