Elastic Serverless Forwarder

Misconfigured ESF input results in a warning message, not events sent to replay queue #696

Closed asmith-elastic closed 4 months ago

asmith-elastic commented 5 months ago

There is a potential for data loss when configuring an incorrect event_id within an ESF configuration. If the ID in the config does not match the ID of the SQS queue that triggered the ESF Lambda, then the Lambda will not send the event to Elastic, nor will it send it to the replay or continuing queue. It follows this warning with an info message saying that it processed all the events.

As it currently stands, I can see in the source code a very explicit check for whether the event ID matches the input ID:

event_input = config.get_input_by_id(input_id)
if event_input is None:
    shared_logger.warning("no input defined", extra={"input_id": input_id})
    continue

Ref: https://github.com/elastic/elastic-serverless-forwarder/blob/main/handlers/aws/handler.py#L387

In this snippet, you can see that when the Lambda handler cannot find a matching input configuration for the given input_id, it logs a warning and then continues to the next record by executing the continue statement.

The continue statement immediately skips the remaining code for the current iteration of the loop and jumps to the next iteration. This means that all logic that could potentially place the message onto the replay queue is bypassed for this event. The code simply moves on to the next record without further handling the problem. This behavior implies that the Lambda function has been designed intentionally (or maybe unintentionally) not to replay events if it cannot find the input configuration. The rationale behind this could be that replaying an event for which there is no configuration would perpetually fail, as the situation won’t change with a retry.
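To make the control flow concrete, here is a simplified, hypothetical sketch of the loop behaviour described above. The names (process_records, known_input_ids) are illustrative and not ESF's actual API; the point is how `continue` bypasses any replay-queue logic later in the loop body:

```python
# Simplified sketch of the handler loop; names are illustrative, not ESF's API.

def process_records(records, known_input_ids):
    """Split records into those shipped onward and those silently skipped."""
    shipped, skipped = [], []
    for record in records:
        input_id = record["eventSourceARN"]
        if input_id not in known_input_ids:
            # Equivalent of shared_logger.warning("no input defined", ...):
            # `continue` jumps straight to the next record, so the replay-queue
            # code further down the loop body never runs for this event.
            skipped.append(record)
            continue
        shipped.append(record)  # stand-in for forwarding to Elastic
    return shipped, skipped
```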

constanca-m commented 5 months ago

Hey @asmith-elastic,

In issue https://github.com/elastic/enhancements/issues/21296 you linked a document, but I cannot open the upload. Could you upload it again here? I am curious to see what is in it; I think it could help clarify a few things, mainly the no input defined message and what is inside input_id.

Lastly, I am a bit confused about this issue. When you mention a misconfigured ID, do you mean the input id in the config.yaml file? I don't think a non-existent ID could trigger anything on ESF, hence the reason there are no logs for it.

constanca-m commented 5 months ago

My confusion is mainly because I can reproduce this issue with a message coming from a correct SQS ID, but not the other way around. And from what I have seen, this does not cause any loss of data.

For example, I am sending the message Some message through my SQS queue. I printed the records we are getting here: https://github.com/elastic/elastic-serverless-forwarder/blob/30c41f4547aaf09080bd7babacb20f34c6b2cc03/handlers/aws/handler.py#L374

For the first trigger, I get just one record:

  {
    "messageId":"e296ddcc-b41c-45fe-a478-84dd8d381f8a",
    "receiptHandle":"AQEBcJAmcT72XyE9NU8okjD/6Psk0X/CP8b2DQ8M3gUTPZ8Y4wPKsuSWEoPScAWPlQ4LaIMd/PBGEAJI/6t5h/qy/4gCB62JW9Hgx+jq/hOLzltFHvKyyo9BEqg7W+eQY9FM9qjdLzIajxs7cliG6Cfkp0HV18Kaeyc/bIi5sn4k5WtG7z/jKIQKM9HatUR9rk6mZuoIakJoCc/kz89ygREIM6MFu98iqlWzJnA5kBtFHLQPoDfJFbmVXF0hv0GkCc8FBypt2p2/QJhm8hsI46EFTiztT4tQdMqaXZoLu8cPaU8nnDPxCpmHYU9CVqROWzo9e7KWFhZRrrkS+qCOSoQKXy0E3bzkVA4/pOoJP6pgz1QH5Y2kBYzLnpj20s1+QmB7Ch39m1Z1rtsBpzyvYSSZ8g==",
    "body":"Some message",
    "attributes":{
      "ApproximateReceiveCount":"1",
      "SentTimestamp":"1714744983017",
      "SenderId":"AIDAZEDJODE3IITEGJ2XC",
      "ApproximateFirstReceiveTimestamp":"1714744983022"
    },
    "messageAttributes":{

    },
    "md5OfBody":"262b4ed09d824f4baffc7e60b319acac",
    "eventSource":"aws:sqs",
    "eventSourceARN":"arn:aws:sqs:eu-west-2:627286350134:constanca-esf-issue-696",
    "awsRegion":"eu-west-2"
  }

And the state is then uploaded to the continuing queue. We can see that from the lambda is going to shutdown, continuing on dedicated sqs queue message in the logs, so we know we hit the timeout. I printed some lines, and the timeout is triggered just before this line, after which we give the indication to put the data in the continuing queue: https://github.com/elastic/elastic-serverless-forwarder/blob/30c41f4547aaf09080bd7babacb20f34c6b2cc03/handlers/aws/handler.py#L451

I am still not sure if this is the right way to handle it though...

But moving on, the next SQS record is:

{
    "messageId":"3959ac97-3bed-4bef-89b9-0686b0b00d9d",
    "receiptHandle":"AQEBAitgjKCHzcXJ45KD5bRzo1iWVnROlJm7KW26VcHSAtgU+EfTFhxQ3t2gylZmHDxpRadWnZK+nABPd8aggFj68fJaSZXjrsO5C9qPaUqIA4ZVK2c0Qivcsf2I/T9nvEUuAZsy9B0Svek1aBYcZ+13qBiksXr+zC6fppwOqWq8cRpbCrpSWXHU6b01O6kOo3uf9q2Lr9/ImXzyYbj1pyESYQBrcBJ/xjLcWrFnjIdS2K/Q4UKUHPE8bT8jMa5OObERgQF/6F7MvOhOy9VwG/naZca9ZzFyWGoP7PmrAEfaLz9UU/EYAPYRSjpThpI7deIfWtWvD1qa1hkN/5wV8Ew3Y4adczsw2JXzRvxjrPF4bv1TxgOoikpD98RS21fc2oLvY6MYJSiQCHRpz2FhHvvYP2GrsHNu1ei+dqlWevK34aM=",
    "body":"Some message",
    "attributes":{
      "ApproximateReceiveCount":"1",
      "AWSTraceHeader":"Root=1-6634ee97-daff1c897690e29b2b10bf15;Parent=7458a2ee427e6b27;Sampled=0;Lineage=6337b594:0",
      "SentTimestamp":"1714744985023",
      "SenderId":"AROAZEDJODE3GJ6YJI46K:constanca-test-esf",
      "ApproximateFirstReceiveTimestamp":"1714744985028"
    },
    "messageAttributes":{
      "originalLastEndingOffset":{
        "stringValue":"12",
        "stringListValues":[

        ],
        "binaryListValues":[

        ],
        "dataType":"Number"
      }
    },
    "md5OfBody":"262b4ed09d824f4baffc7e60b319acac",
    "md5OfMessageAttributes":"1498dc83cd0d6d90f5542f2158be7425",
    "eventSource":"aws:sqs",
    "eventSourceARN":"arn:aws:sqs:eu-west-2:627286350134:constanca-test-esf-continuing-queue",
    "awsRegion":"eu-west-2"
  }

Which then fails. But there is no more data to read from there, which is why I don't understand what data loss we are experiencing.

constanca-m commented 5 months ago

Steps to reproduce this issue:

  1. Write the wrong id on the config.yaml.
  2. Set the aws_lambda_event_source_mapping with the correct SQS id.

The input ids recognized by ESF come from the ones specified in config.yaml. This happens on this line: https://github.com/elastic/elastic-serverless-forwarder/blob/30c41f4547aaf09080bd7babacb20f34c6b2cc03/handlers/aws/handler.py#L72

When parsing the config, we iterate over the entries and add the id to a list of inputs.

Then when ESF gets a trigger, in this case from SQS, we read the input id from the EventSourceArn field of the SQS message and compare it against the ids we got from config.yaml: https://github.com/elastic/elastic-serverless-forwarder/blob/30c41f4547aaf09080bd7babacb20f34c6b2cc03/handlers/aws/handler.py#L380-L387
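The matching can be pictured with a small sketch. This assumes config.yaml has already been parsed into a dict (the real ESF code builds input objects from the YAML, but the lookup idea is the same; all names here are illustrative):

```python
# Hypothetical sketch of the id matching described above. Assumes config.yaml
# has already been parsed into a dict of input entries.

def get_input_by_id(parsed_config, input_id):
    """Return the config entry whose id matches the trigger's
    EventSourceArn, or None when nothing in config.yaml matches."""
    for entry in parsed_config.get("inputs", []):
        if entry.get("id") == input_id:
            return entry
    return None
```

With a config entry of `id: arn:aws:sqs:eu-west-2:123:my-queue`, a record whose eventSourceARN carries any other value returns None, which is exactly the branch that logs "no input defined" and skips the record.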

If they don't match, nothing else happens, as mentioned in the description of this issue.

We are following the same logic for all triggers, so maybe they should all be changed as well.

I will try to refactor the code to do what was correctly suggested by @asmith-elastic.

The messages SHOULD hit the replay queue. If messages hit the replay queue, they would eventually hit the DLQ and there would be no data loss.
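A minimal sketch of that refactor might look like the following. The helper names here are hypothetical (the real change would use ESF's existing replay-queue machinery and payload format); the point is replacing the bare `continue` with a hand-off to the replay queue:

```python
# Hypothetical sketch of the suggested fix: route records with an unknown
# input id to the replay queue instead of skipping them with `continue`.
# `send_to_replay` stands in for ESF's real replay-queue sender.

def handle_record(record, configured_ids, send_to_replay):
    input_id = record["eventSourceARN"]
    if input_id not in configured_ids:
        # Previously: log "no input defined" and `continue` (event dropped).
        # Now: hand the event to the replay queue, so SQS redrive can
        # eventually move it to the DLQ instead of silently losing it.
        send_to_replay(record)
        return "replayed"
    return "forwarded"
```

In production the sender would be something like an SQS send_message call targeting the replay queue; in the sketch any callable works, which also makes the branch easy to unit test.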