elastic / elastic-serverless-forwarder

Elastic Serverless Forwarder

Better ingestion error handling #336

Open srilumpa opened 1 year ago

srilumpa commented 1 year ago

Describe the enhancement:

Currently, when a non-retryable error occurs while trying to push data into an index, the only way to know something failed is to access the CloudWatch logs and hope we can filter the ESF logs down to the relevant error.

An enhancement would be to catch the error when it happens and forward it to an Elasticsearch data stream dedicated to storing ingestion errors.

Describe a specific use case for the enhancement or feature:

Let's say some malformed data comes in, or an edge case is not handled properly by the ingest pipeline processing the data. This leaves fields in the to-be-ingested document improperly processed, causing a mapping violation. Elasticsearch will then reject the document, with the following error sent back to the ESF client:

{
  "@timestamp": "2023-01-01T00:00:00.000Z",
  "log.level": "warning",
  "message": "elasticsearch shipper",
  "_id": "doc_id",
  "ecs": {
    "version": "1.6.0"
  },
  "error": {
    "caused_by": {
      "reason": "For input string: \"-\"",
      "type": "illegal_argument_exception"
    },
    "reason": "failed to parse field [client.ip] of type [ip] in document with id 'doc_id'. Preview of field's value: '-'",
    "type": "mapper_parsing_exception"
  }
}

This error is only visible in the CloudWatch console and is not easily searchable.

However, let's say that error is caught and ingested into an Elasticsearch data stream, say logs-elastic_esf.error-default, formatted as follows:

{
  "@timestamp": "2023-01-01T00:00:00.000Z",
  "log.level": "warning",
  "error": {
    "message": "failed to parse field [client.ip] of type [ip] in document with id 'doc_id'. Preview of field's value: '-'",
    "type": "mapper_parsing_exception",
    "id": "doc_id",
    "stacktrace": "{\"reason\": \"For input string: \\\"-\\\"\",\n\"type\":\"illegal_argument_exception\"}"
  }
}
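A minimal sketch of the reshaping this would require: converting the shipper's warning log (first snippet above) into a document for the proposed logs-elastic_esf.error-default data stream (second snippet above). The helper name and field layout are illustrative only, mirroring the examples, not an actual ESF API.

```python
import json
from typing import Any, Dict

def to_error_doc(shipper_log: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a bulk-rejection log entry into the proposed error document.

    Hypothetical helper: field names follow the two JSON examples above.
    """
    err = shipper_log["error"]
    return {
        "@timestamp": shipper_log["@timestamp"],
        "log.level": shipper_log["log.level"],
        "error": {
            "message": err["reason"],
            "type": err["type"],
            "id": shipper_log["_id"],
            # keep the nested cause as a JSON string, stack-trace style
            "stacktrace": json.dumps(err.get("caused_by", {})),
        },
    }
```

The resulting document could then be indexed into the dedicated data stream, e.g. with the Elasticsearch Python client: `es.index(index="logs-elastic_esf.error-default", document=doc)`.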

This would allow Elastic stack administrators to search for ingestion errors (or, even better, be alerted on them), investigate the causes (mapping violations, faulty pipelines, ...) and fix the issue without having to rely on the platform administrators to dig up the error.

Another option would be to use APM to forward and store those issues, as some errors already seem to appear there but without the details necessary to investigate.

aspacca commented 9 months ago

@srilumpa, the original event that failed to be ingested is also sent to the replay queue (you can find its URL in the SQS_REPLAY_URL environment variable)

you can follow the documentation at https://www.elastic.co/guide/en/esf/master/aws-serverless-troubleshooting.html#_errors_during_ingestion for details

basically it is up to you whether to consume the replay queue with your own tooling, inspecting the messages that failed to be ingested, or to add the replay queue itself as a trigger of the forwarder lambda (no need to add it to the config file of the forwarder)
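For the second option, attaching the replay queue as a trigger is a one-off event source mapping; a sketch with the AWS CLI, where the function name and queue ARN are placeholders for your own values:

```shell
# Attach the replay queue as an SQS trigger of the forwarder lambda.
# Function name and queue ARN below are placeholders.
aws lambda create-event-source-mapping \
  --function-name my-esf-forwarder \
  --event-source-arn arn:aws:sqs:eu-west-1:123456789012:my-esf-replay-queue \
  --batch-size 10
```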

you can even have both: since failed ingestion can happen for reasons other than non-retryable errors (imagine your cluster suffers downtime), you could add the replay queue as a trigger. if the error is not transient, or it is not resolved after 3 retries of the message ingestion, the message that caused the error will end up in the replay queue DLQ
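If you consume the queue (or its DLQ) with your own tooling, one useful design choice is separating transient errors from non-retryable ones. A sketch, assuming the Elasticsearch error type is available in the message; the set of transient types below is an illustrative assumption based on common Elasticsearch error types, not ESF's actual retry logic:

```python
# Illustrative classification: transient errors are worth replaying,
# the rest belong in a dedicated error data stream / the DLQ.
# This set is an assumption, not ESF's actual logic.
TRANSIENT_ERROR_TYPES = {
    "es_rejected_execution_exception",  # 429, cluster under pressure
    "circuit_breaking_exception",       # temporary memory pressure
    "unavailable_shards_exception",     # shard not yet allocated
}

def is_retryable(error_type: str) -> bool:
    """Return True if the failed ingestion is worth replaying."""
    return error_type in TRANSIENT_ERROR_TYPES
```

A mapping violation like the mapper_parsing_exception shown earlier would be classified as non-retryable: replaying it can never succeed, so it could go straight to the dedicated error data stream instead.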