elastic-package uses error.message as an indicator of pipeline failure

efd6 commented 1 year ago

The elastic-package test failure criteria include a check for a non-present error.message field to indicate that there has not been a pipeline failure during pipeline processing. This use is at odds with the documented semantics of the field.

For error in general

These fields can represent errors of any kind.

Use them for errors that happen while fetching events or in cases where the event itself contains an error.

and for the specific field

Error message.

Together there indicate that the field should be available for use by data sources to include source-specific error states for events.

This is currently not possible because a data stream that uses error.message to hold a data stream-specific error state would be identified as an ingest/processing failure by elastic-package.

The event.kind field does support marking pipeline failures with the "pipeline_error" value. So it is possible to distinguish pipeline failures from other uses of error.message, but this is currently brittle as the pipeline failure event kind is not automatic while setting error.message is.

In order for error.message to be fully open to use in its documented semantics, elastic-package would need to not rely on its state to determine whether pipeline failure has occurred, and if event.kind is used to signal pipeline failure, that pipelines all set that field correctly in the case of a pipeline failure.

jsoriano commented 1 year ago

In order for error.message to be fully open to use in its documented semantics, elastic-package would need to not rely on its state to determine whether pipeline failure has occurred

Yes, I think it would make sense to stop relying on this, and fail only if event.kind is pipeline_error.

if event.kind is used to signal pipeline failure, that pipelines all set that field correctly in the case of a pipeline failure.

Would it make sense to automatically add an on_failure handler that adds event.kind to all pipelines managed by fleet? This could be added in build time by elastic-package or on install time by Fleet. @juliaElastic @joshdover wdyt? Do you know if we have something similar now?

jsoriano commented 1 year ago

FTR, teams have been manually adding this on_failure handler to many packages https://github.com/elastic/integrations/issues/6582

joshdover commented 1 year ago

We have a fleet final pipeline that could easily be updated to do this for all Fleet-managed pipelines: https://github.com/elastic/kibana/blob/12a10d98559fb6fee98d81c7bbb22cf654187bde/x-pack/plugins/fleet/server/constants/fleet_es_assets.ts#L18

MakoWish commented 1 year ago

FTR, teams have been manually adding this on_failure handler to many packages elastic/integrations#6582

There are several PR's still pending approval/merge, but they should all be done once those are merged.

The only thing we may want to consider changing is that they were all initially done with a set processor, but I discussed with @efd6 about using an append processor instead. The reason for suggesting a change to append is to prevent the pipeline error from removing other values from event.kind that may affect the events' validity for Detection Rules, displaying on dashboards that are filtered by event.kind, and potentially other situations as well.

ishleenk17 commented 1 year ago

Would it make sense to automatically add an on_failure handler that adds event.kind to all pipelines managed by fleet? This could be added in build time by elastic-package or on install time by Fleet.

If this is being done at Fleet level, we will not need the change done in pipelines . Eg: here

ltflb-bgdi commented 4 months ago

Is there any timeline on this? I just ran into it (again) while working on a fix for the aws.cloudfront_log integration. If skipping errors based on event.kind: pipeline_error you may add a configuration option to silence specific errors on testing.

kpollich commented 3 months ago

We don't have a timeline on this, but it seems like a reasonable improvement to take on. I'm putting this in an upcoming sprint for the 8.17 cycle tentatively.

jsoriano commented 3 months ago

Maybe we can disable the check based on error.message in cases where we can check if there were failures based on the failure store (https://github.com/elastic/elastic-package/issues/1832).

ltflb-bgdi commented 3 months ago

As a workaround I use a specific tag in the test log config and only add the error message if this tag does not exist. This is far from ideal but does the job for the moment.

IMO the best solution would still be to ignore the presence of error.message in case of event.kind: pipeline_error, or even better only rely on event.kind: pipeline_error since this is the official way to indicate pipeline errors.

Side note: As of today, not all pipelines populate the event.kind field in case of errors (e.g. aws.cloudfront_logs). Thus, errors of these pipelines do not show up in the built-in ingest error search, since this search filters for both error.message and event.kind: pipeline_error. I strongly recommend to make sure all existing pipelines add pipeline_error to the error handlers where missing.

In case where adding this field is automatically added, please make sure that it is only added to the "global" pipeline on_failure object. I.e. on_failure handlers of processors itself must not lead to pipeline_errors. Use-case. Catching a processor error, do some stuff and proceed without error must still be supported.

elastic / elastic-package

elastic-package uses error.message as an indicator of pipeline failure #1392