proposal: gcp: add a `gcp.<datastream>.flattened` field for each of the datastreams

We have had customer requests to retain additional fields in the GCP integration, so that fields we currently remove from documents remain available for use in detection rules. The requests concern the audit datastream in particular, but the same need could apply to others.

I propose that we rename the temporary `json` field to `gcp.<datastream>.flattened` (or similar), mapped either as `type: flattened` or with `index: false`. We would also add a boolean configuration option, `keep_json`, to the datastream UI, defaulting to false. The ingest pipeline would use this flag to remove the `gcp.<datastream>.flattened` field unless the flag is true.

For current users this would have no impact, since by default the field would not be in their ingested documents. Users who want fields that we otherwise drop can set the option to true, extract and process the fields they are interested in within their `@custom` pipeline (with an associated mapping definition), and then optionally remove the `gcp.<datastream>.flattened` field if they do not need it further.
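As a rough illustration of that workflow for the audit datastream, a `@custom` pipeline might look like the sketch below. It assumes `keep_json` is enabled and that the pipeline is named `logs-gcp.audit@custom`, following the usual integration convention. The source path `protoPayload.some_unmapped_field` and the target field `gcp.audit.custom.some_unmapped_field` are hypothetical placeholders, not fields defined by the integration.

```yaml
# Sketch only: processors for a logs-gcp.audit@custom pipeline.
# Field names marked "hypothetical" are placeholders for illustration.
processors:
  # Copy one value out of the retained document into its own field.
  - set:
      field: gcp.audit.custom.some_unmapped_field                      # hypothetical target
      copy_from: gcp.audit.flattened.protoPayload.some_unmapped_field  # hypothetical source
      ignore_empty_value: true
      if: ctx.gcp?.audit?.flattened != null
  # Optionally drop the retained document once extraction is done.
  - remove:
      field: gcp.audit.flattened
      ignore_missing: true
```

Any new target field would also need an entry in the user's own mapping definition if it should be indexed with a specific type.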

Note that this can already be achieved, with additional work, by adding an `@custom` pipeline that runs `{"json":{"field":"event.original","target_field":"_tmp_json","if":"ctx.event?.original != null"}}`, does the additional processing, and then deletes `_tmp_json` (or gives it a more durable name and adds it to the mappings as appropriate).
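Written out in YAML, that existing workaround might be sketched as follows. It only has an effect when `event.original` is still present by the time the `@custom` pipeline runs. The extraction step is left as a comment because it depends on which fields the user wants.

```yaml
# Sketch only: the current workaround as a logs-gcp.audit@custom pipeline.
processors:
  # Re-parse the original event into a temporary object.
  - json:
      field: event.original
      target_field: _tmp_json
      if: ctx.event?.original != null
  # ... user-specific processors reading from _tmp_json go here ...
  # Clean up the temporary object after extraction.
  - remove:
      field: _tmp_json
      ignore_missing: true
```

Compared with the proposal, this parses the document a second time and requires every user to maintain the parsing step themselves.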

Example change for the audit datastream:

diff --git a/packages/gcp/data_stream/audit/agent/stream/gcp-pubsub.yml.hbs b/packages/gcp/data_stream/audit/agent/stream/gcp-pubsub.yml.hbs
index 43af08afa..57a8784f9 100644
--- a/packages/gcp/data_stream/audit/agent/stream/gcp-pubsub.yml.hbs
+++ b/packages/gcp/data_stream/audit/agent/stream/gcp-pubsub.yml.hbs
@@ -27,7 +27,11 @@ tags:
 {{#contains "forwarded" tags}}
 publisher_pipeline.disable_host: true
 {{/contains}}
-{{#if processors}}
 processors:
+- add_fields:
+    target: '_conf'
+    fields:
+        keep_json: {{keep_json}}
+{{#if processors}}
 {{processors}}
 {{/if}}
diff --git a/packages/gcp/data_stream/audit/elasticsearch/ingest_pipeline/default.yml b/packages/gcp/data_stream/audit/elasticsearch/ingest_pipeline/default.yml
index 5a78745ec..fd7316588 100644
--- a/packages/gcp/data_stream/audit/elasticsearch/ingest_pipeline/default.yml
+++ b/packages/gcp/data_stream/audit/elasticsearch/ingest_pipeline/default.yml
@@ -363,8 +363,13 @@ processors:
 ##
 # clean-up
 ##
+  - rename:
+      field: json
+      target_field: gcp.audit.flattened
+      if: ctx.json != null && ctx._conf?.keep_json == true
   - remove:
       field:
+        - _conf
         - _temp
         - json
       ignore_missing: true
diff --git a/packages/gcp/data_stream/audit/fields/fields.yml b/packages/gcp/data_stream/audit/fields/fields.yml
index 027cc591b..d0e78e65d 100644
--- a/packages/gcp/data_stream/audit/fields/fields.yml
+++ b/packages/gcp/data_stream/audit/fields/fields.yml
@@ -113,3 +113,6 @@
         - name: message
           type: keyword
           description: "A developer-facing error message, which should be in English. Any user-facing  error message should be localized and sent in the google.rpc.Status.details  field, or localized by the client."
+    - name: flattened
+      type: flattened
+      description: Contains the full audit document as sent by GCP.
\ No newline at end of file
diff --git a/packages/gcp/data_stream/audit/manifest.yml b/packages/gcp/data_stream/audit/manifest.yml
index 130daabdc..7ec236667 100644
--- a/packages/gcp/data_stream/audit/manifest.yml
+++ b/packages/gcp/data_stream/audit/manifest.yml
@@ -65,6 +65,14 @@ streams:
         type: bool
         multi: false
         default: false
+      - name: keep_json
+        required: true
+        show_user: false
+        title: Keep the JSON document as `gcp.audit.flattened`
+        description: Keeps a copy of the original document as a JSON field for processing in `@custom` pipelines.
+        type: bool
+        multi: false
+        default: false
       - name: processors
         type: yaml
         title: Processors
diff --git a/packages/gcp/docs/README.md b/packages/gcp/docs/README.md
index 577421df0..9c1b16884 100644
--- a/packages/gcp/docs/README.md
+++ b/packages/gcp/docs/README.md
@@ -260,6 +260,7 @@ The `audit` dataset collects audit logs of administrative activities and accesse
 | gcp.audit.authorization_info.resource_attributes.name | The name of the resource. | keyword |
 | gcp.audit.authorization_info.resource_attributes.service | The name of the service. | keyword |
 | gcp.audit.authorization_info.resource_attributes.type | The type of the resource. | keyword |
+| gcp.audit.flattened | Contains the full audit document as sent by GCP. | flattened |
 | gcp.audit.labels | A map of key, value pairs that provides additional information about the log entry. The labels can be user-defined or system-defined. | flattened |
 | gcp.audit.logentry_operation.first | Optional. Set this to True if this is the first log entry in the operation. | boolean |
 | gcp.audit.logentry_operation.id | Optional. An arbitrary operation identifier. Log entries with the same identifier are assumed to be part of the same operation. | keyword |
diff --git a/packages/gcp/docs/audit.md b/packages/gcp/docs/audit.md
index 09038d517..d587ad23e 100644
--- a/packages/gcp/docs/audit.md
+++ b/packages/gcp/docs/audit.md
@@ -49,6 +49,7 @@ The `audit` dataset collects audit logs of administrative activities and accesse
 | gcp.audit.authorization_info.resource_attributes.name | The name of the resource. | keyword |
 | gcp.audit.authorization_info.resource_attributes.service | The name of the service. | keyword |
 | gcp.audit.authorization_info.resource_attributes.type | The type of the resource. | keyword |
+| gcp.audit.flattened | Contains the full audit document as sent by GCP. | flattened |
 | gcp.audit.labels | A map of key, value pairs that provides additional information about the log entry. The labels can be user-defined or system-defined. | flattened |
 | gcp.audit.logentry_operation.first | Optional. Set this to True if this is the first log entry in the operation. | boolean |
 | gcp.audit.logentry_operation.id | Optional. An arbitrary operation identifier. Log entries with the same identifier are assumed to be part of the same operation. | keyword |

(note to self — a local branch with this change exists)

proposal: gcp: add a `gcp.<datastream>.flattened` field for each of the datastreams #8184