elastic / elastic-package

elastic-package - Command line tool for developing Elastic Integrations

[Discuss] Store an encoded copy of the original document for fields validation #2016

Open · jsoriano opened this issue 3 months ago

jsoriano commented 3 months ago

Summary

Store an encoded copy of the original document with a processor in the final pipeline, before ingestion, and use this copy to validate fields and to generate sample documents, instead of rebuilding the document from the ingested data.

Split field validations into two sets: one that uses this encoded copy, and another one for the indexed data.

Background

When validating fields we use the documents as they are stored in Elasticsearch. With the adoption in packages of features like constant_keyword, runtime fields, synthetic index mode, or index: false, it can be difficult to rebuild the original document. Some mappings can also introduce additional multi-fields, which in some cases we are ignoring, or have to ignore.

We now have quite a lot of code attempting to handle all these cases, plus the corner cases that arise from combinations of them. Every time a new feature of this kind is added, new corner cases appear.

Going back to the original objectives of these tests, we want to validate these two things:

  1. That the package is generating the expected data.
  2. That all the fields available to users are documented.

With the current approach of checking the ingested documents as returned by the search API, we are missing the first point: in many cases we don't have the data the package is generating, and we have to attempt to rebuild the documents from the indexed data.

So the proposal would be to explicitly split validations in two:

  1. Validations on the data generated by the package. They should be based on the resulting data after applying all the pipelines, but before ingesting. For that we need some way to store this data; this could be done by storing an encoded copy of the document in the document itself, with a processor in the final_pipeline.
  2. Validations on the ingested data, similar to the ones we have now, but they could be relaxed to validate only that the fields are documented, and ignore their values.

Some tests would run only one of these sets of validations, and others both. The encoded copy could additionally be used for the generation of sample documents.
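To make the split more concrete, here is a minimal sketch of how the two passes could look on the elastic-package side. All names used here (the fieldsValidator interface, its methods, and validateTestDocument) are hypothetical and only illustrate the idea; they are not existing elastic-package APIs:

```
// Sketch only: names and signatures are illustrative, not part of elastic-package today.
package fields

import "fmt"

// fieldsValidator is a stand-in for the fields validator in elastic-package;
// its two methods model the two proposed validation passes.
type fieldsValidator interface {
	// Full validation of field names, types and values of the document as
	// generated by the package (pass 1).
	ValidateDocumentMap(doc map[string]interface{}) []error
	// Relaxed check used for the indexed data (pass 2): is the field documented?
	IsFieldDocumented(key string) bool
}

// validateTestDocument sketches the proposed split. originalDoc is the copy of
// the document taken before ingestion (however it ends up being encoded), and
// indexedKeys are the flattened field names found in the indexed document.
func validateTestDocument(v fieldsValidator, originalDoc map[string]interface{}, indexedKeys []string) []error {
	var errs []error

	// 1. Validations on the data generated by the package, after all
	//    pipelines but before ingestion.
	errs = append(errs, v.ValidateDocumentMap(originalDoc)...)

	// 2. Validations on the ingested data, relaxed to only check that every
	//    indexed field is documented, ignoring values.
	for _, key := range indexedKeys {
		if !v.IsFieldDocumented(key) {
			errs = append(errs, fmt.Errorf("field %q is present in the index but not documented", key))
		}
	}
	return errs
}
```

The first pass needs the original document, which is where the encoded copy stored by the final pipeline would come in; the second pass only needs the list of field names actually present in the index.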

mrodm commented 1 month ago

I did some manual tests in a local Elastic stack, updating the final_pipeline managed by Fleet. Specifically, the test consisted of adding a new Script processor at the end of the final pipeline:

```
def newField = [];
for (entry in ctx.entrySet()) {
  // Just tested locally with a test package, and these fields were present.
  // It should be tested with other packages to check if they can be removed
  // safely from the final field or not.
  def key = entry.getKey();
  if (key == "_version" || key == "_index" || key == "_version_type" || key == "_id") {
    continue;
  }
  // Store each top-level field as a "key: value" string.
  newField.add(key + ": " + entry.getValue());
}
ctx['doc.before_ingested'] = newField;
```

Some fields are filtered out in that code, since it looks like they are added during ingestion or taken from the document metadata itself.

Adding that script processor results in a new field like this:

  {
    "doc.before_ingested": [
      "agent: {name=elastic-agent-83113, id=4711410d-f9bf-416b-8b5e-eb829b9866c1, type=metricbeat, ephemeral_id=6a54ede2-9e62-4b0b-ad0e-2614c02e489b, version=8.15.2}",
      "@timestamp: 2024-10-03T14:27:14.869Z",
      "nginx: {stubstatus={hostname=svc-nginx:80, current=10, waiting=0, accepts=343, handled=343, writing=1, dropped=0, reading=0, active=1, requests=378}}",
      "ecs: {version=8.0.0}",
      "service: {address=http://svc-nginx:80/server-status, type=nginx}",
      "data_stream: {namespace=81181, type=metrics, dataset=nginx.stubstatus}",
      "elastic_agent: {id=4711410d-f9bf-416b-8b5e-eb829b9866c1, version=8.15.2, snapshot=false}",
      "host: {hostname=elastic-agent-83113, os={kernel=6.8.0-45-generic, codename=focal, name=Ubuntu, type=linux, family=debian, version=20.04.6 LTS (Focal Fossa), platform=ubuntu}, containerized=false, ip=[172.19.0.2, 172.18.0.7], name=elastic-agent-83113, id=93db770e92a444c98362aee1860ae326, mac=[02-42-AC-12-00-07, 02-42-AC-13-00-02], architecture=x86_64}",
      "metricset: {period=10000, name=stubstatus}",
      "event: {duration=282897, agent_id_status=verified, ingested=2024-10-03T14:27:15Z, module=nginx, dataset=nginx.stubstatus}",
      "_version_type: internal", <-- filtered
      "_index: metrics-nginx.stubstatus-81181",  <-- filtered
      "_id: null", <-- filtered
      "_version: -4" <-- filtered
    ]
  }

The last 4 fields (_version_type, _index, _id and _version) are shown here only to illustrate their contents; with the script above they would not be included in the new field.

In order to avoid failures in the tests run by elastic-package, this new field also needs to be skipped during the validation performed by elastic-package. The skip can be added in skipValidationForField:
```
func skipValidationForField(key string) bool {
    return isFieldFamilyMatching("agent", key) ||
        isFieldFamilyMatching("elastic_agent", key) ||
        isFieldFamilyMatching("cloud", key) || // too many common fields
        isFieldFamilyMatching("event", key) || // too many common fields
        isFieldFamilyMatching("host", key) || // too many common fields
        isFieldFamilyMatching("metricset", key) || // field is deprecated
        isFieldFamilyMatching("event.module", key) || // field is deprecated
        isFieldFamilyMatching("doc.before_ingested", key) // field used to store the whole document before ingestion
}
```
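For context, isFieldFamilyMatching is a small helper in the elastic-package fields validator that matches a field and all of its subfields. A sketch of an equivalent check (the actual implementation in elastic-package may differ in details):

```
package fields

import "strings"

// isFieldFamilyMatching reports whether key is the given field family itself
// or one of its subfields (e.g. "host" matches "host.name").
func isFieldFamilyMatching(family, key string) bool {
    return key == family || strings.HasPrefix(key, family+".")
}
```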

Even with this new field, whose value is an encoded copy of the document, there would be similar issues, since the copy does not keep the same format as the original document. For instance, in the output above each object is flattened into a key: value string (e.g. ecs: {version=8.0.0}) rather than keeping its JSON structure.

I've tried to look for another method or processor in the ingest pipeline to transform this into a JSON string, but I didn't find any way to achieve it. Would that be possible by defining some other processor?

Would there be another option to get a copy of the document before being ingested?
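For what it's worth, if the copy could eventually be stored as a single JSON string, decoding it back on the elastic-package side should be straightforward. A minimal sketch, assuming a hypothetical doc.before_ingested field that holds a JSON-encoded document (which is not what the script above currently produces):

```
// Sketch only: assumes doc.before_ingested holds a JSON string, which is an
// open question in this discussion, not something the pipeline produces today.
package fields

import (
	"encoding/json"
	"fmt"
)

// decodeOriginalDocument extracts the hypothetical doc.before_ingested field
// from an indexed document and decodes it back into a map.
func decodeOriginalDocument(indexedDoc map[string]interface{}) (map[string]interface{}, error) {
	raw, found := indexedDoc["doc.before_ingested"]
	if !found {
		return nil, fmt.Errorf("document does not contain doc.before_ingested")
	}
	encoded, ok := raw.(string)
	if !ok {
		return nil, fmt.Errorf("doc.before_ingested is not a string")
	}
	var original map[string]interface{}
	if err := json.Unmarshal([]byte(encoded), &original); err != nil {
		return nil, fmt.Errorf("decoding original document: %w", err)
	}
	return original, nil
}
```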

Example of script processor keeping the same structure (objects, arrays, ...)

For completeness, the following script processor code would copy the document fields keeping the same structure. However, it would have the same issues when synthetic source, runtime fields or other such features are enabled:

```
// Keeping the same format, but this will have the same issues
// if synthetic source or runtime fields are enabled.
Map m = new HashMap();
for (entry in ctx.entrySet()) {
  // Just tested locally with a test package, and these fields were present.
  // It should be tested with other packages to check if they can be removed
  // safely from the final field or not.
  def key = entry.getKey();
  if (key == "_version" || key == "_index" || key == "_version_type" || key == "_id") {
    continue;
  }
  m[key] = entry.getValue();
}
ctx['doc.before_ingested_map'] = m;
```