jsoriano opened 3 months ago
Done some tests manually in an Elastic Stack running locally, updating the final_pipeline
managed by Fleet.
Specifically, the test consisted of adding a new Script processor at the end of the final pipeline, as follows:
def newField = [];
for (entry in ctx.entrySet()) {
  // Just tested locally with a test package, and these fields were present.
  // It should be tested with other packages to check if they can be removed
  // safely from the final field or not.
  def key = entry.getKey();
  if (key == "_version" || key == "_index" || key == "_version_type" || key == "_id") {
    continue;
  }
  newField.add(key + ": " + entry.getValue());
}
ctx['doc.before_ingested'] = newField;
Some fields are filtered out in that code, since it looks like they are added during ingestion or taken from the doc itself.
Adding that script processor results in a new field like this:
{
"doc.before_ingested": [
"agent: {name=elastic-agent-83113, id=4711410d-f9bf-416b-8b5e-eb829b9866c1, type=metricbeat, ephemeral_id=6a54ede2-9e62-4b0b-ad0e-2614c02e489b, version=8.15.2}",
"@timestamp: 2024-10-03T14:27:14.869Z",
"nginx: {stubstatus={hostname=svc-nginx:80, current=10, waiting=0, accepts=343, handled=343, writing=1, dropped=0, reading=0, active=1, requests=378}}",
"ecs: {version=8.0.0}",
"service: {address=http://svc-nginx:80/server-status, type=nginx}",
"data_stream: {namespace=81181, type=metrics, dataset=nginx.stubstatus}",
"elastic_agent: {id=4711410d-f9bf-416b-8b5e-eb829b9866c1, version=8.15.2, snapshot=false}",
"host: {hostname=elastic-agent-83113, os={kernel=6.8.0-45-generic, codename=focal, name=Ubuntu, type=linux, family=debian, version=20.04.6 LTS (Focal Fossa), platform=ubuntu}, containerized=false, ip=[172.19.0.2, 172.18.0.7], name=elastic-agent-83113, id=93db770e92a444c98362aee1860ae326, mac=[02-42-AC-12-00-07, 02-42-AC-13-00-02], architecture=x86_64}",
"metricset: {period=10000, name=stubstatus}",
"event: {duration=282897, agent_id_status=verified, ingested=2024-10-03T14:27:15Z, module=nginx, dataset=nginx.stubstatus}",
"_version_type: internal", <-- filtered
"_index: metrics-nginx.stubstatus-81181", <-- filtered
"_id: null", <-- filtered
"_version: -4" <-- filtered
]
}
The last four fields (_version_type, _index, _id and _version) are shown here to illustrate their contents, but they would not be present in the field with the script above.
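For reference, this is a minimal sketch of how such a modified pipeline could be installed programmatically with the go-elasticsearch client; in my tests the change was done manually. It assumes the Fleet-managed final pipeline is named .fleet_final_pipeline-1, and the existing processors and the script source are elided for brevity:
package main

import (
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// The body must contain all the processors of the Fleet final pipeline,
	// with the Script processor appended at the end. Existing processors and
	// the script source are elided here; see the Painless script above.
	body := `{
	  "processors": [
	    {"script": {"source": "... the Painless script above ..."}}
	  ]
	}`

	// Assumption: the Fleet-managed final pipeline is named ".fleet_final_pipeline-1".
	res, err := es.Ingest.PutPipeline(".fleet_final_pipeline-1", strings.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}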
To avoid failures in the tests run by elastic-package, this new field also needs to be skipped during the validation it performs. The skip can be added here:
func skipValidationForField(key string) bool {
return isFieldFamilyMatching("agent", key) ||
isFieldFamilyMatching("elastic_agent", key) ||
isFieldFamilyMatching("cloud", key) || // too many common fields
isFieldFamilyMatching("event", key) || // too many common fields
isFieldFamilyMatching("host", key) || // too many common fields
isFieldFamilyMatching("metricset", key) || // field is deprecated
isFieldFamilyMatching("event.module", key) || // field is deprecated
isFieldFamilyMatching("doc.before_ingested", key) // field used to store the whole document with
}
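For context, a minimal sketch of what isFieldFamilyMatching could look like (an assumption for illustration, not necessarily the actual elastic-package implementation): a key matches a family when it is the family itself or any subfield of it.
import "strings"

// isFieldFamilyMatching reports whether key is the given field family itself
// or any subfield under it (e.g. "agent" matches "agent.id").
func isFieldFamilyMatching(family, key string) bool {
	return key == family || strings.HasPrefix(key, family+".")
}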
Even with this new field, whose value is an encoded copy of the document, it would have similar issues, since it does not keep the same format as the original document. For instance, nested objects are stored using their Java toString() representation (as in host: {hostname=elastic-agent-83113, ...} above) instead of as JSON.
I've tried to look for another method/processor in the ingest pipeline to transform this into a JSON string, but I didn't find any way to achieve it. Could this be done by defining some other processor?
Would there be another option to get a copy of the document before it is ingested?
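If the pipeline could store the copy as a JSON-encoded string (the open question above), recovering the original document on the elastic-package side would be straightforward. A minimal sketch in Go, assuming a hypothetical decodeBeforeIngested helper and that doc.before_ingested holds a JSON string:
import (
	"encoding/json"
	"errors"
)

// decodeBeforeIngested is a hypothetical helper: it recovers the original
// document from the doc.before_ingested field of an indexed document,
// assuming the final pipeline stored it as a JSON-encoded string.
func decodeBeforeIngested(source map[string]interface{}) (map[string]interface{}, error) {
	encoded, ok := source["doc.before_ingested"].(string)
	if !ok {
		return nil, errors.New("doc.before_ingested is not a JSON string")
	}
	var original map[string]interface{}
	if err := json.Unmarshal([]byte(encoded), &original); err != nil {
		return nil, err
	}
	return original, nil
}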
Summary
Store an encoded copy of the original document with a processor in the final pipeline, before ingestion, and use this copy to validate fields and generate sample documents, instead of rebuilding the document from the ingested data.
Split field validations in two sets: one that uses this encoded copy, and another one for the indexed data.
Background
When validating fields we use the documents as they are stored in Elasticsearch. With the adoption of features like constant_keyword, runtime fields, synthetic index mode, or index: false in packages, it can be difficult to rebuild the original document. Some mappings can also introduce additional multi-fields, which in some cases we are ignoring, or have to ignore. We now have quite some code attempting to handle all these cases, and the corner cases in combinations between them. Every time a new feature of this kind is added, new corner cases appear.
Going back to the original objectives of these tests, we want to validate these two things:
1. That the documents generated by the package are correct.
2. That these documents are correctly ingested into Elasticsearch.
With the current approach of checking the ingested documents as returned by the search API, we are missing the first point: in many cases we don't have the data the package is generating, and we attempt to rebuild the documents from the indexed data.
So the proposal would be to explicitly split the validations in two:
1. One set of validations based on the encoded copy of the original document stored by the final_pipeline.
2. Another set of validations based on the data as it is indexed in Elasticsearch.
Some tests will do only one set of validations, others both. The encoded copy could additionally be used for the generation of sample documents.
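As a rough sketch of how this split could look in elastic-package (all names here are illustrative assumptions, not existing APIs):
// validateDocument sketches the proposed two-set validation: the original
// document recovered from the encoded copy validates what the package
// actually generated, while the indexed document validates how the data
// ends up stored in Elasticsearch.
func validateDocument(original, indexed map[string]interface{}) []error {
	var errs []error
	errs = append(errs, validateOriginalFields(original)...) // first set: before ingestion
	errs = append(errs, validateIndexedFields(indexed)...)   // second set: after ingestion
	return errs
}

// validateOriginalFields would check the document generated by the package
// against the package field definitions (hypothetical).
func validateOriginalFields(doc map[string]interface{}) []error { return nil }

// validateIndexedFields would check the indexed document against mappings
// and indexing features like constant_keyword or synthetic source (hypothetical).
func validateIndexedFields(doc map[string]interface{}) []error { return nil }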