Data schemas

When collecting data with Elastic Agent and shipping it to Elasticsearch, there are four different data schemas. This matters because the schemas look different, and during generation we must agree on which schema we are talking about. The diagram below shows schemas A, B, C and D:

flowchart LR
    D[Data/Endpoint]
    EA[Elastic Agent]
    ES[Elasticsearch]
    IP[Ingest Pipeline]
    D -->|Schema A| EA -->|Schema B| IP -->|Schema C| ES -->|Schema D| Query;

Schema A: This is the schema the Elastic Agent collects. It could be a line in a log file, the response to an HTTP request, a syslog event, etc. The Elastic Agent input knows how to handle this structure.
Schema B: This is the schema the Elastic Agent ships to Elasticsearch (to the ingest pipeline). It is a JSON document that contains the result of all the processing the Elastic Agent did on schema A. For example, the content of the log line is in the message field, and meta information about the host was added to the event.
Schema C: If an ingest pipeline exists, it converts schema B to schema C. This can mean taking apart a log message with grok or enriching data with geoip. If there is no ingest pipeline, schemas B and C are equal.
Schema D: This is the schema users write queries on in Elasticsearch. Schema C can differ from D when runtime fields are used; otherwise C and D are equal.
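
To make the difference between the schemas concrete, here is a minimal sketch in Go. It is not how Elastic Agent or an ingest pipeline is implemented: the sample log line, the helper names toSchemaB and toSchemaC, and the flat dotted field names are all assumptions chosen for illustration, and strings.Fields stands in for a grok pattern.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// toSchemaB mimics what the agent does to a raw schema A line: it stores the
// line in the message field and adds host metadata.
// The field names are illustrative assumptions.
func toSchemaB(line, hostname string) map[string]any {
	return map[string]any{
		"message":   line,
		"host.name": hostname,
	}
}

// toSchemaC mimics an ingest pipeline that takes the message apart.
// A real pipeline would use grok; strings.Fields is a simple stand-in.
func toSchemaC(b map[string]any) (map[string]any, error) {
	msg, ok := b["message"].(string)
	if !ok {
		return nil, fmt.Errorf("schema B document has no string message field")
	}
	parts := strings.Fields(msg)
	if len(parts) != 4 {
		return nil, fmt.Errorf("unexpected message format: %q", msg)
	}
	// Copy schema B and add the parsed fields on top of it.
	c := make(map[string]any, len(b)+4)
	for k, v := range b {
		c[k] = v
	}
	c["source.ip"] = parts[0]
	c["http.request.method"] = parts[1]
	c["url.path"] = parts[2]
	c["http.response.status_code"] = parts[3]
	return c, nil
}

func main() {
	// Schema A: a hypothetical raw log line.
	schemaA := "203.0.113.7 GET /index.html 200"

	b := toSchemaB(schemaA, "web-01")
	c, err := toSchemaC(b)
	if err != nil {
		panic(err)
	}
	for _, doc := range []map[string]any{b, c} {
		out, err := json.Marshal(doc)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

Without an ingest pipeline the toSchemaC step would simply be skipped, which is the case where schemas B and C are equal.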

Schema C is the one defined in integration packages, in the fields.yml files for each data stream. In the following, our focus is on schema B, with mentions of schema A.

Introduction:

Currently the corpus generator can produce only Schema C. We want to support generating Schema A and Schema B.

We figured out that in order to do so we need a way to generate data based on a template. A few investigations were made into this:

#36: a first attempt to introduce a custom template, with a syntax similar to the golang text template one, supporting only the replacement of field placeholders.
#37: a second attempt to introduce support for the golang text template engine: this allows for some scripting features that match the features available in spigot (see: https://github.com/leehinman/spigot/blob/main/pkg/generator/aws/vpcflow/vpcflow.go#L99-L107)
#38: a parallel attempt to move the scripting features to the config definition: a field can define a js function that is evaluated at runtime, and a field can access the values of the fields already generated before it, in the order they appear in the template.
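
As an illustration of what "only replacing field placeholders" means, here is a minimal sketch in Go. It is not the implementation from the first attempt: the function name renderPlaceholders, the `{{.name}}` placeholder form, and the sample field names are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// renderPlaceholders replaces every {{.name}} placeholder in tpl with the
// value generated for that field. Unknown placeholders are left untouched.
// No conditionals, loops or functions are supported: substitution only.
func renderPlaceholders(tpl string, values map[string]string) string {
	var out strings.Builder
	for {
		start := strings.Index(tpl, "{{.")
		if start < 0 {
			break
		}
		end := strings.Index(tpl[start:], "}}")
		if end < 0 {
			break
		}
		end += start // make end an index into tpl, not into tpl[start:]
		name := tpl[start+3 : end]

		out.WriteString(tpl[:start])
		if v, ok := values[name]; ok {
			out.WriteString(v)
		} else {
			out.WriteString(tpl[start : end+2])
		}
		tpl = tpl[end+2:]
	}
	out.WriteString(tpl)
	return out.String()
}

func main() {
	// Hypothetical field values, as a generator might have produced them.
	values := map[string]string{
		"srcaddr": "10.0.0.1",
		"dstaddr": "10.0.0.2",
		"packets": "12",
	}
	fmt.Println(renderPlaceholders("{{.srcaddr}} {{.dstaddr}} {{.packets}}", values))
}
```

Because each field is substituted independently, a template like this cannot express relations between fields, which is the gap the scripting-capable attempts try to close.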

Analysis:

#36 was promising: its performance is even better than that of the current generator, but it lacks scripting features. The goal of having them is to produce data that is as realistic as possible (e.g. bytes as a factor of packets, end and start dates with a given time difference, etc.). Both #37 and #38 provide this possibility, but performance-wise neither of the two can serve as the single generator. #38 in particular suffered from very bad performance and was abandoned.
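
The kind of scripting meant here can be sketched with the Go standard library text/template package. This is only an illustration, not the template used in #37: the flow struct, the mul and add helper functions, and the field layout of the line are assumptions for this example.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// flow holds hypothetical generated values for one flow log line.
type flow struct {
	SrcAddr, DstAddr string
	Packets          int
	BytesPerPacket   int
	Start            int64 // Unix seconds
	DurationSec      int64
}

// The template derives bytes from packets and the end time from the start
// time, so related fields stay consistent with each other.
const flowTpl = `{{.SrcAddr}} {{.DstAddr}} {{.Packets}} {{mul .Packets .BytesPerPacket}} {{.Start}} {{add .Start .DurationSec}}`

// Helper functions exposed to the template (names are assumptions).
var funcs = template.FuncMap{
	"mul": func(a, b int) int { return a * b },
	"add": func(a, b int64) int64 { return a + b },
}

// renderFlow executes the template for a single flow record.
func renderFlow(f flow) (string, error) {
	t, err := template.New("flow").Funcs(funcs).Parse(flowTpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, f); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	line, err := renderFlow(flow{
		SrcAddr: "10.0.0.1", DstAddr: "10.0.0.2",
		Packets: 12, BytesPerPacket: 100,
		Start: 1671594228, DurationSec: 60,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

A real generator would parse the template once and reuse it for every event; parsing inside renderFlow just keeps the sketch short.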

This PR:

We decided to go for two different generators: depending on the performance required, the consumer trades off realistic data against speed. On this path we decided to evaluate, alongside the golang text template, the most performant compiled and non-compiled template engines listed at https://github.com/SlinSo/goTemplateBenchmark: Hero and JetHTML.

We ran two different benchmarks for each template engine:

JSONContent: producing Schema C data for the "endpoint process 8.2.0" integration
VPCFlowLogs: producing Schema A data for aws vpc flow logs (beware: the memory benchmarks for Hero are misleading, since the allocations happen in a forked process)
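
For readers unfamiliar with how such numbers are obtained, here is a minimal sketch of measuring time/op, alloc/op and allocs/op for two rendering approaches with the standard library testing package. It does not reproduce the benchmarks of this PR: the record type, the two renderers and the measure helper are assumptions for illustration, and the real benchmarks are regular `go test -bench` functions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
	"text/template"
)

// record is a hypothetical event with three fields.
type record struct {
	Src, Dst string
	Packets  int
}

// The template is parsed once, outside the measured loop.
var tpl = template.Must(template.New("line").Parse("{{.Src}} {{.Dst}} {{.Packets}}"))

// renderTextTemplate renders a record with text/template.
func renderTextTemplate(r record) string {
	var sb strings.Builder
	if err := tpl.Execute(&sb, r); err != nil {
		panic(err)
	}
	return sb.String()
}

// renderConcat is a stand-in for a substitution-only engine.
func renderConcat(r record) string {
	return r.Src + " " + r.Dst + " " + strconv.Itoa(r.Packets)
}

// measure runs render in a standard benchmark loop and returns the result,
// which includes timing and allocation statistics.
func measure(render func(record) string) testing.BenchmarkResult {
	r := record{Src: "10.0.0.1", Dst: "10.0.0.2", Packets: 12}
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = render(r)
		}
	})
}

func main() {
	for _, c := range []struct {
		name   string
		render func(record) string
	}{
		{"TextTemplate", renderTextTemplate},
		{"Concat", renderConcat},
	} {
		res := measure(c.render)
		fmt.Printf("%-12s %d ns/op %d B/op %d allocs/op\n",
			c.name, res.NsPerOp(), res.AllocedBytesPerOp(), res.AllocsPerOp())
	}
}
```

The absolute numbers depend on the machine, so no expected output is given here; only the relative ordering between engines is meaningful.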

We also generated 20GB of "aws dynamodb 1.28.3" Schema C data directly from the built binaries.

Here are the results:

name                                    time/op
_GeneratorLegacyJSONContent-16          47.7µs ± 0%
_GeneratorHeroJSONContent-16             120µs ± 0%
_GeneratorCustomTemplateJSONContent-16  30.0µs ± 0%
_GeneratorJetHTMLJSONContent-16          162µs ± 0%
_GeneratorTextTemplateJSONContent-16     281µs ± 0%
_GeneratorCustomTemplateVPCFlowLogs-16  1.09µs ± 0%
_GeneratorHeroVPCFlowLogs-16            6.98µs ± 0%
_GeneratorJetHTMLVPCFlowLogs-16         8.01µs ± 0%
_GeneratorTextTemplateVPCFlowLogs-16    12.8µs ± 0%

name                                    alloc/op
_GeneratorLegacyJSONContent-16          3.82kB ± 0%
_GeneratorHeroJSONContent-16            16.9kB ± 0%
_GeneratorCustomTemplateJSONContent-16    432B ± 0%
_GeneratorJetHTMLJSONContent-16         19.6kB ± 0%
_GeneratorTextTemplateJSONContent-16    48.3kB ± 0%
_GeneratorCustomTemplateVPCFlowLogs-16   64.0B ± 0%
_GeneratorHeroVPCFlowLogs-16              493B ± 0%
_GeneratorJetHTMLVPCFlowLogs-16         1.36kB ± 0%
_GeneratorTextTemplateVPCFlowLogs-16    2.32kB ± 0%

name                                    allocs/op
_GeneratorLegacyJSONContent-16            22.0 ± 0%
_GeneratorHeroJSONContent-16              7.00 ± 0%
_GeneratorCustomTemplateJSONContent-16    14.0 ± 0%
_GeneratorJetHTMLJSONContent-16            752 ± 0%
_GeneratorTextTemplateJSONContent-16     2.23k ± 0%
_GeneratorCustomTemplateVPCFlowLogs-16    2.00 ± 0%
_GeneratorHeroVPCFlowLogs-16              7.00 ± 0%
_GeneratorJetHTMLVPCFlowLogs-16           40.0 ± 0%
_GeneratorTextTemplateVPCFlowLogs-16      95.0 ± 0%

$ time ./gen-legacy generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671594228-aws-dynamodb-1.28.3.ndjson

real    1m44.869s
user    1m6.599s
sys 0m37.354s

$ time ./gen-with-custom_template generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671611719-aws-dynamodb-1.28.3.ndjson

real    1m34.968s
user    0m55.029s
sys 0m37.175s

$ time ./gen-with-hero generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671612075-aws-dynamodb-1.28.3.ndjson

real    2m16.682s
user    1m40.784s
sys 1m57.518s
build   0m12.921s

$ time ./gen-with-jethtml generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671612253-aws-dynamodb-1.28.3.ndjson

real    4m8.893s
user    3m21.385s
sys 0m48.941s

$ time ./gen-with-text_template generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671612518-aws-dynamodb-1.28.3.ndjson

real    6m50.022s
user    6m10.642s
sys 0m50.909s

Outcome:

While Hero comes close to the speed of the custom template solution, the caveats are many: we need to build a binary and fork its execution from the generator, and in that scenario two processes run, each using the same amount of memory, which doubles the memory required. The template syntax, with its nine different block tag definitions, is also somewhat overcomplicated. Until a few days ago JetHTML had seen no new commits since Mar 5, 2021. Either the project was stable enough, or on the contrary it had too little traction to be maintained. We have to investigate this properly, since JetHTML would be the template engine to go with, but at the same time we do not want to depend on an unreliable external package.

elastic / elastic-integration-corpus-generator-tool

Templates benchmark #40