Closed aspacca closed 1 year ago
#### Build stats

* Start Time: 2022-12-21T10:55:50.517+0000
* Duration: 3 min 16 sec

#### Test stats :test_tube:

| Test | Results |
| ------------ | :-----------------------------: |
| Failed | 0 |
| Passed | 89 |
| Skipped | 0 |
| Total | 89 |

To re-run your PR in the CI, just comment with:

- `/test` : Re-trigger the build.
Data schemas
When collecting data with Elastic Agent and shipping it to Elasticsearch, there are 4 different data schemas. This matters because the schemas look different, and during generation we must agree on which schema we are talking about. In the diagram below, the schemas A, B, C and D are shown:
`message` field and meta information around the host was added to the event. Schema C is the one that is defined in integration packages in the `fields.yml` files for each data stream. In the following, our focus is on Schema B with mentions of Schema A.

Introduction:
Currently the corpus generator can produce only `Schema C`. We want to support generating `Schema A` and `Schema B`. We figured out that in order to do so we need a way to generate data based on a template. A few investigations were made into this:
- #36: a first attempt to introduce a custom template, similar to the golang text template syntax, supporting only the replacement of field placeholders.
- #37: a second attempt to introduce support for the golang text template engine: this allows for some scripting features that match the features available in spigot (see: https://github.com/leehinman/spigot/blob/main/pkg/generator/aws/vpcflow/vpcflow.go#L99-L107)
- #38: a parallel attempt to move the scripting features to the config definition: a field can define a js function that is evaluated at runtime, and a field can access the values of the fields already generated before it, in the order they appear in the template.
Analysis:
#36 was promising: its performance is even better than the current generator's, but it lacks scripting features. The goal of having them is to produce data that is as realistic as possible (e.g. bytes as a factor of packets, end and start dates with a given time difference, etc.). Both #37 and #38 provide this possibility but, performance-wise, neither of the two can replace the current generator as a single generator. #38 specifically suffered from very bad performance and was abandoned.
This PR:
We decided to have two different generators: depending on the performance required, the consumer will trade realistic data for speed. On this path we decided to evaluate, in parallel with the golang text template, the two most performant template engines, from the compiled and non-compiled groups, listed at https://github.com/SlinSo/goTemplateBenchmark: JetHTML and Hero.
We ran two different benchmarks for each template engine:

- `Schema C` data for the "endpoint process 8.2.0" integration
- `Schema A` data for AWS VPC flow logs (beware: the memory benchmarks for Hero are misleading, since the generation happens in a forked process)

We also generated, directly from the built binaries, 20GB of "aws dynamodb 1.28.3" `Schema C` data.

Here are the results:
Outcome:
While Hero comes close to the speed of the custom template solution, the caveats are many: we need to build a binary and fork its execution from the generator. Note also that two processes will run in such a scenario, each with the same amount of memory, doubling the memory required. The template syntax, with its nine different block tag definitions, is also somewhat overcomplicated. Until a few days ago JetHTML had had no new commits since Mar 5, 2021. It might be that the project was stable enough or, on the contrary, that it had too little traction to be maintained. We have to investigate this properly, since it would be the template engine to go with, but at the same time we don't want to depend on an unreliable external package.