elastic / elastic-integration-corpus-generator-tool

Command line tool used for generating events corpus dynamically given a specific integration

Templates benchmark #40

Closed aspacca closed 1 year ago

aspacca commented 1 year ago

Data schemas

When collecting data with Elastic Agent and shipping it to Elasticsearch, there are 4 different data schemas. This matters because the schemas look different, and during generation we must agree on which schema we are talking about. The diagram below shows schemas A, B, C and D:

flowchart LR
    D[Data/Endpoint]
    EA[Elastic Agent]
    ES[Elasticsearch]
    IP[Ingest Pipeline]
    D -->|Schema A| EA -->|Schema B| IP -->|Schema C| ES -->|Schema D| Query;

Schema C is the one defined in integration packages, in the fields.yml files for each data stream. In the following, our focus is on Schema B, with mentions of Schema A.

Introduction:

Currently the corpus generator can produce only Schema C. We want to support generating Schema A and Schema B.

We figured out that, in order to do so, we need a way to generate data based on a template. A few investigative efforts were made in this direction:

Analysis:

#36 was promising: its performance is even better than the current generator's, but it lacks scripting features. The goal of having them is to produce data that is as realistic as possible (e.g. bytes as a factor of packets, start and end dates with a given time difference, etc.). Both #37 and #38 provide this possibility, but performance-wise neither of the two can serve as the single generator. #38 in particular suffered from very poor performance and was abandoned.

This PR:

We decided to provide two different generators: depending on the performance required, the consumer will trade off realistic data against speed. Along this path we decided to evaluate, in parallel with Go's text/template, the two most performant template engines, one compiled and one not, among those listed at https://github.com/SlinSo/goTemplateBenchmark: JetHTML and Hero.

We ran two different benchmarks for each template engine:

We also generated 20GB of "aws dynamodb 1.28.3" Schema C data directly from the built binaries.

Here are the results:

name                                    time/op
_GeneratorLegacyJSONContent-16          47.7µs ± 0%
_GeneratorHeroJSONContent-16             120µs ± 0%
_GeneratorCustomTemplateJSONContent-16  30.0µs ± 0%
_GeneratorJetHTMLJSONContent-16          162µs ± 0%
_GeneratorTextTemplateJSONContent-16     281µs ± 0%
_GeneratorCustomTemplateVPCFlowLogs-16  1.09µs ± 0%
_GeneratorHeroVPCFlowLogs-16            6.98µs ± 0%
_GeneratorJetHTMLVPCFlowLogs-16         8.01µs ± 0%
_GeneratorTextTemplateVPCFlowLogs-16    12.8µs ± 0%

name                                    alloc/op
_GeneratorLegacyJSONContent-16          3.82kB ± 0%
_GeneratorHeroJSONContent-16            16.9kB ± 0%
_GeneratorCustomTemplateJSONContent-16    432B ± 0%
_GeneratorJetHTMLJSONContent-16         19.6kB ± 0%
_GeneratorTextTemplateJSONContent-16    48.3kB ± 0%
_GeneratorCustomTemplateVPCFlowLogs-16   64.0B ± 0%
_GeneratorHeroVPCFlowLogs-16              493B ± 0%
_GeneratorJetHTMLVPCFlowLogs-16         1.36kB ± 0%
_GeneratorTextTemplateVPCFlowLogs-16    2.32kB ± 0%

name                                    allocs/op
_GeneratorLegacyJSONContent-16            22.0 ± 0%
_GeneratorHeroJSONContent-16              7.00 ± 0%
_GeneratorCustomTemplateJSONContent-16    14.0 ± 0%
_GeneratorJetHTMLJSONContent-16            752 ± 0%
_GeneratorTextTemplateJSONContent-16     2.23k ± 0%
_GeneratorCustomTemplateVPCFlowLogs-16    2.00 ± 0%
_GeneratorHeroVPCFlowLogs-16              7.00 ± 0%
_GeneratorJetHTMLVPCFlowLogs-16           40.0 ± 0%
_GeneratorTextTemplateVPCFlowLogs-16      95.0 ± 0%

$ time ./gen-legacy generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671594228-aws-dynamodb-1.28.3.ndjson

real    1m44.869s
user    1m6.599s
sys     0m37.354s

$ time ./gen-with-custom_template generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671611719-aws-dynamodb-1.28.3.ndjson

real    1m34.968s
user    0m55.029s
sys     0m37.175s

$ time ./gen-with-hero generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671612075-aws-dynamodb-1.28.3.ndjson

real    2m16.682s
user    1m40.784s
sys     1m57.518s
build   0m12.921s

$ time ./gen-with-jethtml generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671612253-aws-dynamodb-1.28.3.ndjson

real    4m8.893s
user    3m21.385s
sys     0m48.941s

$ time ./gen-with-text_template generate aws dynamodb 1.28.3 -t 20GB
File generated: /Users/andreaspacca/Library/Application Support/elastic-integration-corpus-generator-tool/corpora/1671612518-aws-dynamodb-1.28.3.ndjson

real    6m50.022s
user    6m10.642s
sys     0m50.909s

Outcome:

While Hero comes close to performing as fast as the custom template solution, the caveats are many: we need to build a binary and fork its execution from the generator. Note also that two processes will run in such a scenario, each with the same amount of memory, doubling the memory required. In addition, the template syntax, with its nine different block tag definitions, is a little overcomplicated.

Until a few days ago JetHTML had received no new commits since Mar 5, 2021. It might be that the project was stable enough or, on the contrary, that it had too little traction to warrant maintenance. We have to investigate this properly, since it would be the template engine to go with, but at the same time we don't want to depend on an unreliable external package.

elasticmachine commented 1 year ago

:green_heart: Build Succeeded


#### Build stats

* Start Time: 2022-12-21T10:55:50.517+0000
* Duration: 3 min 16 sec

#### Test stats :test_tube:

| Test | Results |
| ------------ | :-----------------------------: |
| Failed | 0 |
| Passed | 89 |
| Skipped | 0 |
| Total | 89 |
