elastic / elastic-integration-corpus-generator-tool

Command line tool used for generating events corpus dynamically given a specific integration

Adding initial template for kubernetes Pod #88

Closed gizas closed 1 year ago

gizas commented 1 year ago

I have created the initial template for the kubernetes pod datastream. This template produces data ready to be ingested by Rally tracks. The files below compare the generated output with real data extracted from a cluster:

GENERATED-gotext.json.txt

Initial_from_realPOD.json.txt

Command to run:

./elastic-integration-corpus-generator-tool generate-with-template ./assets/templates/kubernetes.pod/gotext.tpl ./assets/templates/kubernetes.pod/fields.yml -c ./assets/templates/kubernetes.pod/configs.yml -y gotext -t 1000

Features to implement:

Findings after testing:

  1. Timestamps of the generated data are spread within 1 hour of the time the tool is triggered. The generated data can still be used for indexing, but visualisations return empty responses when a query covers a time window larger than 1h. The next step is to try to create the timestamps with range (or cardinality) functions and spread them over multiple hours.

  2. The generator tool produces output in multiple lines, whereas the Rally tool needs each entry on one line. For now this issue is minor, as we provide our templates as one-liners: https://github.com/elastic/elastic-integration-corpus-generator-tool/pull/88/files#diff-44eae17c43b58d9d956a9c89b53eed2aa72ef46ad7e10400f26a0a61cd22ccfcR17

    • The elastic-package dump command, which will use the generator tool, will also dump the data on one line. This needs to be tested.
  3. For every Rally run we need to generate the corpus data (e.g. https://github.com/elastic/rally-tracks/pull/373/files#diff-dbafff74aad306950d4c38f30c7612f06cae89395c58311d54ca26a2c374fc03R52) and the mappings of the indices we test (e.g. https://github.com/elastic/rally-tracks/pull/373/files#diff-0b2bc88dee0704c8bae38dbe5719417945216348c62273f09940d1afcb7a7eea). The need for corpus generation is met here, but we still don't have a fully automated way to generate the mapping templates for the package version under test. We assume that mappings don't change often and can be extracted from any given cluster, but a manual step is still needed. This issue is not a blocker.
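As an illustration of finding 1, the timestamps could also be rewritten after generation, outside the tool. The following is only a minimal sketch under stated assumptions: each generated document is a JSON object on its own line, the timestamp lives in an `@timestamp` field, and `spread_timestamps` is a hypothetical helper that is not part of the generator tool.

```python
import json
from datetime import datetime, timedelta, timezone


def spread_timestamps(lines, end, window_hours):
    """Rewrite @timestamp so documents are evenly spread over the window ending at `end`.

    Hypothetical post-processing helper; the field name `@timestamp` is an assumption.
    """
    docs = [json.loads(line) for line in lines if line.strip()]
    if not docs:
        return []
    window = timedelta(hours=window_hours)
    step = window / len(docs)  # equal spacing between consecutive documents
    start = end - window
    out = []
    for i, doc in enumerate(docs):
        ts = start + step * i
        doc["@timestamp"] = ts.isoformat(timespec="milliseconds")
        # Compact separators keep each document on a single line.
        out.append(json.dumps(doc, separators=(",", ":")))
    return out
```

Called with `end=datetime.now(timezone.utc)` and, say, `window_hours=24`, this would give dashboards data for a full day instead of a single hour.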

elasticmachine commented 1 year ago

:green_heart: Build Succeeded


#### Build stats

* Start Time: 2023-05-03T00:34:49.104+0000
* Duration: 3 min 42 sec

#### Test stats :test_tube:

| Test | Results |
| ------------ | :-----------------------------: |
| Failed | 0 |
| Passed | 85 |
| Skipped | 0 |
| Total | 85 |

:robot: GitHub comments


To re-run your PR in the CI, just comment with:

- `/test` : Re-trigger the build.

gizas commented 1 year ago

Pod and container templates have been created:

For container datastream:

./elastic-integration-corpus-generator-tool generate-with-template ./assets/templates/kubernetes.container/gotext.tpl ./assets/templates/kubernetes.container/fields.yml -c ./assets/templates/kubernetes.container/configs.yml -y gotext -t 1000

For pod:

./elastic-integration-corpus-generator-tool generate-with-template ./assets/templates/kubernetes.pod/gotext.tpl ./assets/templates/kubernetes.pod/fields.yml -c ./assets/templates/kubernetes.pod/configs.yml -y gotext -t 1000

aspacca commented 1 year ago

/test

aspacca commented 1 year ago

@gizas all good here?

gizas commented 1 year ago

> @gizas all good here?

Yes, my tests are OK and I have managed to produce the expected output. I will open this PR once I have fully tested it with a Rally track. I need to find time to run a full test and will come back to you.

gizas commented 1 year ago

@aspacca , @martijnvg FYI: https://github.com/elastic/observability-dev/blob/generatortool/docs/infraobs/cloudnative-monitoring/dev-docs/elastic-generator-tool-with-rally.md

I have created the needed corpus, and generating 55G of files takes less than a couple of minutes! Let's see it in action.

aspacca commented 1 year ago

@gizas we merged into master a change to the format of the fields generation config file, from

- name: cloud.availability_zone
  value: "europe-west1-d"
- name: agent.id
  value: "12f376ef-5186-4e8b-a175-70f1140a8f30"

to

fields:
  - name: cloud.availability_zone
    value: "europe-west1-d"
  - name: agent.id
    value: "12f376ef-5186-4e8b-a175-70f1140a8f30"
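For existing config files, the change amounts to nesting the old top-level list under a `fields:` key. A minimal text-based sketch of that migration follows; `migrate_config` is a hypothetical helper (not something shipped with the tool), and it assumes the old file contains nothing but the list of field entries.

```python
def migrate_config(old_text):
    """Nest a top-level list of field entries under a `fields:` key.

    Hypothetical helper: assumes the old file holds only the list of entries.
    """
    lines = old_text.splitlines()
    # A top-level `fields:` line means the file already uses the new format.
    if any(line.rstrip() == "fields:" for line in lines):
        return old_text
    # Indent every non-empty line by two spaces so the list becomes a child of `fields:`.
    indented = ["  " + line.rstrip() if line.strip() else "" for line in lines]
    return "\n".join(["fields:"] + indented) + "\n"
```

Working on plain text avoids a YAML dependency and preserves comments and quoting, at the cost of not validating the file.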

we'll later introduce something like

formatter:
  - strip_newlines
fields:

so that you no longer need to duplicate the templates, as you do in this PR, and can keep a single "pretty-printed" one that is emitted with newlines stripped

please feel free to suggest what the formatter "concept" should look like, thanks
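Until such a formatter exists, the same effect could be approximated by post-processing the generated output. The sketch below is only an illustration of the idea and not the planned `strip_newlines` implementation: it assumes the generated corpus is a sequence of (possibly pretty-printed) JSON documents, and `to_ndjson` is a hypothetical name.

```python
import json


def to_ndjson(text):
    """Yield each JSON document found in `text` as a single compact line.

    Hypothetical helper; assumes `text` is a concatenation of JSON documents.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        doc, pos = decoder.raw_decode(text, pos)
        yield json.dumps(doc, separators=(",", ":"))
```

Writing `"\n".join(to_ndjson(generated))` to a file would give the one-document-per-line layout that Rally expects, while the template itself stays readable.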

aspacca commented 1 year ago

https://github.com/elastic/observability-dev/blob/generatortool/docs/infraobs/cloudnative-monitoring/dev-docs/elastic-generator-tool-with-rally.md

👍

please, let's align on what's currently available in elastic-package: https://github.com/elastic/elastic-package/blob/main/docs/howto/generate_corpus.md#generate-a-rally-track-for-a-package-dataset-and-run-a-rally-benchmark

gizas commented 1 year ago

> Next steps is to try to create the timestamps by using range (or cardinality) functions and spread them to multiple hours

I have addressed this issue by producing the timestamps manually.

Also, a related change introduces the period flag, which can be used in the future: https://github.com/elastic/elastic-integration-corpus-generator-tool/commit/62a8465ba70aab20191b25dca6e6d797a7ab60fe

> The generator tool produces output in multiple lines. The rally tool needs each entry in one line.

We have created one-liner versions of the templates in this PR.

Also, the elastic-package-benchmark-generate-corpus command will output one line per doc entry.

Generation of Rally templates (as part of the generator tool)

This is not needed anymore: I have tested the `elastic-package dump installed-objects --package kubernetes` command, which can extract index templates. The relevant instructions have been added to the README of the TSDB2 Rally track.

aspacca commented 1 year ago

> 1. Timestamps of generated data are spread within 1 hour from the time of the triggering the tool. Although generated data can be used for indexing, for the needs of visualisations return empty responses if queries are for time window larger than 1h. Next steps is to try to create the timestamps by using range (or cardinality) functions and spread them to multiple hours

this should be addressed by #95 :)

> 2. The generator tool produces output in multiple lines. The rally tool needs each entry in one line. For now this issue is minor as we provide our templates as one liners: https://github.com/elastic/elastic-integration-corpus-generator-tool/pull/88/files#diff-44eae17c43b58d9d956a9c89b53eed2aa72ef46ad7e10400f26a0a61cd22ccfcR17
>     • Also elastic-package tool dump command that will use the generator tool will dump the data in one line! Needs to be tested

yes, elastic-package does it. However, elastic-package still uses v0.5.0 of the corpus generator, so the assets in this PR won't be compatible for the moment. See the comment on the next point for further details.

> 3. For every Rally run we need to generate the corpus data (eg. https://github.com/elastic/rally-tracks/pull/373/files#diff-dbafff74aad306950d4c38f30c7612f06cae89395c58311d54ca26a2c374fc03R52) and the mappings of the indices we test (eg. https://github.com/elastic/rally-tracks/pull/373/files#diff-0b2bc88dee0704c8bae38dbe5719417945216348c62273f09940d1afcb7a7eea). The need of corpus generation is matched here, but we still dont have a fully automated way to generate the mapping templates based on the relevant package version we test every time. We make the assumption that mappings dont change so often and can be extracted from any given cluster but still manual process is needed. Issue is not a blocker

similarly to what's done in this elastic-package PR, we'll add a benchmark rally command that will install the relevant assets of the local package. this will come with releasing v0.6.0 of the corpus generator and upgrading the dependency in elastic-package