Closed by ruflin 9 months ago
Pinging @marc-gr as he was thinking also about this problem in the context of pipelines benchmarking.
For the generation of metrics, I suggest taking strong inspiration from the elastic-integration-corpus-generator-tool
I'd like to give some context about the rationale when developing the tool. I had two main goals:
fields.yml
Satisfying point 1, at the current status, implies the generated data is post-ingest, but nothing prevents us from expanding package-spec to include a definition of the pre-ingest schema. I've considered alternatives to this, from "playing backwards" the ingest pipelines to analysing a sample of pre-ingest data, and the more I look at it the more I'm convinced that making space in package-spec for such a schema is the preferable solution.
I still have to look into the spigot tool by @leehinman. I imagine the pre-ingest schema as a common solution for both metrics and logs, borrowing (as it is, or from scratch) the possibility to define cardinality and fuzziness.
Ideally a third "tool", part of elastic package, can be developed to extract cardinality and fuzziness from existing ingested data in a cluster in order to initially feed the two above.
> Satisfying point 1, at the current status, implies the generated data is post-ingest, but nothing prevents us from expanding package-spec to include a definition of the pre-ingest schema.
I was initially thinking that especially in the metrics case, this should not be too big of an issue. But we are also moving more processing around metrics to ingest pipelines instead of Elastic Agent so the metrics data coming in might look quite a bit different. Something to investigate further.
I don't know if this needs to be part of the `elastic-package` code base. I'm wondering if it could be something like https://github.com/elastic/stream/ where we supply a config and it is spun up as a Docker instance during `elastic-package test`.
Ideally it would be nice if the sample generation tool could be run several ways:
1) in the `elastic-package test` environment
2) as an "input" that `elastic-agent` could deploy, so customers could generate data for integrations (demo or testing use case)
3) stand-alone, so you could generate rally tracks, do local testing, etc.
> I don't know if this needs to be part of the `elastic-package` code base.
I agree with this in terms of separate repos that provide both standalone CLI commands and packages to be consumed directly from `elastic-package`, without the need to wrap them in a separate process.
With vpcflow we found a "problem" with the logs that spigot generated. The vpcflow data was fine, but if we made a rally track out of that data, it was missing fields that the filebeat awss3 input adds to the event. Without those fields we couldn't reproduce the issue.
So I think we really want to have one or more things that can generate logs, and then for both filebeat & logstash have outputs that make rally tracks. That way we don't have to duplicate the fields added by inputs, and we don't have to duplicate any user-defined processors either; filebeat or logstash will do that like they do in production. This has another benefit: even if we don't have a log generation tool, we can capture real data and make a rally track.
> The vpcflow data was fine, but if we made a rally track out of that data, it was missing fields that the filebeat awss3 input adds to the event. So I think we really want to have one or more things that can generate logs, and then for both filebeat & logstash have outputs that make rally tracks.
I guess the missing fields could be generated by https://github.com/elastic/elastic-integration-corpus-generator-tool, since they should be part of `fields.yml`, except for `_id`.
The tool at the moment can only generate post-ingest-pipeline documents, but with some `jq` post-processing I was able to generate the source vpcflow logs. If we defined some spec for generating pre-ingest-pipeline documents back from post-ingest ones, it should not be a big deal to incorporate this feature in the tool instead of relying on external post-processing.
please let me know if you'd like to work together on the topic :)
> I guess the missing fields could be generated by https://github.com/elastic/elastic-integration-corpus-generator-tool, since they should be part of `fields.yml`, except for `_id`.

The `fields.yml` files are getting more complete, but there will always be gaps. Maybe if we could mix in additional fields that we find are missing, that might take care of the gaps.

> The tool at the moment can only generate post-ingest-pipeline documents, but with some `jq` post-processing I was able to generate the source vpcflow logs. If we defined some spec for generating pre-ingest-pipeline documents back from post-ingest ones, it should not be a big deal to incorporate this feature in the tool instead of relying on external post-processing.
vpcflow processing is pretty minimal; some other integrations will make putting the original back together from the result very complicated. Others, like Cloudtrail, would be pretty easy to re-assemble from the results.
> please let me know if you'd like to work together on the topic :)
I'm definitely interested in making sure we can generate source documents for all of our integrations.
Even if we get the corpus generator to be nearly perfect, I'm still in favor of providing a turn key way for customers to generate rally tracks from their own data. That way we can run tests with the exact data that is causing the problem.
> I'm still in favor of providing a turn key way for customers to generate rally tracks from their own data
Do we need them to generate rally tracks, or is just capturing the raw events enough and we could post-process them into a rally track ourselves? I wonder if adding the ability to tee events to both a file and the actual target output would help here.
I definitely like the idea of having something built into Agent that users can enable in production to give us the exact events they are experiencing issues with. I think this will eliminate a lot of back and forth and wasted time because we could have exactly the data that is causing problems, with possible caveats like having to sanitize out personally identifiable information or credentials.
> Do we need them to generate rally tracks, or is just capturing the raw events enough and we could post-process them into a rally track ourselves? I wonder if adding the ability to tee events to both a file and the actual target output would help here.
Doesn't have to be rally, but we should have a simple way of converting to rally. We could probably have a small utility that takes the existing file output and turns that into rally tracks.
I really like the tee idea. We would need to ignore ACKs from secondary output, and make sure that a slow secondary output doesn't slow down the primary output.
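The tee-with-a-slow-secondary concern above can be sketched minimally: feed the secondary output through a bounded queue and drop copies when it is full, so the primary path never blocks. This is only an illustrative sketch (the function and names are hypothetical, not an actual Beats feature):

```python
import queue

def tee(events, primary, secondary_queue):
    """Copy every event to the primary output and, best-effort, to a
    bounded queue feeding the secondary output. When the secondary is
    slow (queue full) the copy is dropped instead of blocking the
    primary path; ACKs would only be tracked for the primary."""
    dropped = 0
    for event in events:
        primary.append(event)                  # primary delivery always happens
        try:
            secondary_queue.put_nowait(event)  # non-blocking secondary copy
        except queue.Full:
            dropped += 1                       # slow secondary: drop, don't stall
    return dropped
```

With a queue of size 2 and 5 events, the primary still receives all 5 while 3 secondary copies are dropped, which is exactly the isolation property described above.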
Not certain if it offers anything beyond other tools mentioned above, but there also exists Logen for generating logs https://github.com/elastic/logen https://docs.google.com/presentation/d/1I2ZKQo-Rbr18l05Lrp-lnUk-vIM3ZRpnb-xjaPdxShQ/edit#slide=id.p1
Talking of tools, there is also elastic/geneve. We don't have a good summary of what it does though there are some technical docs at https://github.com/elastic/geneve/tree/main/docs and https://github.com/elastic/geneve/tree/main/tests/reports.
The juice is that you describe (in a so-called data model) what kind of documents you need, and then Geneve will generate as many as you want. A data model can be as simple as "these fields need to be present" or something more complex like "the documents need to have this relation: the first doc has some content in `process.name`, the second doc has `process.parent.name` set to whatever was generated for the first one".
Geneve was born to generate documents that would trigger detection rules in the Security app, but nothing forbids describing other kinds of field/document relations and using the generated documents for other purposes. Indeed, we are currently working with the Analyst Experience team to help them fill their stack with data in a flexible way; this would allow them to use and develop Kibana in ways that are currently not easily feasible.
An example of a data model is:

```
sequence by host.id with maxspan=1m
  [file where event.type != "deletion" and file.path in ("/System/Library/LaunchDaemons/*", "/Library/LaunchDaemons/*")]
  [process where event.type in ("start", "process_started") and process.name == "launchctl" and process.args == "load"]
```
Example of the four (*) pairs of documents that can be generated:

```python
[{'event': {'type': ['ZFy'], 'category': ['file']}, 'file': {'path': '/System/Library/LaunchDaemons/UyyFjSvILOoOHmx'}, 'host': {'id': 'BnL'}, '@timestamp': 0},
 {'event': {'type': ['start'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'BnL'}, '@timestamp': 1},
 {'event': {'type': ['eOA'], 'category': ['file']}, 'file': {'path': '/System/Library/LaunchDaemons/gaiFqsyzKNyyQ'}, 'host': {'id': 'DpU'}, '@timestamp': 2},
 {'event': {'type': ['process_started'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'DpU'}, '@timestamp': 3},
 {'event': {'type': ['EUD'], 'category': ['file']}, 'file': {'path': '/Library/LaunchDaemons/xVTOLWtimrFgT'}, 'host': {'id': 'msh'}, '@timestamp': 4},
 {'event': {'type': ['start'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'msh'}, '@timestamp': 5},
 {'event': {'type': ['CeL'], 'category': ['file']}, 'file': {'path': '/Library/LaunchDaemons/L'}, 'host': {'id': 'Sjo'}, '@timestamp': 6},
 {'event': {'type': ['process_started'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'Sjo'}, '@timestamp': 7}]
```
* Why four pairs? Because the model above has four branches and Geneve can explore all of them individually.
In principle Geneve is a "constraints solver"; the data model is a way to describe constraints for the data generation process. Relations between fields/documents are constraints on the otherwise completely open solution space from which Geneve draws its "solutions".
When the solution space is empty, conflicting constraints are present (e.g. `destination.port == 22 and destination.port in (80, 443)`), no solution can be found, and an error is reported. This is a very useful way to detect queries that cannot possibly ever be satisfied by any dataset.
TBC (in some more suitable place)
In the last few weeks, work has been done on improving the elastic-integration-corpus-generator-tool and trying out multiple template approaches: https://github.com/elastic/elastic-integration-corpus-generator-tool/issues/39. Even though this work is not completed, I'm putting together here a more concrete proposal on how all the pieces could work together to build an end-to-end experience around elastic-package. Below I bring up very specific examples, but the exact names are less important than the concepts. If we go down the path of implementation, the details will likely change.
When collecting data with Elastic Agent and shipping to Elasticsearch, there are 4 different data schemas. This is important as the data schemas look different and we must align during generation on what data schemas we are talking about. In the diagram below, the schemas A, B, C and D are shown:
```mermaid
flowchart LR
    D[Data/Endpoint]
    EA[Elastic Agent]
    ES[Elasticsearch]
    IP[Ingest Pipeline]
    D -->|Schema A| EA -->|Schema B| IP -->|Schema C| ES -->|Schema D| Query;
```
the `message` field and meta information around the host was added to the event. Schema C is the one that is defined in integration packages in the `fields.yml` files for each data stream. In the following, our focus is on Schema B, with mentions of Schema A.
Schema B is always in JSON format and is the output generated by Elastic Agent. It contains the meta information about the event itself, like host or k8s metadata. As Schema B is in JSON format, shipping it to Elasticsearch could in theory be done with a curl request taking the JSON doc as the request body. By sending it to the correct data stream, processing would also happen and the data would be persisted.
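The "ship Schema B JSON straight to a data stream" idea can be sketched by building an Elasticsearch `_bulk` request body by hand (data streams only accept the `create` action; the data-stream name here is just an example):

```python
import json

def bulk_payload(data_stream, docs):
    """Build the NDJSON body of an Elasticsearch _bulk request targeting a
    data stream: one `create` action line per document, followed by the
    document itself, terminated by a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": data_stream}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

The resulting payload would be POSTed to `<es-url>/_bulk` with `Content-Type: application/x-ndjson`; an equivalent curl call is what the paragraph above alludes to.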
The elastic-integration-corpus-generator-tool has ways to generate data based on some config options and templates. In https://github.com/elastic/elastic-integration-corpus-generator-tool/issues/39 multiple approaches to templating are discussed. What they all have in common:
What I skipped above is the fields definition for Elasticsearch, which is contained in the tool but is not needed in the context of packages, as Schema C is already defined as part of the package. What is needed in addition is a configuration file for the data generator to decide how much data should be generated, the time range, etc. In the tool this is currently done through command line parameters.
The assumption is that for a single dataset in an integration package, different scenarios could be generated. Let's take package `foo` with dataset `bar` as an example. The following files would exist:
- `foo/data_stream/bar/_dev/data_generation/config.yml`
- `foo/data_stream/bar/_dev/data_generation/template1.tmpl`
- `foo/data_stream/bar/_dev/data_generation/template2.tmpl`
- `foo/data_stream/bar/_dev/data_generation/template1-config.yml`
- `foo/data_stream/bar/_dev/data_generation/template2-config.yml`
In the example above, two templates, each with a config file, are used. The `.tmpl` file contains the JSON template for the event, and `template1-config.yml` contains the definition of the fields for the template. It would also be possible to have just one definition for each template.
The `config.yml` contains a list of scenarios that should be generated. It could look similar to:
```yaml
data_generation:
  - name: short-sample
    timerange: 2d
    events: 1000
    template: template1
    # See spigot for more options https://github.com/leehinman/spigot
    output: elasticsearch
  - name: middle-sample
    timerange: 10d
    events: 10000
    template: template1
    output: elasticsearch
  - name: large-sample
    timerange: 2d
    events: 1000000
    template: template2
    output: rally
```
More config options could be added. The goal is to show that multiple data generations can be configured. Having all the setup done, elastic-package can be used to generate the data:
```
elastic-package data generate --package=foo --dataset=bar --name=large-sample
```
The parameters are optional. If the command is run inside a package, it would by default apply to all datasets and all tasks, or one can be selected. As can be seen in the above example, an output format can also be specified. The data can be stored in rally track format or sent to Elasticsearch directly.
Behind the scenes, elastic-integration-corpus-generator-tool is used to generate the events out of the templates, and spigot to generate the relevant outputs.
The generation of schema A would look very similar. But to ship schema A to Elasticsearch, a running Elastic Agent is needed. Similar to schema B, package-spec could contain config options on how to generate it. It would require in addition some logic around how schema A is collected to run an Elastic Agent for collection of it. There is a good chance, all these configs already exist in the data stream and can be used.
Generation of Schema A I see as only a second priority of this project.
One of the goals of the generation of Schema B is to be able to create rally-track-compatible data. To create a full rally track, it is also required to export templates and ingest pipelines from the package. `elastic-package` already has an `export` command that can be used for this.
In an ideal scenario, a user could run `elastic-package benchmark --dataset=foo` and, behind the scenes, data would be generated, a rally track created, setup done with esbench, pipelines and templates loaded, data ingested to Elasticsearch through the pipeline, and measurements provided on the performance of this run. At first, there might be some additional manual steps required.
Some thoughts on the benchmark topics, but don't want to sidetrack the discussion, maybe we can sync offline.
> In an ideal scenario, a user could run elastic-package benchmark --dataset=foo and behind the scenes, data would be generated, rally track created, setup is done with esbench, pipelines and templates are loaded and data is ingested to Elasticsearch through the pipeline and measurements are provided on the performance for this run. At first, there might be some additional manual steps required.
It would be nice to have `elastic-package` orchestrate whatever is required to execute a benchmark in the context of the integration itself, meaning: start the agent, configure the agent, configure the remote ES (index templates, pipelines, ...), monitor the agent, tear down the agent, collect results. Having `elastic-package` focused only on the component, and not orchestrating a more complicated full environment, would be a good starting point.
Along these lines, I drafted this for my own reference, so it is not intended to be complete but just a bigger picture of how this would look for cases such as the Schema A scenarios mentioned above.
So ideally we would have everything required to run benchmarks self-contained in elastic-package and as part of the integration definitions, and esbench would be more of a description for a more permanent benchmark setup, as it is today for other cases, if I am not mistaken.
Following up on what @marc-gr wrote, we can also consider that data generation and data usage may not necessarily happen consecutively, and keep this decoupling in mind. The benefit would be that generator tools would be able to generate and store data in a generic storage, and a tool that leverages those data would be able to "replay" them without the generation step, which may be compute-intensive (thus either being bounded by compute resources or requiring extensive resources to run at the desired scale). (As discussed with @ruflin, some tools, like https://github.com/elastic/rally, already support loading from S3.)
> we can also consider that data generation and data usage may not necessarily happen consecutively and keep this decoupling in mind
++, we should not only consider it but make sure it is decoupled. I expect that by default, when data generation is used, the data is written to disk (in some format, maybe rally format?). That doesn't mean there can't eventually be commands that bring it all together in one flow.
Status update:
- Schema X data

Next steps (priority order to be defined):
- `_dev/data_generation/schema` generation proposal from @ruflin
- `elastic-package` command for generating data based on a template ("hook" to `genlib.NewGeneratorWithCustomTemplate`)
- `elastic-package` command for benchmark (as for Rally track generation proposal from @ruflin)

Thank you @aspacca !
I have an additional question about `elastic-integration-corpus-generator-tool`: could it be used to resolve an issue where, in a fresh cluster install, before the first alert is generated, the `.alerts-security.alerts-default` index doesn't exist? For example, an integration package that has a transform that reads from that index will error with `no such index [.alerts-security.alerts-default]` until we go in and make sure an alert is created. Example: the CI job failing on this PR.
hi @susan-shu-c

> I have an additional question about `elastic-integration-corpus-generator-tool`: could it be used to resolve an issue where, in a fresh cluster install, before the first alert is generated, the `.alerts-security.alerts-default` index doesn't exist? For example, an integration package that has a transform that reads from that index will error with `no such index [.alerts-security.alerts-default]`
`elastic-integration-corpus-generator-tool` just generates data on the machine you run it on: nothing prevents you from generating a template that contains the payload of a bulk request for `.alerts-security.alerts-default`, but you still have to send it to the cluster. You would have to add this to your CI/wherever as a preemptive step.

Unless you need to generate a lot of data with random content, there's probably no need to make use of `elastic-integration-corpus-generator-tool`, since what would solve the issue in CI (as far as I understand the issue) is ingesting documents into `.alerts-security.alerts-default`.
On the broader scope of having an `elastic-package benchmark` command, your case is anyway something that has to be addressed, because potentially you should be able to run something like `elastic-package benchmark host_risk_score`. I'm not familiar with security integrations: I see, for example, that `host_risk_score` does not have a `./data_stream/billing/fields` folder; that's where `elastic-integration-corpus-generator-tool` gets the information in order to generate "Schema C" data.
cc @leehinman
Updates on the work being done. The first iteration is in progress; we are focusing on adding templates for `aws.ec2_logs`, `aws.ec2_metrics`, `aws.billing`, `aws.sqs`, and on integrating generation with `elastic-package`.
Week 5 (Jan 30 - Feb 2): PRs:
Ongoing discussions:
We are still in the first iteration, with the same goals. Progress has been slow in the last week due to SDH duties.
Week 6 (Feb 6-10): PR opened:
Week 7 (Feb 13-17): PR Merged:
The week is still in progress; we expect to open a new PR with the `aws.ec2_logs` template and to merge the `aws.billing` template.
Week 9 (Feb 27 - Mar 2):
New PRs:
Merged PRs:
The first iteration is still ongoing. Progress has been made in `elastic-package` and on templates.
Week 10 (Mar 6 - 10):
Merged:
We are only missing one template to close the first iteration.
Discussions:
New:
Merged: none
Today we conclude the first iteration. All planned templates are available.
Merged:
Releases: v0.5.0
Related:
During the first iteration we created Schema B templates for `aws.ec2_logs`, `aws.ec2_metrics`, `aws.billing`, `aws.sqs` and `k8s`.
- Library refactoring
- Adding `benchmark rally` command to `elastic-package` (PR needs to be created, est. 1 day)

Regarding the above: the package-spec PR has been merged, and we are now waiting for a new release; before doing that we have this pending PR (under review).
All the PR dependencies have been merged. @aspacca, could you kick off the next steps, please?
@aspacca Could you share an update on where we are today and what the next and remaining steps are, please?
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, `elastic-package` issue]
- `elastic-package` [issue]
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]

- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, `elastic-package` issue]. No work left, blocked by package-spec@v3
- `elastic-package` [issue]. Est. 5 days of coding, with potential external dependencies
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies

> no work left, blocked by package-spec@v3
@aspacca Can you share some more details on what part of v3 this is blocked on? Anything we can do on our end to get this unblocked?
Duplicating schemas: will the schemas change in some way when moving to elastic-package? Context for my asking: after moving 1-2 datasets as examples, could the teams themselves move the assets over?
> Can you share some more details on what part of v3 this is blocked on? Anything we can do on our end to get this unblocked?
We dropped a deprecated field and renamed the new one, taking the occasion of the breaking change in v3 (no one but `package-spec` and tests in `elastic-package` is using the spec for system/rally benchmark yet, so we didn't need to support a migration path).
The PR is blocked waiting for the spec changes described above to be merged in v3, in order to pass CI.
> Duplicating schemas: will the schemas change in some way when moving to elastic-package? Context for my asking: after moving 1-2 datasets as examples, could the teams themselves move the assets over?
No, they won't change, apart from reviewing the flattened-objects notation that I personally forgot to consider when creating the schemas in the first place. The teams would then be independent in duplicating/migrating the schemas.
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR #1~, ~`package-spec` PR #2~, `elastic-package` issue, `elastic-package` PR]. No work left, waiting for CI to be green with the release of package-spec@v3
- `elastic-package` [issue]. Est. 5 days of coding, with potential external dependencies
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, ~`elastic-package` issue~, ~`elastic-package` PR~].
- `elastic-package` [issue, PR]. PR waiting for review.
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, ~`elastic-package` issue~, ~`elastic-package` PR~].
- `elastic-package` [issue, PR]. PR in review.
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, ~`elastic-package` issue~, ~`elastic-package` PR~].
- `elastic-package` [~issue~, ~PR~].
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 2/6 through

- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 6/6 through, waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR]. Waiting for review
- `(now - period) + ((period / tot events) * nth event)` [PR]
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 3/6 merged, 3/6 waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR]. Ready to merge
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]. Waiting for review
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 5/6 merged, 1/6 waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR].
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]. Waiting for review
- `elastic-package benchmark rally`: support install package from registry and local corpus. Est. 1 day
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 5/6 merged, 1/6 waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR].
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]. Waiting for review
- `elastic-package benchmark rally`: support install package from registry and local corpus [PR]. Waiting for review
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue].
- `benchmark stream` command [issue, PR]. Waiting for review
- `benchmark rally` command [issue, PR].
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]
- `elastic-package benchmark rally`: support install package from registry and local corpus [PR]. Waiting for review
- `benchmark rally` command [issue]

The assets (templates, `fields.yml` and `config.yml`) were generated for the following datasets in the integrations repo:
Continuous refinement of some existing assets is ongoing, and new assets for new datasets are continuously added.
In the `elastic-package` repo:
- `elastic-package benchmark rally`, in order to generate and run a rally track from the root folder of an integration for a specific dataset. Several options are provided, like only generating the rally track with the related corpus, persisting the rally track and the related corpus, or replaying an existing generated rally track with the related corpus.
- `elastic-package benchmark stream`, in order to stream ingestion to an ES cluster from the root folder of an integration for one or multiple datasets at once. An option to backfill events for a configurable amount of time before the command is run is provided.

Further enhancements are already planned, like decoupling the location the commands need to be launched from (the root folder of an integration), improving the automation experience according to relevant audiences, and an internal refactoring of the existing duplicated code, among other things but not limited to those.
In the `elastic-integration-corpus-generator-tool` repo:
- `seed` for the rand package, and the time to be used as `Time.Now()`, in order to generate reproducible content
- `range.from/to` for date fields, in order to set time bounds on the generated values (similar to numeric `range.min/max`)
- `@timestamp` field

Adding support for the counter numeric field is already ongoing, and a big refactoring of the configuration around `cardinality` is planned and already designed; it will be the next implementation. This refactoring is a breaking change we deemed necessary, and its highest priority is given exactly by the fact that if we proceed with it now, the impact will be fairly reduced.
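Two of the generator features listed above can be illustrated with a short sketch: seeding for reproducible content, and the timestamp spacing formula `(now - period) + ((period / tot events) * nth event)` mentioned in the status updates. The function names here are illustrative, not the tool's actual API:

```python
import random

def event_timestamps(now, period, tot_events):
    """Spread tot_events timestamps evenly over [now - period, now),
    following (now - period) + ((period / tot_events) * nth_event)."""
    start = now - period
    step = period / tot_events
    return [start + step * nth for nth in range(tot_events)]

def reproducible_values(seed, n):
    """Seeding the random generator makes the produced corpus reproducible:
    the same seed always yields the same sequence of values."""
    rng = random.Random(seed)
    return [rng.randint(0, 9999) for _ in range(n)]
```

For example, 4 events over a 100-second period ending at t=1000 land at t=900, 925, 950 and 975, so a backfilled corpus covers the whole time range evenly.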
We identified the following areas of ownership:
When building integration packages, sample data is important to develop ingest pipelines and build dashboards. Unfortunately, in most cases real sample data is limited and often tricky to produce. This issue proposes a tool as part of elastic-package that can generate and load sample data.
Important: The following is only an initial proposal to better explain the problem and share existing ideas. A proper design is still required.
Why part of elastic-package
Generating sample data is not a new problem, and there are several tools which already provide partial solutions. A tool to generate sample data in elastic-package is needed to make it available in a simple way to each package developer. How sample data should look and be generated becomes part of the package spec. This way, someone building a package also directly gets the possibility of generating sample data and using it as part of the developer experience.
Data generation - metrics / logs
For data generation, two different types of data exist. Metrics and traces are mostly already in the format that will be ingested into Elasticsearch and require very little processing. Logs, on the other hand, often come as raw messages and require ingest pipelines or runtime fields to structure the data. The goal is for the tool to generate both types of data, but this can happen in iterations.
Metrics generation
For the generation of metrics, I suggest taking strong inspiration from the elastic-integration-corpus-generator-tool built by @aspacca. Instead of having to build separate config files, the config params for each field would be directly in the `fields.yml` of each data stream. The definition could look similar to the following; the exact syntax for each field type needs definition.
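Purely as a hypothetical illustration of the idea (the `generation` key and its sub-fields are assumptions, not part of any current spec), per-field config params embedded in `fields.yml` might look like:

```yaml
# Hypothetical sketch: the `generation` key is an assumption,
# not the actual package-spec syntax.
- name: aws.billing.estimated_charges
  type: double
  generation:
    range:
      min: 0
      max: 5000
- name: host.name
  type: keyword
  generation:
    cardinality: 100   # number of distinct values to draw from
```

The point is only that generation hints would live next to the field definitions instead of in separate config files.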
Logs generation
For logs generation, inspiration can be taken from the spigot tool by @leehinman. Ideally we could simplify this by allowing users to specify message patterns, something like `{@timestamp} {source.ip}`, and then specify for these fields what the values should be. The tool would then take over the generation of sample logs. Importantly, the log generation outputs message fields pre-ingest-pipeline.
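The pattern idea above can be sketched minimally: a map of per-field value generators and a renderer that fills `{field}` placeholders, producing raw pre-ingest-pipeline log lines. The generators here are hypothetical stand-ins; a real tool would derive them from the package configuration:

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical per-field generators; a real tool would build these from config.
FIELD_GENERATORS = {
    "@timestamp": lambda rng: (datetime(2023, 1, 1, tzinfo=timezone.utc)
                               + timedelta(seconds=rng.randint(0, 86400))).isoformat(),
    "source.ip": lambda rng: ".".join(str(rng.randint(1, 254)) for _ in range(4)),
}

def render_log(pattern, rng):
    """Fill a '{field}' pattern with generated values, producing a raw
    (pre-ingest-pipeline) log line."""
    line = pattern
    for field, gen in FIELD_GENERATORS.items():
        line = line.replace("{" + field + "}", gen(rng))
    return line
```

For example, `render_log("{@timestamp} {source.ip} GET /index.html", random.Random(0))` yields a timestamped line with a random IP, ready to be fed through an ingest pipeline.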
Generated data format
The proposed data structure generated by the tool is the one used by esrally. It contains 1 JSON doc per line with all the fields inside. This makes it simple to just deliver the data to Elasticsearch and makes it possible to potentially reuse some of this generated data with rally tracks.
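The "1 JSON doc per line" layout described above (often called NDJSON) can be sketched in a few lines; the writer below is only an illustration of the format, not rally's own tooling:

```python
import io
import json

def write_corpus(docs, out):
    """Write documents in the esrally corpus layout: one JSON object
    per line, with all fields inside the document."""
    for doc in docs:
        out.write(json.dumps(doc, separators=(",", ":")) + "\n")

buf = io.StringIO()
write_corpus([{"@timestamp": "2023-01-01T00:00:00Z", "message": "a"},
              {"@timestamp": "2023-01-01T00:00:01Z", "message": "b"}], buf)
```

Because each line is a standalone document, the same file can be streamed straight into an Elasticsearch bulk request or referenced as a corpus by a rally track.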
Non goals
A non-goal of the data generation and loading of data is to replace rally. Rally measures exact performance and builds reproducible benchmarks. When generating and loading data with elastic-package, it is about testing ingest pipelines, testing dashboards, and testing queries on larger sets of data in an easy way. The focus is on package development.
Another non-goal is to generate events that are related to each other. For some solutions it is important that if a `host.name` shows up, other parts of the data contain the same `host.name`, to be able to browse through the solution. This might be added at a later point but is not part of the scope.
Sample data storage
As the sample data can always be generated on the fly, it is not required to store it. If some of the sample data sets should be stored for later use, package-spec should provide a schema to reference sample datasets.
Command line
Command line arguments must be available to generate sample data for a dataset or a package and load it into Elasticsearch. Ideally, package-spec allows storing some config files describing which data sets can be generated, so a package developer can share these configs as part of the package.
Initial packages to start with
I recommend picking 2 initial packages to start with for the data generation. As k8s and AWS are both more complex packages that also generate lots of data, they could be a good start, focusing on the metrics part.
Future ideas