Closed by ruflin 9 months ago
Pinging @marc-gr as he was thinking also about this problem in the context of pipelines benchmarking.
For the generation of metrics, I suggest taking strong inspiration from the elastic-integration-corpus-generator-tool
I'd like to give some context about the rationale when developing the tool. I had two main goals:
fields.yml
Satisfying point 1, at the current status, implies the generated data is post-ingest, but nothing prevents us from expanding package-spec to include a definition of the pre-ingest schema. I've considered alternatives to this, from "playing backwards" the ingest pipelines to analysing a sample of pre-ingest data, and the more I look at it the more I'm convinced that making space in package-spec for such a schema is the preferable solution.
I still have to look into the spigot tool by @leehinman. I imagine the pre-ingest schema as a common solution for both metrics and logs, borrowing (as it is, or from scratch) the possibility to define cardinality and fuzziness.
Ideally a third "tool", part of elastic package, can be developed to extract cardinality and fuzziness from existing ingested data in a cluster in order to initially feed the two above.
> Satisfying point 1, at the current status, implies the generated data is post-ingest, but nothing prevents us from expanding package-spec to include a definition of the pre-ingest schema.
I was initially thinking that especially in the metrics case, this should not be too big of an issue. But we are also moving more processing around metrics to ingest pipelines instead of Elastic Agent so the metrics data coming in might look quite a bit different. Something to investigate further.
I don't know if this needs to be part of the `elastic-package` code base. I'm wondering if it could be something like https://github.com/elastic/stream/ where we supply a config and it is spun up as a Docker instance during `elastic-package test`.
Ideally it would be nice if the sample generation tool could be run several ways:
1) in the `elastic-package test` environment
2) as an "input" that `elastic-agent` could deploy, so customers could generate data for integrations (demo or testing use case)
3) stand-alone, so you could generate rally tracks, do local testing, etc.
> I don't know if this needs to be part of the `elastic-package` code base.
I agree with this in terms of separate repos that provide both standalone CLI commands and packages to be consumed directly from `elastic-package`, without the need to wrap them in a separate process.
With vpcflow we found a "problem" with the logs that spigot generated. The vpcflow data was fine, but if we made a rally track out of that data, it was missing fields that the filebeat awss3 input adds to the event. Without those fields we couldn't reproduce the issue.
So I think we really want to have one or more things that can generate logs, and then for both filebeat & logstash have outputs that make rally tracks. That way we don't have to duplicate the fields added by inputs, and we don't have to duplicate any user-defined processors either; filebeat or logstash will do that like they do in production. This has another benefit: even if we don't have a log generation tool, we can capture real data and make a rally track.
> The vpcflow data was fine, but if we made a rally track out of that data, it was missing fields that the filebeat awss3 input adds to the event. So I think we really want to have one or more things that can generate logs, and then for both filebeat & logstash have outputs that make rally tracks.
I guess the missing fields could be generated by https://github.com/elastic/elastic-integration-corpus-generator-tool, since they should be part of `fields.yml`, except for `_id`.
The tool at the moment can only generate post-ingest-pipeline documents, but with some `jq` post-processing I was able to generate the source vpcflow logs. If we defined some spec for generating pre-ingest-pipeline documents back from post-ingest ones, it should not be a big deal to incorporate this feature in the tool instead of relying on external post-processing.
please let me know if you'd like to work together on the topic :)
> I guess the missing fields could be generated by https://github.com/elastic/elastic-integration-corpus-generator-tool, since they should be part of `fields.yml`, except for `_id`.

The `fields.yml` files are getting more complete, but there will always be gaps. Maybe if we could mix in additional fields that we find are missing, that might take care of the gaps.

> The tool at the moment can only generate post-ingest-pipeline documents, but with some `jq` post-processing I was able to generate the source vpcflow logs. If we defined some spec for generating pre-ingest-pipeline documents back from post-ingest ones, it should not be a big deal to incorporate this feature in the tool instead of relying on external post-processing.
vpcflow processing is pretty minimal; some other integrations will make putting the original back together from the result very complicated. Others, like Cloudtrail, would be pretty easy to re-assemble from the results.
> please let me know if you'd like to work together on the topic :)
I'm definitely interested in making sure we can generate source documents for all of our integrations.
Even if we get the corpus generator to be nearly perfect, I'm still in favor of providing a turn key way for customers to generate rally tracks from their own data. That way we can run tests with the exact data that is causing the problem.
> I'm still in favor of providing a turn key way for customers to generate rally tracks from their own data
Do we need them to generate rally tracks, or is just capturing the raw events enough and we could post-process them into a rally track ourselves? I wonder if adding the ability to tee events to both a file and the actual target output would help here.
I definitely like the idea of having something built into Agent that users can enable in production to give us the exact events they are experiencing issues with. I think this will eliminate a lot of back and forth and wasted time because we could have exactly the data that is causing problems, with possible caveats like having to sanitize out personally identifiable information or credentials.
> Do we need them to generate rally tracks, or is just capturing the raw events enough and we could post-process them into a rally track ourselves? I wonder if adding the ability to tee events to both a file and the actual target output would help here.
Doesn't have to be rally, but we should have a simple way of converting to rally. We could probably have a small utility that takes the existing file output and turns that into rally tracks.
I really like the tee idea. We would need to ignore ACKs from secondary output, and make sure that a slow secondary output doesn't slow down the primary output.
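The tee-with-a-slow-secondary concern above can be sketched minimally: feed the secondary output through a bounded queue and drop copies when it is full, so the primary path never blocks. This is only an illustrative sketch (the function and names are hypothetical, not an actual Beats feature):

```python
import queue

def tee(events, primary, secondary_queue):
    """Copy every event to the primary output and, best-effort, to a
    bounded queue feeding the secondary output. When the secondary is
    slow (queue full) the copy is dropped instead of blocking the
    primary path; ACKs would only be tracked for the primary."""
    dropped = 0
    for event in events:
        primary.append(event)                  # primary delivery always happens
        try:
            secondary_queue.put_nowait(event)  # non-blocking secondary copy
        except queue.Full:
            dropped += 1                       # slow secondary: drop, don't stall
    return dropped
```

With a queue of size 2 and 5 events, the primary still receives all 5 while 3 secondary copies are dropped, which is exactly the isolation property described above.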
Not certain if it offers anything beyond other tools mentioned above, but there also exists Logen for generating logs https://github.com/elastic/logen https://docs.google.com/presentation/d/1I2ZKQo-Rbr18l05Lrp-lnUk-vIM3ZRpnb-xjaPdxShQ/edit#slide=id.p1
Talking of tools, there is also elastic/geneve. We don't have a good summary of what it does though there are some technical docs at https://github.com/elastic/geneve/tree/main/docs and https://github.com/elastic/geneve/tree/main/tests/reports.
The juice is that you describe (in a so-called data model) what kind of documents you need, and then Geneve will generate as many as you want. A data model can be as simple as "these fields need to be present" or something more complex like "the documents need to have this relation: the first doc has some content in `process.name`, the second doc has `process.parent.name` set to whatever was generated for the first one".
Geneve was born to generate documents that would trigger detection rules in the Security app, but nothing forbids describing other kinds of field/document relations and using the generated documents for other purposes. Indeed, we are currently working with the Analyst Experience team to help them fill their stack with data in a flexible way; this would allow them to use and develop Kibana in ways that are currently not easily feasible.
An example of a data model is:

```
sequence by host.id with maxspan=1m
  [file where event.type != "deletion" and file.path in ("/System/Library/LaunchDaemons/*", "/Library/LaunchDaemons/*")]
  [process where event.type in ("start", "process_started") and process.name == "launchctl" and process.args == "load"]
```
Example of the four (*) pairs of documents that can be generated:

```python
[{'event': {'type': ['ZFy'], 'category': ['file']}, 'file': {'path': '/System/Library/LaunchDaemons/UyyFjSvILOoOHmx'}, 'host': {'id': 'BnL'}, '@timestamp': 0},
 {'event': {'type': ['start'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'BnL'}, '@timestamp': 1},
 {'event': {'type': ['eOA'], 'category': ['file']}, 'file': {'path': '/System/Library/LaunchDaemons/gaiFqsyzKNyyQ'}, 'host': {'id': 'DpU'}, '@timestamp': 2},
 {'event': {'type': ['process_started'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'DpU'}, '@timestamp': 3},
 {'event': {'type': ['EUD'], 'category': ['file']}, 'file': {'path': '/Library/LaunchDaemons/xVTOLWtimrFgT'}, 'host': {'id': 'msh'}, '@timestamp': 4},
 {'event': {'type': ['start'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'msh'}, '@timestamp': 5},
 {'event': {'type': ['CeL'], 'category': ['file']}, 'file': {'path': '/Library/LaunchDaemons/L'}, 'host': {'id': 'Sjo'}, '@timestamp': 6},
 {'event': {'type': ['process_started'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'Sjo'}, '@timestamp': 7}]
```
* Why four pairs? Because the model above has four branches and Geneve can explore all of them individually.
In principle Geneve is a "constraints solver"; the data model is a way to describe constraints for the data generation process. Relations between fields/documents are constraints on the otherwise completely open solution space from which Geneve draws its "solutions".
When the solution space is empty, conflicting constraints are present (e.g. `destination.port == 22 and destination.port in (80, 443)`), no solution can be found, and an error is reported. This is a very useful way to detect queries that cannot possibly ever be satisfied by any dataset.
TBC (in some more suitable place)
In the last few weeks, work has been done on improving the elastic-integration-corpus-generator-tool and trying out multiple template approaches: https://github.com/elastic/elastic-integration-corpus-generator-tool/issues/39. Even though this work is not completed, I'm putting together here a more concrete proposal on how all the pieces could work together to build an end-to-end experience around elastic-package. Below I bring up very specific examples, but the exact names are less important than the concepts. If we go down the path of implementation, the details will likely change.
When collecting data with Elastic Agent and shipping to Elasticsearch, there are 4 different data schemas. This is important as the data schemas look different and we must align during generation on what data schemas we are talking about. In the diagram below, the schemas A, B, C and D are shown:
```mermaid
flowchart LR
    D[Data/Endpoint]
    EA[Elastic Agent]
    ES[Elasticsearch]
    IP[Ingest Pipeline]
    D -->|Schema A| EA -->|Schema B| IP -->|Schema C| ES -->|Schema D| Query;
```
the `message` field and meta information around the host was added to the event. Schema C is the one that is defined in integration packages in the `fields.yml` files for each data stream. In the following, our focus is on Schema B, with mentions of Schema A.
Schema B is always in JSON format and is the output generated by Elastic Agent. It contains the meta information about the event itself, like host or k8s metadata. As Schema B is in JSON format, shipping it to Elasticsearch could in theory be done with a curl request taking the JSON doc as the request body. By sending it to the correct data stream, processing would also happen and the data would be persisted.
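The "ship Schema B JSON straight to a data stream" idea can be sketched by building an Elasticsearch `_bulk` request body by hand (data streams only accept the `create` action; the data-stream name here is just an example):

```python
import json

def bulk_payload(data_stream, docs):
    """Build the NDJSON body of an Elasticsearch _bulk request targeting a
    data stream: one `create` action line per document, followed by the
    document itself, terminated by a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": data_stream}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

The resulting payload would be POSTed to `<es-url>/_bulk` with `Content-Type: application/x-ndjson`; an equivalent curl call is what the paragraph above alludes to.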
The elastic-integration-corpus-generator-tool has ways to generate data based on some config options and templates. In https://github.com/elastic/elastic-integration-corpus-generator-tool/issues/39 multiple approaches to templating are discussed. What they all have in common:
What I skipped above is the fields definition for Elasticsearch, which is contained in the tool but is not needed in the context of packages, as Schema C is already defined as part of the package. What is needed in addition is a configuration file for the data generator to decide how much data should be generated, the time range, etc. In the tool this is currently done through command line parameters.
The assumption is that for a single dataset in an integration package, different scenarios could be generated. Let's take package `foo` with dataset `bar` as an example. The following files would exist:
- `foo/data_stream/bar/_dev/data_generation/config.yml`
- `foo/data_stream/bar/_dev/data_generation/template1.tmpl`
- `foo/data_stream/bar/_dev/data_generation/template2.tmpl`
- `foo/data_stream/bar/_dev/data_generation/template1-config.yml`
- `foo/data_stream/bar/_dev/data_generation/template2-config.yml`
In the example above, two templates, each with a config file, are used. The `.tmpl` file contains the JSON template for the event, and `template1-config.yml` contains the definition of the fields for the template. It would also be possible to have just one definition for each template.
The `config.yml` contains a list of scenarios that should be generated. It could look similar to:
```yaml
data_generation:
  - name: short-sample
    timerange: 2d
    events: 1000
    template: template1
    # See spigot for more options https://github.com/leehinman/spigot
    output: elasticsearch
  - name: middle-sample
    timerange: 10d
    events: 10000
    template: template1
    output: elasticsearch
  - name: large-sample
    timerange: 2d
    events: 1000000
    template: template2
    output: rally
```
More config options could be added. The goal is to show that multiple data generations can be configured. Having all the setup done, elastic-package can be used to generate the data:
```
elastic-package data generate --package=foo --dataset=bar --name=large-sample
```
The parameters are optional. If the command is run inside a package, it would by default apply to all datasets and all tasks, or one can be selected. As can be seen in the above example, an output format can also be specified. The data can be stored in rally track format or sent to Elasticsearch directly.
Behind the scenes, elastic-integration-corpus-generator-tool is used to generate the events out of the templates, and spigot to generate the relevant outputs.
The generation of schema A would look very similar. But to ship schema A to Elasticsearch, a running Elastic Agent is needed. Similar to schema B, package-spec could contain config options on how to generate it. It would require in addition some logic around how schema A is collected to run an Elastic Agent for collection of it. There is a good chance, all these configs already exist in the data stream and can be used.
Generation of Schema A I see as only a second priority of this project.
One of the goals of the generation of Schema B is to be able to create rally-track-compatible data. To create a full rally track, it is also required to export templates and ingest pipelines from the package. `elastic-package` already has an `export` command that can be used for this.
In an ideal scenario, a user could run `elastic-package benchmark --dataset=foo` and, behind the scenes, data would be generated, a rally track created, setup done with esbench, pipelines and templates loaded, data ingested to Elasticsearch through the pipeline, and measurements provided on the performance of this run. At first, there might be some additional manual steps required.
Some thoughts on the benchmark topics, but don't want to sidetrack the discussion, maybe we can sync offline.
> In an ideal scenario, a user could run elastic-package benchmark --dataset=foo and behind the scenes, data would be generated, rally track created, setup is done with esbench, pipelines and templates are loaded and data is ingested to Elasticsearch through the pipeline and measurements are provided on the performance for this run. At first, there might be some additional manual steps required.
It would be nice to have `elastic-package` orchestrate whatever is required to execute a benchmark in the context of the integration itself, meaning: start the agent, configure the agent, configure the remote ES (index templates, pipelines, ...), monitor the agent, tear down the agent, collect results. Having `elastic-package` focused only on the component, and not orchestrating a more complicated full environment, would be a good starting point.
Along these lines, I drafted this for my own reference, so it is not intended to be complete but just a bigger picture of how this would look for cases such as the Schema A scenarios mentioned above.
So ideally we would have everything required to run benchmarks self-contained in elastic-package and as part of the integration definitions, and esbench would be more of a description for a more permanent benchmark setup, as it is today for other cases, if I am not mistaken.
Following up on what @marc-gr wrote, we can also consider that data generation and data usage may not necessarily happen consecutively, and keep this decoupling in mind. The benefit would be that generator tools would be able to generate and store data in a generic storage, and a tool that leverages those data would be able to "replay" them without the generation step, which may be compute-intensive (thus either being bounded by compute resources or requiring extensive resources to run at the desired scale). (As discussed with @ruflin, some tools, like https://github.com/elastic/rally, already support loading from S3.)
> we can also consider that data generation and data usage may not necessarily happen consecutively and keep this decoupling in mind
++, we should not only consider it but make sure it is decoupled. I expect that by default, when data generation is used, the data is written to disk (in some format, maybe rally format?). That doesn't mean there can't eventually be commands that bring it all together in one flow.
Status update:
- Schema X data

Next steps (priority order to be defined):
- `_dev/data_generation/schema` generation proposal from @ruflin
- `elastic-package` command for generating data based on a template ("hook" to `genlib.NewGeneratorWithCustomTemplate`)
- `elastic-package` command for benchmark (as for Rally track generation proposal from @ruflin)

Thank you @aspacca !
I have an additional question about `elastic-integration-corpus-generator-tool`: could it be used to resolve an issue where, in a fresh cluster install, before the first alert is generated, the `.alerts-security.alerts-default` index doesn't exist? For example, an integration package that has a transform that reads from that index will error with `no such index [.alerts-security.alerts-default]` until we go in and make sure an alert is created. Example: the CI job failing on this PR.
hi @susan-shu-c

> I have an additional question about `elastic-integration-corpus-generator-tool`: could it be used to resolve an issue where, in a fresh cluster install, before the first alert is generated, the `.alerts-security.alerts-default` index doesn't exist? For example, an integration package that has a transform that reads from that index will error with `no such index [.alerts-security.alerts-default]`
`elastic-integration-corpus-generator-tool` just generates data on the machine you run it on: nothing prevents you from generating a template that contains the payload of a bulk request for `.alerts-security.alerts-default`, but you still have to send it to the cluster. You would have to add this to your CI/wherever as a preemptive step.

Unless you need to generate a lot of data with random content, there's probably no need to make use of `elastic-integration-corpus-generator-tool`, since what would solve the issue in CI (as far as I understand the issue) is ingesting documents into `.alerts-security.alerts-default`.
On the broader scope of having an `elastic-package benchmark` command, your case is anyway something that has to be addressed, because potentially you should be able to run something like `elastic-package benchmark host_risk_score`. I'm not familiar with security integrations: I see, for example, that `host_risk_score` does not have a `./data_stream/billing/fields` folder; that's where `elastic-integration-corpus-generator-tool` gets the information in order to generate "Schema C" data.
cc @leehinman
Updates on the work being done. The first iteration is in progress; we are focusing on adding templates for `aws.ec2_logs`, `aws.ec2_metrics`, `aws.billing`, `aws.sqs`, and on integrating generation with `elastic-package`.
Week 5 (Jan 30 - Feb 2): PRs:
Ongoing discussions:
We are still in the first iteration, with the same goals. Progress has been slow in the last week due to SDH duties.
Week 6 (Feb 6-10): PR opened:
Week 7 (Feb 13-17): PR Merged:
The week is still in progress; we expect to open a new PR with the `aws.ec2_logs` template and to merge the `aws.billing` template.
Week 9 (Feb 27 - Mar 2):
New PRs:
Merged PRs:
The first iteration is still ongoing. Progress has been made in `elastic-package` and on templates.
Week 10 (Mar 6 - 10):
Merged:
We are only missing one template to close the first iteration.
Discussions:
New:
Merged: none
Today we conclude the first iteration. All planned templates are available.
Merged:
Releases: v0.5.0
Related:
During the first iteration we created Schema B templates for `aws.ec2_logs`, `aws.ec2_metrics`, `aws.billing`, `aws.sqs` and `k8s`.
- Library refactoring
- Adding `benchmark rally` command to `elastic-package` (PR needs to be created, est. 1 day)

Regarding the above: the package-spec PR has been merged, and we are now waiting for a new release; before doing that we have this pending PR (under review).
All the PR dependencies have been merged. @aspacca, could you kick off the next steps, please?
@aspacca Could you share an update on where we are today and what the next and remaining steps are, please?
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, `elastic-package` issue]
- `elastic-package` [issue]
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]

- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, `elastic-package` issue]. No work left, blocked by package-spec@v3
- `elastic-package` [issue]. Est. 5 days of coding, with potential external dependencies
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies

> no work left, blocked by package-spec@v3
@aspacca Can you share some more details on what part of v3 this is blocked on? Anything we can do on our end to get this unblocked?
Duplicating schemas: will the schemas change in some way when moving to elastic-package? Context for my asking: after moving 1-2 datasets as examples, could the teams themselves move the assets over?
> Can you share some more details on what part of v3 this is blocked on? Anything we can do on our end to get this unblocked?
We dropped a deprecated field and renamed the new one, taking the occasion of the breaking change in v3 (no one but `package-spec` and tests in `elastic-package` is using the spec for system/rally benchmark yet, so we didn't need to support a migration path).
The PR is blocked waiting for the spec changes described above to be merged in v3, in order to pass CI.
> Duplicating schemas: will the schemas change in some way when moving to elastic-package? Context for my asking: after moving 1-2 datasets as examples, could the teams themselves move the assets over?
No, they won't change, apart from reviewing the flattened-objects notation that I personally forgot to consider when creating the schemas in the first place. The teams would then be independent in duplicating/migrating the schemas.
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR #1~, ~`package-spec` PR #2~, `elastic-package` issue, `elastic-package` PR]. No work left, waiting for CI to be green with the release of package-spec@v3
- `elastic-package` [issue]. Est. 5 days of coding, with potential external dependencies
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, ~`elastic-package` issue~, ~`elastic-package` PR~].
- `elastic-package` [issue, PR]. PR waiting for review.
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, ~`elastic-package` issue~, ~`elastic-package` PR~].
- `elastic-package` [issue, PR]. PR in review.
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. No external dependencies
- Rename `size` param to `totEvents` param in generator's API [~PR~]
- Adapt `elastic-package` according to new generator's API [~`package-spec` PR~, ~`elastic-package` issue~, ~`elastic-package` PR~].
- `elastic-package` [~issue~, ~PR~].
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 2/6 through

- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 6/6 through, waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR]. Waiting for review
- `(now - period) + ((period / tot events) * nth event)` [PR]
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 3/6 merged, 3/6 waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR]. Ready to merge
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]. Waiting for review
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 5/6 merged, 1/6 waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR].
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]. Waiting for review
- `elastic-package benchmark rally`: support install package from registry and local corpus. Est. 1 day
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue]. Est. 1 day of coding for each dataset's integration. 5/6 merged, 1/6 waiting for review
- `benchmark stream` command [issue]. Est. 5 days
- `benchmark rally` command [issue, PR].
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]. Waiting for review
- `elastic-package benchmark rally`: support install package from registry and local corpus [PR]. Waiting for review
- Move `schema-b` content from the generator's repo directly to `elastic-package` [issue].
- `benchmark stream` command [issue, PR]. Waiting for review
- `benchmark rally` command [issue, PR].
- `(now - period) + ((period / tot events) * nth event)` [PR]
- `range.from/to` for type date in the generator tool [PR]
- `elastic-package benchmark rally`: support install package from registry and local corpus [PR]. Waiting for review
- `benchmark rally` command [issue]

The assets (templates, `fields.yml` and `config.yml`) were generated for the following datasets in the integrations repo:
Continuous refinement of some existing assets is ongoing, and new assets for new datasets are continuously added.
In the `elastic-package` repo:
- `elastic-package benchmark rally`, in order to generate and run a rally track from the root folder of an integration for a specific dataset. Several options are provided, like only generating the rally track with the related corpus, persisting the rally track and the related corpus, or replaying an existing generated rally track with the related corpus.
- `elastic-package benchmark stream`, in order to stream ingestion to an ES cluster from the root folder of an integration for one or multiple datasets at once. An option to backfill events for a configurable amount of time before the command is run is provided.

Further enhancements are already planned, like decoupling the location the commands need to be launched from (the root folder of an integration), improving the automation experience according to relevant audiences, and an internal refactoring of the existing duplicated code, among other things but not limited to those.
In the `elastic-integration-corpus-generator-tool` repo:
- `seed` for the rand package, and the time to be used as `Time.Now()`, in order to generate reproducible content
- `range.from/to` for date fields, in order to set time bounds on the generated values (similar to numeric `range.min/max`)
- `@timestamp` field

Adding support for the counter numeric field is already ongoing, and a big refactoring of the configuration around `cardinality` is planned and already designed; it will be the next implementation. This refactoring is a breaking change we deemed necessary, and its highest priority is given exactly by the fact that if we proceed with it now, the impact will be fairly reduced.
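Two of the generator features listed above can be illustrated with a short sketch: seeding for reproducible content, and the timestamp spacing formula `(now - period) + ((period / tot events) * nth event)` mentioned in the status updates. The function names here are illustrative, not the tool's actual API:

```python
import random

def event_timestamps(now, period, tot_events):
    """Spread tot_events timestamps evenly over [now - period, now),
    following (now - period) + ((period / tot_events) * nth_event)."""
    start = now - period
    step = period / tot_events
    return [start + step * nth for nth in range(tot_events)]

def reproducible_values(seed, n):
    """Seeding the random generator makes the produced corpus reproducible:
    the same seed always yields the same sequence of values."""
    rng = random.Random(seed)
    return [rng.randint(0, 9999) for _ in range(n)]
```

For example, 4 events over a 100-second period ending at t=1000 land at t=900, 925, 950 and 975, so a backfilled corpus covers the whole time range evenly.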
We identified the following areas of ownership:
When building integration packages, sample data is important to develop ingest pipelines and build dashboards. Unfortunately, in most cases real sample data is limited and often tricky to produce. This issue proposes a tool as part of elastic-package that can generate and load sample data.
Important: The following is only an initial proposal to better explain the problem and share existing ideas. A proper design is still required.
Why part of elastic-package
Generating sample data is not a new problem, and there are several tools which already provide partial solutions. A tool to generate sample data in elastic-package is needed to make it available in a simple way to each package developer. How sample data should look and be generated becomes part of the package spec. This way, someone building a package also directly gets the possibility of generating sample data and using it as part of the developer experience.
Data generation - metrics / logs
For data generation, two different types of data exist. Metrics and traces are mostly already in the format that will be ingested into Elasticsearch and require very little processing. Logs, on the other hand, often come as raw messages and require ingest pipelines or runtime fields to structure the data. The goal is for the tool to generate both types of data, but this can happen in iterations.
Metrics generation
For the generation of metrics, I suggest taking strong inspiration from the elastic-integration-corpus-generator-tool built by @aspacca. Instead of having to build separate config files, the config params for each field would be directly in the `fields.yml` of each data stream. The definition could look similar to the following; the exact syntax for each field type needs definition.
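Purely as a hypothetical illustration of the idea (the `generation` key and its sub-fields are assumptions, not part of any current spec), per-field config params embedded in `fields.yml` might look like:

```yaml
# Hypothetical sketch: the `generation` key is an assumption,
# not the actual package-spec syntax.
- name: aws.billing.estimated_charges
  type: double
  generation:
    range:
      min: 0
      max: 5000
- name: host.name
  type: keyword
  generation:
    cardinality: 100   # number of distinct values to draw from
```

The point is only that generation hints would live next to the field definitions instead of in separate config files.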
Logs generation
For logs generation, inspiration can be taken from the spigot tool by @leehinman. Ideally we could simplify this by allowing users to specify message patterns, something like `{@timestamp} {source.ip}`, and then specify for these fields what the values should be. The tool would then take over the generation of sample logs. Importantly, the log generation outputs message fields pre-ingest-pipeline.
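The pattern idea above can be sketched minimally: a map of per-field value generators and a renderer that fills `{field}` placeholders, producing raw pre-ingest-pipeline log lines. The generators here are hypothetical stand-ins; a real tool would derive them from the package configuration:

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical per-field generators; a real tool would build these from config.
FIELD_GENERATORS = {
    "@timestamp": lambda rng: (datetime(2023, 1, 1, tzinfo=timezone.utc)
                               + timedelta(seconds=rng.randint(0, 86400))).isoformat(),
    "source.ip": lambda rng: ".".join(str(rng.randint(1, 254)) for _ in range(4)),
}

def render_log(pattern, rng):
    """Fill a '{field}' pattern with generated values, producing a raw
    (pre-ingest-pipeline) log line."""
    line = pattern
    for field, gen in FIELD_GENERATORS.items():
        line = line.replace("{" + field + "}", gen(rng))
    return line
```

For example, `render_log("{@timestamp} {source.ip} GET /index.html", random.Random(0))` yields a timestamped line with a random IP, ready to be fed through an ingest pipeline.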
Generated data format
The proposed data structure generated by the tool is the one used by esrally. It contains 1 JSON doc per line with all the fields inside. This makes it simple to just deliver the data to Elasticsearch and makes it possible to potentially reuse some of this generated data with rally tracks.
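The "1 JSON doc per line" layout described above (often called NDJSON) can be sketched in a few lines; the writer below is only an illustration of the format, not rally's own tooling:

```python
import io
import json

def write_corpus(docs, out):
    """Write documents in the esrally corpus layout: one JSON object
    per line, with all fields inside the document."""
    for doc in docs:
        out.write(json.dumps(doc, separators=(",", ":")) + "\n")

buf = io.StringIO()
write_corpus([{"@timestamp": "2023-01-01T00:00:00Z", "message": "a"},
              {"@timestamp": "2023-01-01T00:00:01Z", "message": "b"}], buf)
```

Because each line is a standalone document, the same file can be streamed straight into an Elasticsearch bulk request or referenced as a corpus by a rally track.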
Non goals
A non-goal of the data generation and loading of data is to replace rally. Rally measures exact performance and builds reproducible benchmarks. When generating and loading data with elastic-package, it is about testing ingest pipelines, testing dashboards, and testing queries on larger sets of data in an easy way. The focus is on package development.
Another non-goal is to generate events that are related to each other. For some solutions it is important that if a `host.name` shows up, other parts of the data contain the same `host.name`, to be able to browse through the solution. This might be added at a later point but is not part of the scope.
Sample data storage
As the sample data can always be generated on the fly, it is not required to store it. If some of the sample data sets should be stored for later use, package-spec should provide a schema to reference sample datasets.
Command line
Command line arguments must be available to generate sample data for a dataset or a package and load it into Elasticsearch. Ideally, package-spec allows storing some config files describing which data sets can be generated, so a package developer can share these configs as part of the package.
Initial packages to start with
I recommend picking 2 initial packages to start with for the data generation. As k8s and AWS are both more complex packages that also generate lots of data, they could be a good start, focusing on the metrics part.
Future ideas