elastic / elastic-integration-corpus-generator-tool

Command line tool used for generating events corpus dynamically given a specific integration

remove coupling between fields with the same cardinality #128

Open tommyers-elastic opened 7 months ago

tommyers-elastic commented 7 months ago

To illustrate this problem, consider generating 15 events with the following configuration:

```yaml
fields:
  - name: a
    range:
      min: 0
      max: 50
    cardinality: 5
  - name: b
    range:
      min: 0
      max: 100
    cardinality: 5
```

a is a number between 0 and 50, and the generated events contain 5 unique values of a; b is a number between 0 and 100, and the generated events contain 5 unique values of b.

In this configuration there is no explicit coupling between fields a and b. However, when this is run, the output is as follows:


```
{ "a-b": 10-51 }
{ "a-b": 21-37 }
{ "a-b": 20-58 }
{ "a-b": 48-16 }
{ "a-b": 49-84 }
{ "a-b": 10-51 }
{ "a-b": 21-37 }
{ "a-b": 20-58 }
{ "a-b": 48-16 }
{ "a-b": 49-84 }
{ "a-b": 10-51 }
{ "a-b": 21-37 }
{ "a-b": 20-58 }
{ "a-b": 48-16 }
{ "a-b": 49-84 }
```

Notice how there are 5 unique documents here, repeated 3 times. The fact that the fields have the same cardinality causes them to be coupled.

This behaviour is confusing, and can cause unwanted repetition in the generated data.

The correct behaviour can be observed with enum types, which also have a well-defined cardinality (the number of enum values).

```yaml
fields:
  - name: region
    enum: ['NASA', 'APAC', 'EMEA']
  - name: team
    enum: ['A', 'B', 'C']
```

Note that in this configuration both fields have a cardinality of 3. In the generated data there is no coupling. Here are 9 generated data points:


```
{ "sales-team": EMEA-A }
{ "sales-team": EMEA-C }
{ "sales-team": APAC-A }
{ "sales-team": APAC-C }
{ "sales-team": APAC-A }
{ "sales-team": EMEA-B }
{ "sales-team": NASA-C }
{ "sales-team": APAC-C }
{ "sales-team": NASA-C }
```

Another strange behaviour: if I explicitly set the cardinality values for these fields (3 and 3 respectively), one would expect no effect, since each field's cardinality is already 3. Instead, doing this causes only 3 unique values of sales-team to appear in the output, repeated over and over.

```yaml
fields:
  - name: region
    enum: ['NASA', 'APAC', 'EMEA']
    cardinality: 3
  - name: team
    enum: ['A', 'B', 'C']
    cardinality: 3
```

->

```
{ "sales-team": EMEA-A }
{ "sales-team": APAC-B }
{ "sales-team": NASA-C }
{ "sales-team": EMEA-A }
{ "sales-team": APAC-B }
{ "sales-team": NASA-C }
{ "sales-team": EMEA-A }
{ "sales-team": APAC-B }
{ "sales-team": NASA-C }
```

This implicit coupling of values with the same cardinality should be removed, and replaced with a more explicit way to enable coupling between values (which is often required).

aspacca commented 7 months ago

> This implicit coupling of values with the same cardinality should be removed, and replaced with a more explicit way to enable coupling between values (which is often required).

@tommyers-elastic

First of all, I'd like to be sure we share some same vocabulary :)

The cardinality of a field refers to the number of distinct values that it can have. What we define here as the "cardinality of a field" is different from the cardinality config entry: the latter defines the number of different values actually generated for a specific field.

I guess that having "cardinality" refer to two different things in the context of the corpus generator is what led to your confusion when making the example of the two enum fields with and without the cardinality config entry.

Is it clearer what to expect, given the clarification above?

The implicit coupling of fields with the same value for their cardinality config entry is a consequence of the current implementation. When a field has a cardinality config entry defined, a different value is generated sequentially at each iteration, up to the number defined by cardinality. Once this limit is reached, no more values are generated; from the next iteration on, the previously generated values are reused, one after another, in the order they were generated. Picking a random value from the pool of generated ones instead, until none is left, and then starting the cycle again, will solve the implicit coupling. I will start by changing the implementation this way and evaluate the performance impact.
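The described behaviour and the proposed fix can be sketched roughly as follows (illustrative Python only, not the tool's actual Go implementation; the pools and field names are hypothetical stand-ins for the generated values):

```python
import itertools
import random

def cyclic_values(pool):
    """Current behaviour (as described above): once `cardinality` values
    exist, reuse them in generation order, in lockstep with the event index."""
    return itertools.cycle(pool)

def random_values(pool):
    """Proposed behaviour: draw the pool in a fresh random order on each
    pass, so two fields with equal cardinality no longer move in lockstep."""
    while True:
        yield from random.sample(pool, len(pool))

# Hypothetical pools standing in for 5 generated values of fields a and b.
a_pool = [10, 21, 20, 48, 49]
b_pool = [51, 37, 58, 16, 84]

coupled = [(a, b) for _, a, b in zip(range(15), cyclic_values(a_pool), cyclic_values(b_pool))]
# Only 5 distinct (a, b) tuples appear, each repeated 3 times.
assert len(set(coupled)) == 5

decoupled = [(a, b) for _, a, b in zip(range(15), random_values(a_pool), random_values(b_pool))]
# Each field still has exactly 5 distinct values, but the tuples are no
# longer locked to 5 fixed combinations.
assert len({a for a, _ in decoupled}) == 5
```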

Another step is to identify the specific behaviour of "coupling" between fields. Given the current implementation of the cardinality config entry and how we use it to create a relationship between fields, "coupling" currently has the specific meaning of repeating the same tuple of values for two or more fields every N generated events (where N is the value of the cardinality config entry).

In more detail, "coupling" fields through the cardinality config entry, as defined above, does not require using the same cardinality value for every field. In general, assuming 10 generated events and the same cardinality value for every field, we get 10 events with 10 different tuples, where the Nth elements across all generated tuples have no repeated values. Since in a generated tuple the Nth element carries the value of the Nth field, this produces a 1:1 ratio between the fields, across all values generated in all the tuples. Different ratios can be achieved with different cardinality config entry values reflecting the ratio to be achieved: e.g., if we want to generate 10 events with 10 different values for fieldA and 5 different values for fieldB (a 1:2 ratio), we define cardinality: 10 for fieldA and cardinality: 5 for fieldB.
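Under the sequential-then-cycle reuse described above, the ratio arithmetic can be checked with a small sketch (illustrative Python, not the tool's code; value generation is simplified to integer counters):

```python
import itertools

def generate(events, card_a, card_b):
    # Simplified model of the behaviour described above: each field cycles
    # through `cardinality` distinct values in generation order.
    a = itertools.cycle(range(card_a))
    b = itertools.cycle(range(card_b))
    return [(next(a), next(b)) for _ in range(events)]

# cardinality: 10 for fieldA and cardinality: 5 for fieldB, over 10 events.
tuples = generate(events=10, card_a=10, card_b=5)
assert len({a for a, _ in tuples}) == 10  # 10 distinct fieldA values
assert len({b for _, b in tuples}) == 5   # 5 distinct fieldB values
# The 1:2 ratio: every fieldB value co-occurs with exactly 2 fieldA values.
for b in range(5):
    assert len({a for a, bb in tuples if bb == b}) == 2
```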

Is the "coupling" behaviour you have in mind any different from the above? Could you please specify the exact behaviour you would like to achieve?

If the behaviour is the same, is it correct that the issue is more about expressing this behaviour in a different, more explicit way? Do you have any suggestion on how it could be expressed better? How does defining the number of different values actually generated for a specific field (the cardinality config entry) fit into expressing the coupling behaviour? Should it be possible to couple two fields without defining the number of different values actually generated for each?

Thanks :)

aspacca commented 7 months ago

@tommyers-elastic

let me add some more context :)

cardinality config entry is currently the way to express the following: "generate events with a total of 10 different k8s namespaces and 50 different services, so that every single different namespace has always the same 5 different services, keeping an even distribution"

we are asking for multiple things at once:

  1. the number of different values to generate for namespace and service
  2. the ratio between every single namespace value and multiple service values
  3. the fact that the above ratio is respected for each different single namespace value

From my point of view only 2. is non-negotiable: no matter how much we have to stretch, if all three cannot be achieved at the same time, 2. must remain.

I'm quite confident we could totally give up on 3.: it's something that we could simply not support.

I'm not quite sure about 1.: at the beginning the request was slightly different and more dynamic, something like "the percentage of events with a single different value for namespace and service". For a concrete example: I want each 1/10 (i.e. 10%) of the events to have a different namespace, and each 1/5 (i.e. 20%) of that 10% to have a different service. When generating 100 events this produces 10 namespaces and 50 services; when generating 200 events it produces 20 namespaces and 100 services.

I have the impression that this initial formulation better suits what we want to achieve, especially in terms of how to express it, simply because we reduce 1. and 2. to the same concept: a ratio. 1. is a ratio between the total number of events and the distinct values of a single field, and 2. is a ratio between the distinct values of multiple fields.
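For illustration, the ratio-based formulation reduces to simple arithmetic (hypothetical helper; no such config option exists in the tool today):

```python
def counts_from_ratios(total_events, namespace_ratio=0.10, service_ratio=0.20):
    """Each `namespace_ratio` slice of all events gets its own namespace;
    each `service_ratio` slice of a namespace's events gets its own service."""
    namespaces = round(total_events * namespace_ratio)
    services_per_namespace = round(1 / service_ratio)
    return namespaces, namespaces * services_per_namespace

# The examples from the comment above:
assert counts_from_ratios(100) == (10, 50)
assert counts_from_ratios(200) == (20, 100)
```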

what are your thoughts on that?

cc @ruflin , @gizas : I'd like to have your opinion as well.

especially @gizas's, since you have a very concrete case with the k8s container and pod dataset. how relevant is it for you to generate exactly 10 namespaces and 50 services, vs each 10% of total events having a different namespace, and 20% of each namespace having a different service?

if we all agree with the above, I'd say the best way is to remove the cardinality config entry at the field level, and have a dedicated section in the config file for defining the ratios between fields and between each single field and the total number of events.

aspacca commented 7 months ago

> if we all agree with the above, I'd say the best way is to remove the cardinality config entry at the field level, and have a dedicated section in the config file for defining the ratios between fields and between each single field and the total number of events.

one last note on the above: this implies that we have a known total number of events. that's not always the case; see for example the elastic-package benchmark stream command. in this scenario we can still apply the ratio/coupling configuration, with the caveat that, lacking a total number of events, we will respect the configuration relative to an arbitrary number of events (which could also be dynamically calculated as the minimum number of events required by a specific configuration).

ruflin commented 7 months ago

Thanks @aspacca for the details. It explains why the current behaviour exists / works the way it does.

> Picking instead a different random value from the pool of generated ones, until no one is left and starting again the cycle will solve the implicit coupling. I will start with changing the implementation this way and evaluate the impact of performance.

That was the surprising part for me, as I had (falsely) assumed that with a cardinality of 5, a value would be picked randomly each time.

For 1. / 2. / 3. above, there is always the option to add more config options to define some coupling and ordering of fields.

gizas commented 7 months ago

> how much relevant is for you to generate exactly 10 namespaces and 50 services, vs each 10% of total events with a different namespace, and 20% of each namespace with a different service?

Until now our only need has been to generate a fixed amount of resources, like 10 namespaces and 50 services. But I see your point for the other examples.

I guess that what @aspacca says, "this implies that we have a known total number of events", is a key factor for the problem. For me, the fact that cardinality replays the same field values in the same order was not an important issue; in fact, in some k8s cases we expect events to come in a specific order.

Some ideas from reading the above: in the past I was trying to use sprig slice functions in combination with {{range pipeline}} T1 {{end}} as described here. I guess this can cover your scenarios, can't it? @aspacca, I guess this templating would work.

Additionally, the object type could be a solution to group fields with a specific ratio, like in the example below. What do you think?

```yaml
  - name: aws.dimensions.*
    object_keys:
      - TableName
      - Operation
  - name: aws.dimensions.TableName
    enum: ["table1", "table2"]
  - name: aws.dimensions.Operation
    cardinality: 2
```

aspacca commented 7 months ago

> cardinality config entry is currently the way to express the following: "generate events with a total of 10 different k8s namespaces and 50 different services, so that every single different namespace has always the same 5 different services, keeping an even distribution"

you can use the cardinality config entry to express something like this as well: "generate events with a total of 101 different hostnames and 99 different fetch_count values, so that we have combinations of the 101 different hostnames and the 99 different fetch_count values; once all the possible combinations have been generated, repeat them in the same order"

again, we are asking for multiple things at once:

  1. the number of different values to generate for hostname and fetch_count
  2. the fact that we want to generate a combination of hostname and fetch_count values
  3. the fact that all combinations will be repeated in the same order

both for this scenario and for the one mentioned before, each 3. is basically just a consequence of the current implementation. I propose not to care about them: values might or might not be repeated in the same order, or have an even distribution.

for both scenarios, the number of different values actually generated (1.) is the only thing the cardinality config entry should have an impact on. more specifically, we should ensure that, whatever cardinality values we set for any fields, no "coupling" happens and no ordered repetition happens.

"coupling" must be expressed in a different, independent way, without need for cardinality.

still, @tommyers-elastic, @gizas, please give me your definition of "coupling", or better: what do you need exactly? :)

@gizas, from what I know, your case is being able to define how many different namespaces must be generated, and for each namespace how many different services and/or pods must be generated, with the same service/pod never generated in different namespaces. correct?

does it boil down to the same for you, @tommyers-elastic ?

aspacca commented 7 months ago

> I was trying in the past to use sprig slice functions in combination with {{range pipeline}} T1 {{end}} as described here. I guess this can cover your scenarios is not it? @aspacca this templating will work I guess

not sure I get what you mean :) can you give an example of a template with slice functions and a range pipeline, and what the outcome is? thanks!

tommyers-elastic commented 7 months ago

i think we should not overload the meaning of 'cardinality' beyond what an elastic developer would assume it means, i.e. the number of unique elements in a set. if we are having to explain the distinction between 'cardinality of a field' and 'cardinality configuration' to the people in this thread, let alone the rest of the company, then we need to simplify.

i think implementing random selection to remove repetition from fields with the same cardinality is a good idea.

as far as I can tell, the only other concrete use-case mentioned here relating to 'coupling' is:

"generate events with a total of 10 different k8s namespaces and 50 different services, so that every single different namespace has always the same 5 different services, keeping an even distribution"

in order to solve this, how about an additional configuration option which specifies a 'parent' field? in this case, the cardinality config would apply to each 'instance' (i.e. value) of the parent field. so here, each time a namespace-id value is generated, 5 unique values of service-id are generated. a downside of this approach is that there may be overlap between the values generated for different parent instances. this can be mitigated by selecting values from a large range, as in the example. i would avoid 'fixing' this overlap in the implementation, because it adds caveats to the meaning of 'cardinality' which again could cause confusion. the overlap could also be avoided using a type like uuid - do we have this?

```yaml
fields:
  - name: namespace-id
    range:
      min: 0
      max: 4294967295
    cardinality: 10
  - name: service-id
    parent: namespace-id
    range:
      min: 0
      max: 4294967295
    cardinality: 5
```

WDYT?
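A rough model of the proposed parent semantics (illustrative Python only; the parent option is a proposal, not an existing feature of the tool, and value generation is simplified to random integers from the example's range):

```python
import random

def generate(events, parent_card, child_card, lo=0, hi=4294967295):
    """Each distinct parent value gets its own pool of `child_card` child
    values; pools are drawn independently, so overlap between pools is
    possible but unlikely over a large range."""
    parents = [random.randint(lo, hi) for _ in range(parent_card)]
    children = {p: [random.randint(lo, hi) for _ in range(child_card)] for p in parents}
    out = []
    for i in range(events):
        p = parents[i % parent_card]
        out.append({"namespace-id": p, "service-id": random.choice(children[p])})
    return out

docs = generate(events=100, parent_card=10, child_card=5)
assert len({d["namespace-id"] for d in docs}) == 10
# At most 10 parents x 5 children = 50 distinct service-ids overall.
assert len({d["service-id"] for d in docs}) <= 50
```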

aspacca commented 7 months ago

> i think we should not overload the meaning of 'cardinality' from what an elastic developer would assume it means, i.e. the number of unique elements in a set. if we are having to make distinctions about 'cardinality of a field' vs 'cardinality configuration' to the people in this thread, let alone the rest of the company, then we need to simplify.

the distinction is required not because the meaning of cardinality is overloaded, but because the sets we refer to with "the number of unique elements in a set" are different :)

> i think implementing random selection to remove repetition from fields with the same cardinality is a good idea.

oki on that

> as far as I can tell, the only other concrete use-case mentioned here relating to 'coupling' is: "generate events with a total of 10 different k8s namespaces and 50 different services, so that every single different namespace has always the same 5 different services, keeping an even distribution" in order to solve this, how about an additional configuration option, which specifies a 'parent' field. in this case, the cardinality config would apply to each 'instance' (i.e. value) of the parent field. so here, each time a namespace-id value is generated, 5 unique values of service-id are generated.

please see https://github.com/elastic/integrations/blob/main/packages/aws/_dev/benchmark/rally/ec2metrics-benchmark/config.yml#L4-L12 for a different concrete use-case:

you have multiple fields with cardinality, and you want those fields to be linked (meaning that for field1/valueX you always want field2/valueY)

do you think that a different use-case should have a different configuration?

so for the use-case you mentioned, we go for parent, and for the one I've mentioned we go for something else?

> a downside to this approach is that there may be overlap between values generated for each parent instance. this can be mitigated by selecting values from a large range though, as in the example. i would avoid 'fixing' this overlap in the implementation, because it adds caveats into the meaning of 'cardinality' which again could cause confusion. this overlap could also be solved using a type like uuid - do we have this?

I would say the meaning of "cardinality" stays the same, as do the sets it refers to. it's just that ensuring there is no overlap across all sets is a hidden property that's not evident from the configuration itself.

still, it is often required to have this property; how do you suggest expressing it so as not to cause confusion? is it not enough to document the property?

tommyers-elastic commented 7 months ago

in this second use case, do the two fields always require the same cardinality?