elastic / elastic-integration-corpus-generator-tool

Command line tool used for generating events corpus dynamically given a specific integration

add formatting pattern support #151

Open gpop63 opened 2 months ago

gpop63 commented 2 months ago

Overview

This PR introduces the capability to generate field values in a specific format.

A set of standard pattern generators is added: ipv4, ipv6, port, and string. A regular expression identifies the formatting patterns in the field value; each placeholder must use the {generator} form.

Example:

- name: hostIP
  cardinality: 25
  formatting_pattern: "{ipv4}:{port}"
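
The placeholder expansion described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `placeholderRe`, `generators`, and `expandPattern` are hypothetical, and only two of the four generators are shown.

```go
package main

import (
	"fmt"
	"math/rand"
	"regexp"
)

// placeholderRe matches a {generator} placeholder such as {ipv4} or {port}.
// (Hypothetical name; the PR's real expression may differ.)
var placeholderRe = regexp.MustCompile(`\{([a-z0-9]+)\}`)

// generators maps a placeholder name to a function producing a random value.
// These implementations are illustrative assumptions.
var generators = map[string]func() string{
	"ipv4": func() string {
		return fmt.Sprintf("%d.%d.%d.%d",
			rand.Intn(256), rand.Intn(256), rand.Intn(256), rand.Intn(256))
	},
	"port": func() string { return fmt.Sprintf("%d", rand.Intn(65536)) },
}

// expandPattern replaces every known placeholder with a freshly generated
// value. Placeholders without a registered generator are left untouched.
func expandPattern(pattern string, gens map[string]func() string) string {
	return placeholderRe.ReplaceAllStringFunc(pattern, func(m string) string {
		name := m[1 : len(m)-1] // strip the surrounding braces
		if gen, ok := gens[name]; ok {
			return gen()
		}
		return m
	})
}

func main() {
	// Prints a random value shaped like "<ipv4>:<port>".
	fmt.Println(expandPattern("{ipv4}:{port}", generators))
}
```

Passing the generator map as an argument keeps the function easy to test with fixed values.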

Test with actual config

configs.yml

```yaml
fields:
  - name: cloud.region
    enum: ["us-east-1", "us-east-2", "us-west-1", "us-west-2", "ap-south-1", "ap-northeast-3", "ap-northeast-2", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "ca-central-1", "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3", "eu-north-1", "sa-east-1", "af-south-1", "ap-east-1", "ap-south-2", "ap-southeast-3", "eu-south-2", "eu-central-2", "me-south-1", "me-central-1"]
    cardinality: 25
  - name: cloud.account.id
    value: "123456789"
  - name: cloud.account.name
    value: sample-account
  - name: aws.billing.currency
    value: "USD"
  - name: aws.billing.ServiceName
    # NOTE: When empty the data refers to estimated charged for the entire account. We cannot reproduce the content (as it's a sum of previous data) but we want to provide the case.
    enum: ["", "AWSCloudTrail", "AWSCodeArtifact", "AWSConfig", "AWSCostExplorer", "AWSDataTransfer", "AWSELB", "AWSLambda", "AWSMarketplace", "AWSQueueService", "AWSSecretsManager", "AWSServiceCatalog", "AWSSystemsManager", "AWSXRay", "AmazonApiGateway", "AmazonCloudWatch", "AmazonCognito", "AmazonDynamoDB", "AmazonEC2", "AmazonECR", "AmazonEKS", "AmazonKinesis", "AmazonKinesisFirehose", "AmazonRDS", "AmazonRedshift", "AmazonRoute53", "AmazonS3", "AmazonSNS", "AmazonVPC", "awskms"]
  - name: agent.id
    value: "12f376ef-5186-4e8b-a175-70f1140a8f30"
  - name: agent.ephemeral_id
    value: "5fd278ce-2a12-4a09-a125-0c5b39aa69e3"
  - name: agent.name
    value: "host.local"
  - name: metricset.period
    value: 86400
  - name: aws.billing.group_definition.key
    # NOTE: repeated values are needed to produce 10% cases with "" value
    enum: ["", "AZ", "INSTANCE_TYPE", "SERVICE", "LINKED_ACCOUNT", "AZ", "INSTANCE_TYPE", "SERVICE", "LINKED_ACCOUNT"]
  - name: event.duration
    range:
      min: 1
      max: 1000
  - name: aws.billing.EstimatedCharges
    cardinality: 25
    fuzziness: 0.2
  - name: aws.billing.AmortizedCost.amount
    cardinality: 25
    fuzziness: 0.2
  - name: aws.billing.BlendedCost.amount
    cardinality: 25
    fuzziness: 0.2
  - name: aws.billing.NormalizedUsageAmount.amount
    cardinality: 25
    fuzziness: 0.2
  - name: aws.billing.UnblendedCost.amount
    cardinality: 25
    fuzziness: 0.2
  - name: aws.billing.UsageQuantity.amount
    cardinality: 25
    fuzziness: 0.2
  - name: aws.billing.group_definition.type
    value: "DIMENSION"
  - name: aws.billing.group_by.INSTANCE_TYPE
    enum: ["NoInstanceType", "a1.large", "c5.2xlarge", "c5.xlarge", "c6i.2xlarge", "db.r6g.2xlarge", "db.t2.micro", "dc2.large", "m5.large", "t1.micro", "t2.medium", "t2.micro", "t2.small", "t2.xlarge", "t3.2xlarge", "t3.medium", "t3.xlarge", "t3.xlarge"]
  - name: aws.billing.group_by.SERVICE
    enum: ["Amazon Simple Storage Service", "Amazon Elastic Compute Cloud - Compute", "EC2 - Other", "Amazon Kinesis", "Amazon Relational Database Service", "Amazon Elastic Load Balancing", "AmazonCloudWatch", "AWS CloudTrail", "AWS Config", "AWS Key Management Service", "AWS Lambda", "AWS Secrets Manager", "AWS Service Catalog", "Amazon API Gateway", "Amazon DynamoDB", "Amazon EC2 Container Registry (ECR)", "Amazon Elastic Container Service for Kubernetes", "Amazon Kinesis Firehose", "Amazon Redshift", "Amazon Simple Notification Service", "Amazon Simple Queue Service", "Amazon Virtual Private Cloud"]
  - name: path
    cardinality: 25
    formatting_pattern: "/home/{string}/{string}/{string}/{string}"
  - name: hostIP
    cardinality: 25
    formatting_pattern: "{ipv4}:{port}"
```

fields.yml

```yaml
- name: timestamp
  type: date
- name: path
  type: keyword
- name: hostIP
  type: keyword
- name: cloud.region
  type: keyword
- name: cloud.account.id
  type: keyword
- name: cloud.account.name
  type: keyword
- name: event.duration
  type: long
- name: metricset.period
  type: long
- name: aws.billing.currency
  type: keyword
- name: aws.billing.EstimatedCharges
  type: float # positive
- name: aws.billing.ServiceName
  type: keyword
- name: aws.billing.AmortizedCost.amount
  type: float # positive
- name: aws.billing.BlendedCost.amount
  type: float # positive
- name: aws.billing.NormalizedUsageAmount.amount
  type: integer # positive
- name: aws.billing.UnblendedCost.amount
  type: float # positive
- name: aws.billing.UsageQuantity.amount
  type: integer # positive
- name: agent.id
  type: keyword
- name: agent.name
  type: keyword
- name: agent.ephemeral_id
  type: keyword
  example: 12f376ef-5186-4e8b-a175-70f1140a8f30
- name: aws.billing.group_definition.key
  type: keyword
- name: aws.billing.start_date
  type: date
- name: aws.billing.group_definition.type
  type: keyword
- name: aws.billing.group_by.INSTANCE_TYPE
  type: keyword
- name: aws.billing.group_by.SERVICE
  type: keyword
```

gotext.tpl

```
{{- $currency := generate "aws.billing.currency" }}
{{- $groupBy := generate "aws.billing.group_definition.key" }}
{{- $period := generate "metricset.period" }}
{{- $cloudId := generate "cloud.account.id" }}
{{- $cloudRegion := generate "cloud.region" }}
{{- $timestamp := generate "timestamp" }}
{
    "@timestamp": "{{$timestamp.Format "2006-01-02T15:04:05.999999Z07:00"}}",
    "cloud": {
        "provider": "aws",
        "region": "{{$cloudRegion}}",
        "account": {
            "id": "{{$cloudId}}",
            "name": "{{generate "cloud.account.name"}}"
        }
    },
    "event": {
        "dataset": "aws.billing",
        "module": "aws",
        "duration": {{generate "event.duration"}}
    },
    "metricset": {
        "name": "billing",
        "period": {{$period}}
    },
    "ecs": {
        "version": "8.2.0"
    },
    "aws": {
        "billing": {
{{- if eq $groupBy "" }}
            "Currency": "{{$currency}}",
            "EstimatedCharges": {{generate "aws.billing.EstimatedCharges"}},
            "ServiceName": "{{generate "aws.billing.ServiceName"}}"
{{- else }}
{{- $sd := generate "aws.billing.start_date" }}
            "start_date": "{{ $sd.Format "2006-01-02T15:04:05.999999Z07:00" }}",
            "end_date": "{{ $sd | date_modify (print "+" $period "s") | date "2006-01-02T15:04:05.999999Z07:00" }}",
            "AmortizedCost": {
                "amount": {{printf "%.2f" (generate "aws.billing.AmortizedCost.amount")}},
                "unit": "{{$currency}}"
            },
            "BlendedCost": {
                "amount": {{printf "%.2f" (generate "aws.billing.BlendedCost.amount")}},
                "unit": "{{$currency}}"
            },
            "NormalizedUsageAmount": {
                "amount": {{generate "aws.billing.NormalizedUsageAmount.amount"}},
                "unit": "N/A"
            },
            "UnblendedCost": {
                "amount": {{printf "%.2f" (generate "aws.billing.UnblendedCost.amount")}},
                "unit": "{{$currency}}"
            },
            "UsageQuantity": {
                "amount": {{generate "aws.billing.UsageQuantity.amount"}},
                "unit": "N/A"
            },
            "group_definition": {
                "key": "{{$groupBy}}",
                "type": "{{generate "aws.billing.group_definition.type"}}"
            },
            "path": "{{generate "path"}}",
            "hostIP": "{{generate "hostIP"}}",
            "group_by": {
{{- if eq $groupBy "AZ"}}
                "AZ": "{{awsAZFromRegion $cloudRegion}}"
{{- else if eq $groupBy "INSTANCE_TYPE"}}
                "INSTANCE_TYPE": "{{generate "aws.billing.group_by.INSTANCE_TYPE"}}"
{{- else if eq $groupBy "SERVICE"}}
                "SERVICE": "{{generate "aws.billing.group_by.SERVICE"}}"
{{- else if eq $groupBy "LINKED_ACCOUNT"}}
                "LINKED_ACCOUNT": "{{$cloudId}}"
{{- end}}
            }
{{- end}}
        }
    },
    "service": {
        "type": "aws"
    },
    "agent": {
        "id": "{{generate "agent.id"}}",
        "name": "{{generate "agent.name"}}",
        "type": "metricbeat",
        "version": "8.0.0",
        "ephemeral_id": "{{generate "agent.ephemeral_id"}}"
    }
}
```

go run main.go generate-with-template ./gotext.tpl ./fields.yml --config-file ./configs.yml --tot-events 10

Relates: #141

aliabbas-elastic commented 2 months ago

@gpop63 It's good that we can now use formatting_pattern for IPs. However, I ran into an issue when hostIP had type ip instead of keyword. With keyword we can generate a value such as {ipv4}:{port}, but a field that holds only an IP address also has to be declared as keyword for the pattern to apply. Is this expected, and is it something we need to spell out for whoever writes the templates?

gpop63 commented 2 months ago

@aliabbas-elastic Currently I only added it for keyword type but we can add it for ip type as well. We would need some additional checks to only allow {ipv4}, {ipv6} and {port} pattern generators for ip. WDYT?
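
One way the additional checks mentioned above might look is sketched below. This is an assumption-laden illustration: `allowedGenerators` and `validatePattern` are hypothetical names, and the per-type allow-lists simply mirror what this thread says (all four generators for keyword; only {ipv4}, {ipv6}, and {port} for ip).

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholderRe extracts the generator name from each {generator} placeholder.
var placeholderRe = regexp.MustCompile(`\{([a-z0-9]+)\}`)

// allowedGenerators lists which pattern generators each field type may use.
// Hypothetical table based on the discussion, not taken from the tool itself.
var allowedGenerators = map[string]map[string]bool{
	"keyword": {"ipv4": true, "ipv6": true, "port": true, "string": true},
	"ip":      {"ipv4": true, "ipv6": true, "port": true},
}

// validatePattern returns an error if the field type does not support
// formatting patterns or if the pattern uses a generator not allowed for it.
func validatePattern(fieldType, pattern string) error {
	allowed, ok := allowedGenerators[fieldType]
	if !ok {
		return fmt.Errorf("formatting_pattern is not supported for type %q", fieldType)
	}
	for _, m := range placeholderRe.FindAllStringSubmatch(pattern, -1) {
		if !allowed[m[1]] {
			return fmt.Errorf("generator {%s} is not allowed for type %q", m[1], fieldType)
		}
	}
	return nil
}

func main() {
	// Accepted: only ip-compatible generators are used.
	fmt.Println(validatePattern("ip", "{ipv4}:{port}"))
	// Rejected: {string} is not in the ip allow-list.
	fmt.Println(validatePattern("ip", "/home/{string}"))
}
```

Running such a check when the config is loaded would surface a mismatch between field type and pattern before any events are generated.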

aliabbas-elastic commented 2 months ago

> @aliabbas-elastic Currently I only added it for keyword type but we can add it for ip type as well. We would need some additional checks to only allow {ipv4}, {ipv6} and {port} pattern generators for ip. WDYT?

@gpop63 Ok. I think it's fine for now as we are able to generate the required formats.

shmsr commented 2 months ago

@gpop63 @aliabbas-elastic

What do you think of this?

- name: hostIP
  cardinality: 25
  formatting_pattern: "{ipv4}:{port}|{ipv4}|{ipv6}"

This is just an example config. Suppose hostIP is of type ip (I understand the PR targets the keyword type, but the flexibility would be nice); by definition, it can hold either IPv4 or IPv6 addresses.

Why not have a split operator like the "|" I showed in the config? It gives us more flexibility. There could also be cases where the IP appears without a port as well as with one, and there could be more such cases for other types. Given that values for some types can be very dynamic, wouldn't it be nice to support this?

The code will look something like this:

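import (
    "fmt"
    "math/rand"
    "strings"
)

func replacePattern(pattern string) (string, error) {
    // Pick one of the "|"-separated alternatives at random.
    options := strings.Split(pattern, "|")
    chosenOption := options[rand.Intn(len(options))]

    // Map each placeholder to a generator. The bodies below are
    // illustrative stand-ins for the real generation logic.
    replacements := map[string]func() string{
        "{ipv4}": func() string {
            return fmt.Sprintf("%d.%d.%d.%d",
                rand.Intn(256), rand.Intn(256), rand.Intn(256), rand.Intn(256))
        },
        "{ipv6}": func() string {
            groups := make([]string, 8)
            for i := range groups {
                groups[i] = fmt.Sprintf("%x", rand.Intn(1<<16))
            }
            return strings.Join(groups, ":")
        },
        "{port}": func() string {
            return fmt.Sprintf("%d", rand.Intn(65536))
        },
        "{hostname}": func() string {
            return fmt.Sprintf("host-%d", rand.Intn(1000))
        },
    }

    // Replace one occurrence at a time so that a repeated placeholder
    // gets an independent value for each occurrence.
    for placeholder, replacementFunc := range replacements {
        for strings.Contains(chosenOption, placeholder) {
            chosenOption = strings.Replace(chosenOption, placeholder, replacementFunc(), 1)
        }
    }

    return chosenOption, nil
}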

chosenOption is picked at random from the "|"-separated alternatives, and the rest of the function fills in its placeholders. Also, we do not need to depend on regexp, as the patterns are simple; plain string matching is enough.
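
A minimal sketch of the regexp-free approach, assuming placeholders never nest: it scans the chosen alternative left to right, so each occurrence gets its own value, and it uses the error return to report an unknown or unterminated placeholder. The names `expand` and `expandAny` and the generator bodies are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// expand fills every {name} placeholder in s, left to right, using gens.
// It fails on an unknown generator name or a "{" with no closing "}".
func expand(s string, gens map[string]func() string) (string, error) {
	var out strings.Builder
	for {
		open := strings.IndexByte(s, '{')
		if open < 0 {
			out.WriteString(s) // no placeholders left
			return out.String(), nil
		}
		end := strings.IndexByte(s[open:], '}')
		if end < 0 {
			return "", fmt.Errorf("unterminated placeholder in %q", s)
		}
		name := s[open+1 : open+end]
		gen, ok := gens[name]
		if !ok {
			return "", fmt.Errorf("unknown generator {%s}", name)
		}
		out.WriteString(s[:open]) // literal text before the placeholder
		out.WriteString(gen())
		s = s[open+end+1:] // continue after the closing brace
	}
}

// expandAny picks one "|"-separated alternative at random, then expands it.
func expandAny(pattern string, gens map[string]func() string) (string, error) {
	options := strings.Split(pattern, "|")
	return expand(options[rand.Intn(len(options))], gens)
}

func main() {
	// Illustrative generators only.
	gens := map[string]func() string{
		"ipv4": func() string { return fmt.Sprintf("10.0.0.%d", rand.Intn(256)) },
		"port": func() string { return fmt.Sprint(rand.Intn(65536)) },
	}
	value, err := expandAny("{ipv4}:{port}|{ipv4}", gens)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(value)
}
```

Scanning the string directly also avoids depending on Go's unspecified map iteration order when several placeholders appear in one pattern.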