elastic / terraform-elastic-esf

Terraform - Elastic Serverless Forwarder

Remove need of duplicates between config.yaml and terraform variables #6

Closed: constanca-m closed this issue 1 month ago

constanca-m commented 2 months ago

Currently, to use these Terraform configuration files, we need to specify the inputs in config.yaml and then repeat the same ARNs in the variables of this repository.

Example: We have a config.yaml that looks like this:

"inputs":
- "id": "arn:aws:logs:eu-central-1:627286350134:log-group:constanca-benchmark-test-cloudwatch-lg:*"
  "outputs":
  - "args":
      "api_key": "..."
      "elasticsearch_url": "..."
      "es_datastream_name": "logs-esf.test-default"
    "type": "elasticsearch"
  "type": "cloudwatch-logs"

In this case, we have the CloudWatch Logs input arn:aws:logs:eu-central-1:627286350134:log-group:constanca-benchmark-test-cloudwatch-lg:*, so we need to declare it in our variables as well:

cloudwatch-logs = [{
  arn = "arn:aws:logs:eu-central-1:627286350134:log-group:constanca-benchmark-test-cloudwatch-lg:*"
}]

This is obviously not ideal, especially if the user is using many inputs.

I have thought of 3 options to remove the need for duplication.

Option 1: I believe this is the best option.

In this option we would need to:

  1. Create an S3 bucket to place the configuration file in:

     resource "aws_s3_bucket" "config-bucket" {
       bucket = "..."
     }

  2. Add a new variable:

     variable "inputs" {
       description = "List of inputs to ESF"
       type = list(object({
         type = string
         id   = string
         outputs = list(object({
           type = string
           args = list(object({
             elasticsearch_url  = string
             cloud_id           = string
             api_key            = string
             username           = string
             password           = string
             es_datastream_name = string
             batch_max_actions  = number
             batch_max_bytes    = number
           }))
         }))
       }))
       default = []
     }

  3. Place the inputs variable in a config.yaml file and upload it to our new S3 bucket (see the sketch after this list).
  4. Remove all other variables for the ARNs.
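
A minimal sketch of step 3, assuming the inputs variable above and the config-bucket resource from step 1; the object key is a placeholder:

# Sketch only: render the inputs variable to YAML and upload it to the config bucket.
resource "aws_s3_object" "config-file" {
  bucket  = aws_s3_bucket.config-bucket.id
  key     = "config.yaml"
  content = yamlencode({ inputs = var.inputs })
}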

With this option we:

Option 2: In this option we:

  1. Change the config type from YAML to JSON.
  2. Since the config is now JSON, we can use the aws_s3_object data source to get the content of the file - this only works if the object has a text or JSON content type.
  3. Use the content obtained to get the necessary ARNs (see the sketch after this list).
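
A minimal sketch of steps 2 and 3, assuming the JSON config has already been uploaded; the bucket and key names are placeholders:

# Sketch only: read the JSON config from an existing bucket and decode it.
# The body attribute is only populated for human-readable content types.
data "aws_s3_object" "esf-config" {
  bucket = "my-config-bucket"
  key    = "config.json"
}

locals {
  esf_config = jsondecode(data.aws_s3_object.esf-config.body)
}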

With this option:

Option 3: In this option we:

  1. Locally download the configuration file from the S3 bucket.
  2. Iterate over its content to obtain the ARNs (see the sketch after this list).
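
A minimal sketch of step 2, assuming the YAML config has already been downloaded (for example with aws s3 cp) to a local path; the path is a placeholder and the filter mirrors the example config above:

# Sketch only: parse the downloaded config and collect the CloudWatch Logs ARNs.
locals {
  esf_config = yamldecode(file("${path.module}/config.yaml"))

  cloudwatch_log_arns = [
    for input in local.esf_config.inputs : input.id
    if input.type == "cloudwatch-logs"
  ]
}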

With this option:

gizas commented 2 months ago

Great details @constanca-m and thanks for the issue. Adding some thoughts:

  1. I prefer option 1 at the moment because it is cleaner and requires fewer changes to the current code.
  2. For option 1, you add a new list. Because this list will need to cover different scenarios, we need to check whether the list can have non-mandatory fields. Otherwise we might need to create different kinds of lists and iterate over all of them?
  3. For option 2, we need to check whether the YAML file is also used in the Python code of the Lambda.
  4. Besides that, my feeling is that YAML is more user friendly than JSON for placing config. Another argument for using only option 1.
  5. Regarding "Remove the need for the user to create its own S3 bucket and place their config.yaml file": giving the user the option to have an S3 bucket with the config and not create it with Terraform is a "feature". I mean we can keep this conditional even in option 1, to allow users to provide an existing S3 bucket.

constanca-m commented 2 months ago

Because this list will need to cover different scenarios, we need to check whether the list can have non-mandatory fields. Otherwise we might need to create different kinds of lists and iterate over all of them?

I checked, and Terraform has optional object type attributes. I noticed a while ago that we are actually already using them.

We could (or should) also add validation rules to check the requirements:

  1. Either elasticsearch_url or cloud_id; elasticsearch_url takes precedence if both are included.
  2. Either api_key or username/password; username/password takes precedence if both are included.
  3. The output type (elasticsearch, logstash, ...).
  4. The input type (cloudwatch, sqs, ...).

Like this:

variable "inputs" {
  description = "List of inputs to ESF"
  type = list(object({
    type = string
    id   = string
    outputs = list(object({
      type = string
      args = list(object({
        elasticsearch_url  = optional(string)
        cloud_id           = optional(string)
        api_key            = optional(string)
        username           = optional(string)
        password           = optional(string)
        es_datastream_name = string
        batch_max_actions  = optional(number, 500)
        batch_max_bytes    = optional(number, 10485760)
      }))
    }))
  }))
  default = []

  validation {
    condition = alltrue([
      for input in var.inputs : alltrue([
        for output in input.outputs : alltrue([
          for args in output.args : alltrue([
            args.elasticsearch_url == null && args.cloud_id == null ? false : true
          ])
        ])
      ])
    ])
    error_message = "All outputs must contain elasticsearch_url or cloud_id. elasticsearch_url takes precedence if both are included."
  }
}

Example of usage:

inputs = [
  {
    type = "type1"
    id   = "id1"
    outputs = [
      {
        type = "output1"
        args = [
          {
            #elasticsearch_url = "url"
            #cloud_id          = "cloudid"
            api_key           = "apikey"
            username          = "username"
            password          = "password"
            es_datastream_name = "es-default"
          }
        ]
      }
    ]
  }
]

Fails with:

╷
│ Error: Invalid value for variable
│ 
│   on variables.tf line 85:
│   85: variable "inputs" {
│     ├────────────────
│     │ var.inputs is list of object with 1 element
│ 
│ All outputs must contain elasticsearch_url or cloud_id. elasticsearch_url takes precedence if both are included.
│ 
│ This was checked by the validation rule at variables.tf:106,3-13.

I mean we can keep this conditional even in option 1, to allow users to provide an existing S3 bucket.

The issue with this is that we would then have to apply either option 2 or 3, because we need to obtain the ARNs in some way. If the file remained YAML, it would be option 3.

Sorry, I actually misunderstood. With an existing S3 bucket we can also use aws_s3_object to upload the file. Good option :+1: @gizas

gizas commented 2 months ago

Thank you!!!

In this:

type = list(object({
    type = string
    id   = string

The id equals the ARN. The field name is not so descriptive, but it is minor.

The issue with this is that we would then have to apply either option 2 or 3, because we need to obtain the ARNs

Indeed, option 1 becomes equal to option 3. I would say it is valuable to have both 1 and 3 as options for users, with a flag or similar to distinguish whether the user wants to create the S3 bucket or not. What do you think?

constanca-m commented 2 months ago

Indeed, option 1 becomes equal to option 3. I would say it is valuable to have both 1 and 3 as options for users, with a flag or similar to distinguish whether the user wants to create the S3 bucket or not. What do you think?

I was wrong when I wrote that. The S3 bucket can exist, and we should consider that. I updated the comment to fix my mistake @gizas

constanca-m commented 2 months ago

I am taking a deeper look at option 1, but maybe having that variable is too confusing.

The variable would have to be:

type = list(object({
    type = string
    id   = string
    outputs = list(object({
      type = string
      args = list(object({
        elasticsearch_url      = optional(string)
        logstash_url           = optional(string)
        cloud_id               = optional(string)
        api_key                = optional(string)
        username               = optional(string)
        password               = optional(string)
        es_datastream_name     = string
        batch_max_actions      = optional(number, 500)
        batch_max_bytes        = optional(number, 10485760)
        ssl_assert_fingerprint = optional(string)
        compression_level      = optional(string)
      }))
    }))
  }))

So if we have 2 inputs, this could look like this:

inputs = [
  {
    type = "cloudwatch-logs"
    id   = "id1"
    outputs = [
      {
        type = "elasticsearch"
        args = [
          {
            elasticsearch_url  = "url"
            api_key            = "apikey"
            es_datastream_name = "es-default"
          }
        ]
      }
    ]
  },
  {
    type = "cloudwatch-logs"
    id   = "id2"
    outputs = [
      {
        type = "elasticsearch"
        args = [
          {
            elasticsearch_url  = "url"
            api_key            = "apikey"
            es_datastream_name = "es-default"
          }
        ]
      }
    ]
  }
]

Would it just be easier to have the config.yaml locally along with the Terraform files? This way, we can have a variable config_path and read the file with file(var.config_path).
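
A minimal sketch of that idea, assuming the config-bucket resource from option 1; the variable name config_path comes from the comment above and the object key is a placeholder:

variable "config_path" {
  description = "Path to the local config.yaml file"
  type        = string
}

# Sketch only: upload the local file so the ESF Lambda can read it from the bucket.
resource "aws_s3_object" "esf-config" {
  bucket  = aws_s3_bucket.config-bucket.id
  key     = "config.yaml"
  content = file(var.config_path)
}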

Pros of this approach:

Cons:

axw commented 2 months ago

I also like option 1, and if we can, I think it would be preferable to not require a separate config.yaml.

I also tend to agree with @gizas that if we could optionally support config.yaml that would be great -- that would provide an escape hatch in case the user wants to do something unusual. I guess this gets complicated though, because then we would also need to optionally support specifying IAM privileges?

Just a thought: would it make sense to provide a simpler configuration for Terraform, such that you could only specify one output, but many inputs?

So you would have separate variables for elasticsearch_url, logstash_url, etc., and (say) cloudwatch_log_group_arns. That last one would be used to build the input list, and the output would be identical for every input apart from es_datastream_name (which I'm not sure off hand how we would specify).
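
A minimal sketch of that idea, with hypothetical variable names and a single Elasticsearch output shared by every CloudWatch Logs input:

variable "elasticsearch_url" {
  type = string
}

variable "api_key" {
  type      = string
  sensitive = true
}

variable "es_datastream_name" {
  type = string
}

variable "cloudwatch_log_group_arns" {
  type    = list(string)
  default = []
}

locals {
  # Sketch only: build one input per log group, all sharing the same output.
  inputs = [
    for arn in var.cloudwatch_log_group_arns : {
      type = "cloudwatch-logs"
      id   = arn
      outputs = [{
        type = "elasticsearch"
        args = {
          elasticsearch_url  = var.elasticsearch_url
          api_key            = var.api_key
          es_datastream_name = var.es_datastream_name
        }
      }]
    }
  ]
}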

constanca-m commented 2 months ago

I guess this gets complicated though, because then we would also need to optionally support specifying IAM privileges?

I don't think it would get complicated in that sense. We always need to grant s3:GetObject because the Lambda itself needs the config.yaml file in the bucket. If it is not there, we will upload it.
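
A minimal sketch of that permission, assuming the config bucket from option 1; the policy document name is a placeholder:

# Sketch only: allow the ESF Lambda role to read the config object.
data "aws_iam_policy_document" "esf-config-read" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.config-bucket.arn}/config.yaml"]
  }
}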

Regarding:

I think it would be preferable to not require a separate config.yaml.

Does this mean we should add a variable, @axw? I am leaning towards having a config.yaml file locally that we will upload to the S3 bucket, mainly to keep the logic, but I also like the variable.

would it make sense to provide a simpler configuration for Terraform, such that you could only specify one output, but many inputs

It could be an option. I would have to think about how that would work though... It also means the logic would be different from the one we used for config.yaml, and I don't know if it is a good idea to do that.

Maybe it would be best to tackle that in a different issue... and change the code directly to allow us to have shared outputs, instead of doing that just for Terraform. What do you think?

axw commented 2 months ago

I guess this gets complicated though, because then we would also need to optionally support specifying IAM privileges?

I don't think it would get complicated in that sense. We always need to grant s3:GetObject because the Lambda itself needs the config.yaml file in the bucket. If it is not there, we will upload it.

What I meant was that if we support an arbitrary config.yaml, then we would need to work out which resources ESF is allowed to receive data from -- Kinesis data streams, SQS queues, etc. At the moment you have to specify this as a variable. So we would either have to keep that, or parse them from the config file. Right?

Looks like in https://github.com/elastic/terraform-elastic-esf/pull/7 you're parsing the local config file. I think this is a fine approach.

axw commented 2 months ago

would it make sense to provide a simpler configuration for Terraform, such that you could only specify one output, but many inputs

It could be an option. I would have to think about how that would work though... It also means the logic would be different from the one we used for config.yaml, and I don't know if it is a good idea to do that.

Maybe it would be best to tackle that in a different issue... and change the code directly to allow us to have shared outputs, instead of doing that just for Terraform. What do you think?

Fair enough. I was just thinking we could support only a limited set of functionality to start with, and keep the Terraform module simpler. But probably best to keep these things separate as you say.

constanca-m commented 2 months ago

then we would need to work out which resources ESF is allowed to receive data from -- Kinesis data streams, SQS queues, etc

Yes, the variable also has a validation rule on which input types it accepts. But I believe the ESF code checks that as well.
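
A minimal sketch of such a rule, to go inside the inputs variable above; the allowed values mirror the ESF input types and should be checked against the ESF documentation:

validation {
  condition = alltrue([
    for input in var.inputs :
    contains(["cloudwatch-logs", "sqs", "s3-sqs", "kinesis-data-stream"], input.type)
  ])
  error_message = "Input type must be one of: cloudwatch-logs, sqs, s3-sqs, kinesis-data-stream."
}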