astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0

Expose database connection details to K8sPods when using `ExecutionMode.KUBERNETES` #749

Open tatiana opened 11 months ago

tatiana commented 11 months ago

Context

One of Cosmos' popular features lets users define database connections as Airflow connections, from which Cosmos generates the dbt profile, so users do not need to manage sensitive information in two places. This is accomplished via profile mapping classes, as described in: https://astronomer.github.io/astronomer-cosmos/profiles/index.html#using-a-profile-mapping

Unfortunately, this feature works for ExecutionMode.LOCAL and ExecutionMode.VIRTUALENV, but not for ExecutionMode.DOCKER and ExecutionMode.KUBERNETES. This was a limitation that was discussed when these execution modes were introduced, and the workaround is for the end-users to manage this themselves by having a dbt profiles.yml file baked into the container image and setting sensitive information in the way they prefer (such as via Kubernetes secrets).

Since ExecutionMode.KUBERNETES is more popular than ExecutionMode.DOCKER, this ticket aims to discuss and review if and how we could improve this. There are two key questions:

i) Should Cosmos support creating the profiles.yml for the end-user when using ExecutionMode.KUBERNETES?

ii) How would Cosmos expose the file itself and the sensitive information in case we decide to do (i)?

Some possibilities

i) Creating/exposing profiles.yml

When using Cosmos Local operators, we already create this file when users use a profile mapping. We could do the same for K8s.

The difference would be that we'd need to expose the created file to K8s. A way to do this from Airflow is to use volumes, as illustrated in:

Are we happy for Cosmos to set up this volume in users' K8s Pods? How much control should users have to configure the volumes used for this purpose?
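To make the volume question concrete, here is a minimal sketch of what a generated Pod manifest could look like, written as a plain Python dict. It assumes the rendered profiles.yml is already stored in some volume source chosen by the user (a ConfigMap named `dbt-profiles` in the usage below). The function name, the volume name, the image, and the mount path are all hypothetical and not part of Cosmos; only `DBT_PROFILES_DIR` is a real dbt setting.

```python
from typing import Any


def pod_with_profiles_volume(
    image: str,
    volume_source: dict[str, Any],
    profiles_dir: str = "/opt/dbt/profiles",  # hypothetical mount path
) -> dict[str, Any]:
    """Return a Pod manifest (plain dict) that mounts a directory holding profiles.yml."""
    volume_name = "dbt-profiles"  # hypothetical volume name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "dbt-run"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "dbt",
                    "image": image,
                    # dbt reads DBT_PROFILES_DIR to locate profiles.yml.
                    "env": [{"name": "DBT_PROFILES_DIR", "value": profiles_dir}],
                    "volumeMounts": [
                        {
                            "name": volume_name,
                            "mountPath": profiles_dir,
                            "readOnly": True,
                        }
                    ],
                }
            ],
            # The caller decides where the file comes from (configMap, secret, ...).
            "volumes": [{"name": volume_name, **volume_source}],
        },
    }


manifest = pod_with_profiles_volume(
    image="my-dbt-image:latest",  # hypothetical image
    volume_source={"configMap": {"name": "dbt-profiles"}},
)
```

Passing the volume source in as a parameter is one possible answer to the control question: Cosmos would own the mount path and the environment variable, while users keep the choice of volume type.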

ii) Exposing sensitive information

(a) Kubernetes allows users to set environment variables during the Pod creation, and this could be set via Airflow:

apiVersion: v1
kind: Pod
metadata:
  name: envar-demo
  labels:
    purpose: demonstrate-envars
spec:
  containers:
  - name: envar-demo-container
    image: gcr.io/google-samples/node-hello:1.0
    env:
    - name: DEMO_GREETING
      value: "Hello from the environment"
    - name: DEMO_FAREWELL
      value: "Such a sweet sorrow"

https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/

However, this approach exposes sensitive information in plain text in the Pod specification (visible, for example, to anyone who can describe the Pod), which raises security concerns.
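As a rough illustration of option (a), the sketch below splits a profile into a non-sensitive part, which could be written to profiles.yml, and a list of Pod environment variables. It relies on dbt's `env_var` Jinja function to read the values at runtime. The function name `split_profile`, the `COSMOS_CONN_` prefix, and the sample connection values are assumptions made for this example, not existing Cosmos behaviour.

```python
def split_profile(
    profile: dict[str, str],
    secret_fields: set[str],
    prefix: str = "COSMOS_CONN_",  # hypothetical prefix
) -> tuple[dict[str, str], list[dict[str, str]]]:
    """Replace secret fields with env_var() references and collect the Pod env entries."""
    rendered: dict[str, str] = {}
    env: list[dict[str, str]] = []
    for key, value in profile.items():
        if key in secret_fields:
            env_name = f"{prefix}{key.upper()}"
            # The plain value still lands in the Pod spec with this approach.
            env.append({"name": env_name, "value": value})
            rendered[key] = "{{ env_var('" + env_name + "') }}"
        else:
            rendered[key] = value
    return rendered, env


rendered, env = split_profile(
    {"type": "postgres", "host": "db.example.com", "user": "dbt", "password": "s3cret"},
    secret_fields={"password"},
)
```

With this split, profiles.yml itself no longer contains the password, but the `env` list does, which is exactly the exposure described above.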

(b) A more secure approach is usually to create Kubernetes Secrets and make those available to pods. This is illustrated in:

This would mean Cosmos creates Kubernetes Secrets itself and could potentially override other Secrets managed by the end-user. Would users be happy with this approach? Should Cosmos delete the Secrets afterwards?
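For option (b), a minimal sketch of the two pieces involved: a Secret manifest holding the sensitive values, and container `env` entries that reference it through `secretKeyRef` instead of embedding the values. Both are plain dicts that mirror the Kubernetes manifest format. The helper names and the Secret name `cosmos-dbt-run-1234` are hypothetical; a per-run name like this is one conceivable way to avoid clashing with user-managed Secrets, not a decided design.

```python
import base64
from typing import Any


def build_secret(name: str, values: dict[str, str]) -> dict[str, Any]:
    """Build an Opaque Secret manifest; Kubernetes expects base64-encoded data values."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


def env_from_secret(secret_name: str, keys: list[str]) -> list[dict[str, Any]]:
    """Build container env entries that read each key from the Secret at runtime."""
    return [
        {
            "name": key,
            "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
        }
        for key in keys
    ]


secret = build_secret("cosmos-dbt-run-1234", {"COSMOS_CONN_PASSWORD": "s3cret"})
env = env_from_secret("cosmos-dbt-run-1234", ["COSMOS_CONN_PASSWORD"])
```

The Pod spec then carries only a reference to the Secret, so the password does not appear in it. Whoever creates the Secret would still need to decide when to delete it.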

We welcome your thoughts!

dosubot[bot] commented 11 months ago

🤖

Hello @tatiana,

It's great to see you back! I hope you've been doing well. Thank you for this detailed issue report with your thoughts and suggestions. Let's dive into it.

Based on the context provided, it seems that Cosmos should indeed support creating the profiles.yml for the end-user when using ExecutionMode.KUBERNETES. This is because the profiles.yml file contains sensitive information such as the host, http_path, and token, which are required for connecting to the database.

The best way to expose the file and the sensitive information is by creating Kubernetes Secrets. Kubernetes Secrets are designed to store and manage sensitive information, such as passwords, OAuth tokens, and ssh keys. Storing confidential information in a Secret is safer and more flexible than putting it verbatim in a Pod definition or in a container image.

Here is an example of how you can create a Kubernetes Secret for the profiles.yml file:

kubectl create secret generic dbt-profiles --from-file=profiles.yml=./profiles.yml

And then you can mount the secret as a volume in your Pod:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: myimage
    volumeMounts:
    - name: dbt-profiles-volume
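      # mountPath is a directory: each Secret key becomes a file inside it,
      # so the file ends up at /path/to/profiles/profiles.yml
      mountPath: "/path/to/profiles"
      readOnly: true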
  volumes:
  - name: dbt-profiles-volume
    secret:
      secretName: dbt-profiles

This way, the profiles.yml file will be available in your Pod, and the sensitive information will be securely stored in the Kubernetes Secret.

In the Cosmos codebase, the profiles.yml file for ExecutionMode.LOCAL and ExecutionMode.VIRTUALENV is handled by the ProfileConfig class in the cosmos/config.py file.

For these two execution modes, the profiles.yml file can be provided in two ways:

  1. Using a user-supplied profiles.yml file: If the profiles_yml_filepath attribute is set, the ProfileConfig class uses this file directly. This is validated in the validate_profiles_yml method, which checks if the file exists at the given path.

  2. Using Cosmos to map Airflow connections to dbt profiles: If the profile_mapping attribute is set to a subclass of BaseProfileMapping, Cosmos generates the profiles.yml file. This is done in the ensure_profile method, which creates a temporary profiles.yml file with the contents returned by the get_profile_file_contents method of the BaseProfileMapping subclass.

In both cases, the ProfileConfig class validates that either profiles_yml_filepath or profile_mapping is set, but not both, as they are mutually exclusive. This is done in the validate_profile method.
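The "exactly one of the two" rule described above can be sketched as follows. This is a simplified stand-in written for illustration, not the actual ProfileConfig implementation: the class name `ProfileSourceSketch` is hypothetical, and the real class may validate differently (for example, it also checks that the file exists).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileSourceSketch:
    """Hypothetical stand-in that enforces exactly one profile source."""

    profiles_yml_filepath: Optional[str] = None
    profile_mapping: Optional[object] = None

    def __post_init__(self) -> None:
        has_file = self.profiles_yml_filepath is not None
        has_mapping = self.profile_mapping is not None
        # Equal flags mean either both sources or neither source was given.
        if has_file == has_mapping:
            raise ValueError(
                "Set exactly one of profiles_yml_filepath or profile_mapping."
            )


# Valid: only a user-supplied file is given.
config = ProfileSourceSketch(profiles_yml_filepath="/path/to/profiles.yml")
```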

You can find more details in the cosmos/config.py file.

I hope this helps! If you have any other questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.