airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

Azure OAuth CSRF State Not Equal Error #678

Closed ahipp13 closed 1 year ago

ahipp13 commented 1 year ago

Checks

Chart Version

8.6.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5-gke.1300", GitCommit:"90a16981ade07f163a0233adb631b42ac1fc53ff", GitTreeState:"clean", BuildDate:"2021-10-06T09:26:44Z", GoVersion:"go1.16.7b7", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"clean", GoVersion:"go1.14.11"}

Description

We are currently running Airflow 2.4.3 on Kubernetes with the Airflow community helm chart, version 8.6.1, with an external Postgres database as our metadata DB. We have enabled Microsoft Azure OAuth for our Airflow deployment, and when we try to log in we get a CSRF state mismatch error (logs below). I have posted about this problem everywhere without success, so I am checking whether anyone else using this helm chart has seen it. The webserver_config.py we use to configure OAuth is below:

from flask_appbuilder.security.manager import AUTH_OAUTH
from airflow.www.security import AirflowSecurityManager
import logging
import os
import sys

# Add /opt/airflow to Python's module path so this file is importable as the webserver_config module
sys.path.append('/opt/airflow')

log = logging.getLogger(__name__)
log.setLevel(os.getenv("AIRFLOW__LOGGING__FAB_LOGGING_LEVEL", "DEBUG"))

class AzureCustomSecurity(AirflowSecurityManager):
    # In this example, the oauth provider == 'azure'.
    # If you ever want to support other providers, see how it is done here:
    # https://github.com/dpgaspar/Flask-AppBuilder/blob/master/flask_appbuilder/security/manager.py#L550
    def get_oauth_user_info(self, provider, resp):
        # Creates the user info payload from Azure.
        # The user previously allowed your app to act on their behalf,
        #   so now we can query the user and teams endpoints for their data.
        # Username and team membership are added to the payload and returned to FAB.
        if provider == "azure":
            log.debug("Azure response received : {0}".format(resp))
            id_token = resp["id_token"]
            log.debug(str(id_token))
            me = self._azure_jwt_token_parse(id_token)
            log.debug("Parse JWT token : {0}".format(me))
            return {
                "name": me.get("name", ""),
                "email": me["upn"],
                "first_name": me.get("given_name", ""),
                "last_name": me.get("family_name", ""),
                "id": me["oid"],
                "username": me["oid"],
                "role_keys": me.get("roles", []),
            }

# Without this, the redirect URL would start with http; we want https (we sit behind a TLS-terminating proxy)
os.environ["AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX"] = "True"
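# [NOTE] the two flags below disable Flask-WTF's form CSRF protection;
# they do not affect the separate OAuth "state" check that Authlib performs,
# which is what raises the mismatching_state error reported in this issue.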
WTF_CSRF_ENABLED = False
CSRF_ENABLED = False
AUTH_TYPE = AUTH_OAUTH
AUTH_ROLES_SYNC_AT_LOGIN = True  # Checks roles on every login
# Make sure to replace this with the path to your security manager class
FAB_SECURITY_MANAGER_CLASS = "webserver_config.AzureCustomSecurity"
# a mapping from the values of `userinfo["role_keys"]` to a list of FAB roles
AUTH_ROLES_MAPPING = {
    "airflow_dev_admin": ["Admin"],
    "airflow_dev_op": ["Op"],
    "airflow_dev_user": ["User"],
    "airflow_dev_viewer": ["Viewer"]
    }
# force users to re-auth after 30min of inactivity (to keep roles in sync)
PERMANENT_SESSION_LIFETIME = 1800
# If you wish, you can add multiple OAuth providers.
OAUTH_PROVIDERS = [
    {
        "name": "azure",
        "icon": "fa-windows",
        "token_key": "access_token",
        "remote_app": {
            "client_id": "CLIENT_ID",
            "client_secret": 'AZURE_DEV_CLIENT_SECRET',
            "api_base_url": "https://login.microsoftonline.com/TENANT_ID",
            "request_token_url": None,
            'request_token_params': {
                'scope': 'openid email profile'
            },
            "access_token_url": "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token",
            "access_token_params": {
                'scope': 'openid email profile'
            },
            "authorize_url": "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/authorize",
            "authorize_params": {
                'scope': 'openid email profile',
            },
            'jwks_uri':'https://login.microsoftonline.com/common/discovery/v2.0/keys',
        },
    },
]

Relevant Logs

[2022-11-28 22:04:58,744] {views.py:659} ERROR - Error authorizing OAuth access token: mismatching_state: CSRF Warning! State not equal in request and response.

Custom Helm Values

########################################
## CONFIG | Airflow Configs
########################################
containerSecurityContext:
  capabilities:
    drop:
      - KILL
      - MKNOD
      - SYS_CHROOT

basicProbes:
  livenessProbe:
    exec:
      command:
      - cat
      - /etc/os-release
    initialDelaySeconds: 5
    periodSeconds: 5
  readinessProbe:
    exec:
      command:
      - cat
      - /etc/os-release
    initialDelaySeconds: 5
    periodSeconds: 5

airflow:

  ## configs for the airflow container image
  ##
  image:
    repository: OUR DOCKER REPO
    tag: DOCKER_TAG

    gid: 50000

  ## the airflow executor type to use
  ## - allowed values: "CeleryExecutor", "KubernetesExecutor", "CeleryKubernetesExecutor"
  ## - customize the "KubernetesExecutor" pod-template with `airflow.kubernetesPodTemplate.*`
  ##
  executor: KubernetesExecutor

  config:
    AIRFLOW_HOME: "/opt/airflow"
    AIRFLOW__LOGGING__LOGGING_LEVEL: "DEBUG"
    AIRFLOW__CORE__LOAD_EXAMPLES: "False"
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: 10
    AIRFLOW__WEBSERVER__BASE_URL: "URL"
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS: log_config.DEFAULT_LOGGING_CONFIG
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "wasb-airflow-logs"
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "az-conn"
    AIRFLOW__WEBSERVER__SESSION_LIFETIME_MINUTES: "5"
    AIRFLOW__WEBSERVER__WORKERS: "1"
    # Azure key vault env variables
    AZURE_TENANT_ID: "TENANT ID"
    AZURE_CLIENT_ID: "CLIENT ID"
    AZURE_CLIENT_SECRET: "AZURE_TERRAFORM_DEV_CLIENT_SECRET"
    AZURE_KEY_VAULT_URI: KEY VAULT URI

  usersUpdate: false

  defaultSecurityContext:
    ## sets the filesystem owner group of files/folders in mounted volumes
    ## this does NOT give root permissions to Pods, only the "root" group
    runAsUser: 50000
    runAsGroup: 50000
    fsGroup: 50000

    ## resource requests/limits for the Pod template "base" container
    ## - spec for ResourceRequirements:
    ##   https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#resourcerequirements-v1-core
    ##
    resources:
      requests:
        memory: "4000Mi"
        cpu: "3000m"  
      limits:
        memory: "8000Mi"
        cpu: "4000m"

  ########################################
  ## COMPONENT | db-migrations Deployment
  ########################################
  dbMigrations:
    ## if the db-migrations Deployment/Job is created
    ## - [WARNING] if `false`, you have to MANUALLY run `airflow db upgrade` when required
    ##
    enabled: true

    livenessProbe:
      enabled: true
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 60
      failureThreshold: 5

    readinessProbe:
      enabled: true
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 60
      failureThreshold: 5

    resources:
      requests:
        memory: "500Mi"
        cpu: "250m"  
      limits:
        memory: "750Mi"
        cpu: "500m"

  ########################################
  ## COMPONENT | Sync Deployments
  ########################################
  ## - used by the Deployments/Jobs used by `airflow.{connections,pools,users,variables}`
  ##
  sync:
    ## resource requests/limits for the sync Pods
    ## - spec for ResourceRequirements:
    ##   https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#resourcerequirements-v1-core
    ##
    resources:
      requests:
        memory: "250Mi"
        cpu: "250m"  
      limits:
        memory: "500Mi"
        cpu: "500m"

###################################
## COMPONENT | Airflow Scheduler
###################################
scheduler:
  ## the number of scheduler Pods to run
  ## - if you set this >1 we recommend defining a `scheduler.podDisruptionBudget`
  ##
  replicas: 1

  ## resource requests/limits for the scheduler Pod
  ## - spec of ResourceRequirements:
  ##   https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#resourcerequirements-v1-core
  ##
  resources:
    requests:
      memory: "2500Mi"
      cpu: "1000m"  
    limits:
      memory: "3000Mi"
      cpu: "1500m"

  logCleanup:
    ## if the log-cleanup sidecar is enabled
    ## - [WARNING] must be disabled if `logs.persistence.enabled` is `true`
    ##
    enabled: false

  readinessProbe:
    enabled: true
    initialDelaySeconds: 10
    periodSeconds: 30
    timeoutSeconds: 60
    failureThreshold: 5

###################################
## COMPONENT | Airflow Workers
###################################
workers:
  ## if the airflow workers StatefulSet should be deployed
  ##
  enabled: false

  logCleanup:
    ## if the log-cleanup sidecar is enabled
    ## - [WARNING] must be disabled if `logs.persistence.enabled` is `true`
    ##
    enabled: false

###################################
## COMPONENT | Triggerer
###################################
triggerer:

  resources:
    requests:
      memory: "2500Mi"
      cpu: "1000m"  
    limits:
      memory: "3000Mi"
      cpu: "1500m"

  readinessProbe:
    enabled: true
    initialDelaySeconds: 10
    periodSeconds: 30
    timeoutSeconds: 60
    failureThreshold: 5

###################################
## CONFIG | Airflow Logs
###################################
logs:

  persistence:
    ## if a persistent volume is mounted at `logs.path`
    ##
    enabled: true

    storageClass: "nas-thin"

  ## configs for the git-sync sidecar (https://github.com/kubernetes/git-sync)
  ##
  gitSync:
    ## if the git-sync sidecar container is enabled
    ##
    enabled: true

    ## configs for the git-sync container's readiness probe
    ##
    readinessProbe:
      enabled: true

    ## configs for the git-sync container's liveness probe
    ##
    livenessProbe:
      enabled: true

    ## the git-sync container image
    ##
    image:
      #registry: docker.repo1.uhc.com
      repository: OUR DOCKER GITHUB SYNC REPO
      tag: v3.5.0
      pullPolicy: IfNotPresent
      uid: 65533
      gid: 65533

    ## resource requests/limits for the git-sync container
    ## - spec for ResourceRequirements:
    ##   https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#resourcerequirements-v1-core
    ##
    resources:
      requests:
        memory: "250Mi"
        cpu: "250m"  
      limits:
        memory: "500Mi"
        cpu: "500m"

    ## the url of the git repo
    ##
    ## ____ EXAMPLE _______________
    ##   # https git repo
    ##   repo: "https://github.com/USERNAME/REPOSITORY.git"
    ##
    ## ____ EXAMPLE _______________
    ##   # ssh git repo
    ##   repo: "git@github.com:USERNAME/REPOSITORY.git"
    ##
    repo: OUR GITHUB REPO

    ## the sub-path within your repo where dags are located
    ## - only dags under this path within your repo will be seen by airflow,
    ##   (note, the full repo will still be cloned)
    ##
    repoSubPath: "/dags"

    ## the git branch to check out
    ##
    branch: main

    sshSecretKey: ""

###################################
## DATABASE | PgBouncer
###################################
pgbouncer:
  ## if the pgbouncer Deployment is created
  ##
  enabled: false

###################################
## DATABASE | Embedded Postgres
###################################
postgresql:
  ## if the `stable/postgresql` chart is used
  ## - [WARNING] the embedded Postgres is NOT SUITABLE for production deployments of Airflow
  ## - [WARNING] consider using an external database with `externalDatabase.*`
  ## - set to `false` if using `externalDatabase.*`
  ##
  enabled: false

###################################
## DATABASE | External Database
###################################
externalDatabase:

  ## the host of the external database
  ##
  host: DATABASE_HOST

  database: airflow_db

  ## the username for the external database
  ##
  user: airflow_user

  password: "AIRFLOW_DEV_DB_PASSWORD"

  passwordSecretKey: ""

###################################
## DATABASE | Embedded Redis
###################################
redis:
  ## if the `stable/redis` chart is used
  ## - set to `false` if `airflow.executor` is `KubernetesExecutor`
  ## - set to `false` if using `externalRedis.*`
  ##
  enabled: false

thesuperzapper commented 1 year ago

@ahipp13 I doubt this is an issue with the helm chart; I recommend raising an issue on Flask-AppBuilder (https://github.com/dpgaspar/Flask-AppBuilder) or on Airflow itself.

ahipp13 commented 1 year ago

Hi @thesuperzapper, thank you for responding.

I have already posted on Airflow itself, as well as on Flask-AppBuilder and Authlib. The Airflow maintainers told me to post here, so I did. I just wanted to get everybody's eyes on this so that I can eliminate some possibilities.

If you are confident this is not a chart issue, that eliminates one possibility, so I appreciate your response. Thank you!

thesuperzapper commented 1 year ago

@ahipp13 Can you link those other issues (for posterity's sake)?

Also, I thought that Azure was natively supported by Flask-AppBuilder now, so why do you need a custom get_oauth_user_info() implementation?
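
For comparison, here is a minimal sketch (an assumption-laden sketch, not a tested config) of what a webserver_config relying on FAB's built-in "azure" handling might look like, with no custom security manager; the CLIENT_ID/TENANT_ID placeholders are carried over from the config above:

from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH
AUTH_ROLES_SYNC_AT_LOGIN = True
# No FAB_SECURITY_MANAGER_CLASS: FAB's default get_oauth_user_info()
# already has a branch for the provider name "azure".
OAUTH_PROVIDERS = [
    {
        "name": "azure",
        "icon": "fa-windows",
        "token_key": "access_token",
        "remote_app": {
            "client_id": "CLIENT_ID",
            "client_secret": "CLIENT_SECRET",
            "api_base_url": "https://login.microsoftonline.com/TENANT_ID",
            "request_token_url": None,
            "access_token_url": "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token",
            "authorize_url": "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/authorize",
            "client_kwargs": {"scope": "openid email profile"},
            "jwks_uri": "https://login.microsoftonline.com/common/discovery/v2.0/keys",
        },
    },
]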

Regarding the CSRF issue itself, my 2 cents are that:

ahipp13 commented 1 year ago

Here are the links to the other issues I have opened:

Airflow: https://github.com/apache/airflow/discussions/28098
FAB: https://github.com/dpgaspar/Flask-AppBuilder/issues/1957
Authlib: https://github.com/lepture/authlib/issues/518

The reason for the custom user info method is simply that this is how we had it working before it broke, so we kept it. I never thought of removing it; that is something I can try, although I do not think it will change anything.

Updating Airflow to 2.4.3 means using a newer version of FAB, which in turn now uses Authlib. I did a lot of painful debugging yesterday and think the problem is coming from Authlib: for some reason the state arrives in Authlib as None, and that is what triggers the error.
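
(To illustrate what that means, here is a minimal sketch of the OAuth state round trip, not Authlib's actual code, with all names hypothetical: the state generated before the redirect is stored in the session cookie, and if that cookie is missing or different on the callback, the stored value reads back as None and the comparison fails.)

# Minimal illustration of the OAuth state round trip (not Authlib internals).
from flask import Flask, abort, request, session
import secrets

app = Flask(__name__)
app.secret_key = "change-me"  # must be identical across all webserver replicas

@app.route("/login")
def login():
    state = secrets.token_urlsafe(16)
    session["oauth_state"] = state          # saved in the session cookie
    return f"redirect to authorize_url with &state={state}"

@app.route("/oauth-authorized")
def callback():
    expected = session.get("oauth_state")   # None if the cookie was lost
    if expected is None or expected != request.args.get("state"):
        # this is the condition behind "CSRF Warning! State not equal
        # in request and response."
        abort(400)
    return "state OK - exchange the code for a token"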

The Authlib maintainer responded to me last night, so I am going to try what he suggested, even though he did not give me much detail. I will keep this post updated.

ahipp13 commented 1 year ago

This has been fixed and the issue is closed; please refer here: https://github.com/apache/airflow/discussions/28099