apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Unreachable Secrets Backend Causes Web Server Crash #14592

Closed john-jac closed 3 years ago

john-jac commented 3 years ago

Apache Airflow version:

1.10.12

Kubernetes version (if you are using kubernetes) (use kubectl version):

n/a

Environment:

What happened:

If an unreachable secrets backend is specified in airflow.cfg, the web server crashes.

What you expected to happen:

An invalid secrets backend should be ignored with a warning, and the system should fall back to the metadatabase secrets.

How to reproduce it:

In an environment without access to AWS Secrets Manager, add the following to your airflow.cfg:

[secrets]
backend = airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend

or, in an environment without access to SSM, specify:

[secrets]
backend = airflow.contrib.secrets.aws_systems_manager.SystemsManagerParameterStoreBackend

Reference: https://airflow.apache.org/docs/apache-airflow/1.10.12/howto/use-alternative-secrets-backend.html#aws-ssm-parameter-store-secrets-backend

kaxil commented 3 years ago

@fhoda I know you talked about something similar - would you want to add your description too?

fhoda commented 3 years ago

I have not experienced the web server crashing (probably because DAG serialization is enabled), but I do have task failures that don't check the downstream secrets backends if the alternative secrets backend fails to connect.

Airflow Version:

Alternative Backend: HashiCorp Vault

I think a warning and some sort of fail-over check of the environment variables and then the metastore DB would make sense.

fhoda commented 3 years ago

@john-jac can you tell me if you had DAG serialization turned on or not when you were seeing this issue?

subashcanapathy commented 3 years ago

DAG serialization is turned on for John's use case.

lidalei commented 3 years ago

It is likely a desired (though not nice) behavior in Airflow 2.0.2. We observed a similar behavior when using GCP Secret Manager and the service account key file couldn't be found. Any Airflow command would fail because of it.

Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__init__.py", line 34, in <module>
    from airflow import settings
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/settings.py", line 37, in <module>
    from airflow.configuration import AIRFLOW_HOME, WEBSERVER_CONFIG, conf  # NOQA F401
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 1099, in <module>
    secrets_backend_list = initialize_secrets_backends()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 1020, in initialize_secrets_backends
    custom_secret_backend = get_custom_secret_backend()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 1009, in get_custom_secret_backend
    return secrets_backend_cls(**alternative_secrets_config_dict)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/secrets/secret_manager.py", line 101, in __init__
    self.credentials, self.project_id = get_credentials_and_project_id(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 309, in get_credentials_and_project_id
    return _CredentialProvider(*args, **kwargs).get_credentials_and_project()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 242, in get_credentials_and_project
    credentials, project_id = self._get_credentials_using_adc()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 295, in _get_credentials_using_adc
    credentials, project_id = google.auth.default(scopes=self.scopes)
  File "/home/airflow/.local/lib/python3.8/site-packages/google/auth/_default.py", line 346, in default
    credentials, project_id = checker()
  File "/home/airflow/.local/lib/python3.8/site-packages/google/auth/_default.py", line 189, in _get_explicit_environ_credentials
    credentials, project_id = load_credentials_from_file(
  File "/home/airflow/.local/lib/python3.8/site-packages/google/auth/_default.py", line 100, in load_credentials_from_file
    raise exceptions.DefaultCredentialsError(
google.auth.exceptions.DefaultCredentialsError: File /secrets/xxx.json was not found.

How about catching the exception and logging the error when a secrets backend cannot be initialized, instead of exiting?
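A minimal sketch of that idea (hypothetical function names; this is not what Airflow's actual `initialize_secrets_backends` in `configuration.py` does today, which lets the exception propagate):

```python
import logging

log = logging.getLogger(__name__)

def initialize_secrets_backends(get_custom_secret_backend, default_backends):
    """Build the secrets search path, skipping a custom backend that fails to load.

    Instead of letting an unreachable custom backend abort startup, log the
    error and fall through to the default backends (env vars, metastore DB).
    """
    backends = []
    try:
        custom = get_custom_secret_backend()
        if custom is not None:
            backends.append(custom)
    except Exception:  # e.g. DefaultCredentialsError raised during GCP auth
        log.exception(
            "Could not initialize the configured secrets backend; "
            "falling back to default backends"
        )
    backends.extend(default_backends)
    return backends
```

Whether swallowing the error here is safe is exactly what the rest of this thread debates.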

potiuk commented 3 years ago

If I understand correctly - this is only about the case when you cannot reach the secrets backend when it is configured? Not about a value merely missing in the backend?

If my assumption is correct, then I have no doubts whatsoever that this is desired, nice, and "good" behavior for Airflow: to crash hard in this case.

It's far worse to continue running when you rely on the backend's variables to be available and they cannot be read where you expect them (and the assumption is that when you configure the backend, you want it to be reachable).

This might lead to a number of problems - bad processing of data, wrong databases updated, wrong calculations, you name it. Totally unpredictable, because we have no idea what the retrieved value will be used for, and we have no idea whether a "fallback default" is good or not.

It's NEVER a good idea to continue when you have unpredictable and unexpected configuration. Airflow as a system is not able to assess whether the "default" value is OK to continue with when the "expected" backend does not provide the right values (either because they are missing or because it is wrongly configured). Actually it's even worse - Airflow is not even able to determine which kind of problem it is: is the backend wrongly configured, or just temporarily missing in action? Both look exactly the same from Airflow's point of view, and this means that by not crashing we accept that sometimes Airflow will run with configuration A and sometimes (when the secrets backend is not reachable) with configuration B. This is a no-go.

You either expect the secret backend to be available or not - there is no middle ground.

Crashing in this case is the only reasonable approach.

And I really like the output @lidalei copied. It is very straightforward and makes it easy to find what the problem is. Again - we have no idea whether we can continue with a default value or not. The only sensible approach is to crash hard.

lidalei commented 3 years ago

@potiuk Thank you very much for the detailed reasoning process! The concern is: if an operation doesn't even need the secrets backend (no need to use any connection or variable), should it crash or continue working? But generally it is a good idea to crash due to a bad configuration when starting an app.

potiuk commented 3 years ago

> @potiuk Thank you very much for the detailed reasoning process! The concern is: if an operation doesn't even need the secrets backend (no need to use any connection or variable), should it crash or continue working? But generally it is a good idea to crash due to a bad configuration when starting an app.

Airflow is a distributed system; there is never a 'single' operation. Even if your operation does not need it, there can be many more tasks and jobs that might. They sometimes share configuration and processes.

Really, trying to pass on the bad-configuration problem from the operator (who should configure Airflow well) to Airflow itself is a very bad idea. Crashing is the best signal Airflow can give to the human operating it: 'please fix it'.

potiuk commented 3 years ago

Closing it for now until there is a good reasoning why not to crash :)

subashcanapathy commented 3 years ago

A web application that is stateless (or intended to be in the long term) should fail gracefully in such situations. A secrets provider is at best a plugin and not a core feature. I would understand if the meta-DB connection failed and/or the IDP provider connection failed - then it makes sense to prevent startup. If the customer made a mistake configuring the secrets backend, booting up the web UI will make it obvious to them, as the task failures and logs are viewable. Without this, we are just assuming the user has access to logs on a box to be able to even understand where this went wrong.

The Airflow webserver should be a stateless representation of the current state of things in the environment. I request considering re-opening this so that we can add a configuration control like webserver.failsafe_secrets_backend = true

@potiuk @kaxil

potiuk commented 3 years ago

I am not at all convinced.

The secrets backend - when configured - is expected to work, and it can hold a crucial part of configuration that spans all of Airflow's functionality. Airflow is not able to assess whether the secrets backend is wrongly configured or this is a temporary communication problem, so letting Airflow run when an expected, crucial configuration service is not reachable is a very bad idea IMHO.

I think that when services like Airflow have that big a configuration problem, they should crash quickly at restart; and even if that configuration was applied by a managed service, it should be a signal to roll back to the previous configuration.

But I would love to hear what others think. Happy to reopen if the consensus differs from my assessment.

subashcanapathy commented 3 years ago

If we treat the secrets backend as quintessential, like the meta-DB, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver? In theory this is more of a safety trouble on the scheduler and worker, as the DAGs and tasks are failing, not being able to use the connection.

The webserver at least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent in our stance on all 3, right?

The problem is that the webserver is intended to be stateless, but it hosts the CLI commands, which makes it stateful in this ecosystem. Until we get to the future of a fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.

lidalei commented 3 years ago

> If we treat the secrets backend as quintessential, like the meta-DB, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver?
>
> In theory this is more of a safety trouble on the scheduler and worker, as the DAGs and tasks are failing, not being able to use the connection.
>
> The webserver at least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent in our stance on all 3, right?
>
> The problem is that the webserver is intended to be stateless, but it hosts the CLI commands, which makes it stateful in this ecosystem. Until we get to the future of a fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.

I just got to know secrets backend can hold connections, variables and configuration.

potiuk commented 3 years ago

> If we treat the secrets backend as quintessential, like the meta-DB, then the worker and scheduler should reflect the same startup behavior. Why would the bar be higher for the webserver?

Yep. They should fail. I was quite sure they were failing as well in this case. If they don't, it should be fixed. I think it is.

> In theory this is more of a safety trouble on the scheduler and worker, as the DAGs and tasks are failing, not being able to use the connection.

Agree. They should fail if they are not currently.

> The webserver at least presents a window of visibility into what is failing and why, by making it visible to the user through task failures and logs. We should then be consistent in our stance on all 3, right?

Not if the configuration renders the whole of Airflow potentially unusable. We can see failures of single tasks, but otherwise it is working. For me, misconfiguration of the secrets backend (because of its scope - it potentially impacts all configuration, INCLUDING the metadata DB configuration itself!) is the straight equivalent of misconfiguring the metadata URL - which makes the webserver crash.

> The problem is that the webserver is intended to be stateless, but it hosts the CLI commands, which makes it stateful in this ecosystem. Until we get to the future of a fully stateless webserver with decoupled REST API and CLI (AIP-38?), I believe that a failsafe configuration for the webserver would be extremely useful.

Agree that the webserver should be stateless, but I think it does not matter whether it is stateless or not. The UI/webserver accesses and can change the metadata DB. So what is more important is whether those accesses are read-only or not, and whether the secrets backend's presence (or absence) has an impact on actions triggered and changes made by the UI.

So more of a question is (and something that should be answered)

fhoda commented 3 years ago

As of now, it seems the expected behavior is not what is happening, and it is inconsistent across different secrets backends.

I have tried to reproduce this issue with Airflow 2.0 (main branch) and am not able to do so for any AWS secrets backends. I was only able to reproduce a crashing webserver for GCP Secret Manager and not any other secrets backend.

The GCP Secret Manager error seems to have more to do with the function that gets the credentials than with the actual connection.

google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started

I used the airflow.providers.* secrets packages for each. I noticed that the original post on the issue uses the contrib package and Airflow 1.10.12.

#export AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_path": "airflow/connections", "variables_path": "airflow/variables", "config_path": "airflow/config", "url": "http://127.0.0.1:8200", "token": "$VAULT_TOKEN"}'

export AIRFLOW__SECRETS__BACKEND=airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": "airflow-variables", "gcp_keyfile_dict": $GCP_SECRET_MANAGER_SA_KEY}'

#export AIRFLOW__SECRETS__BACKEND=airflow.providers.microsoft.azure.secrets.azure_key_vault.AzureKeyVaultBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow-connections", "variables_prefix": null, "vault_url": "https://example-akv-resource-name.vault.azure.net/"}'

#export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables", "profile_name": "default"}'

#export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.systems_manager.SystemsManagerParameterStoreBackend
#export AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "/airflow/connections", "variables_prefix": "/airflow/variables", "profile_name": "default"}'

Here are my findings:

I believe we should evaluate what the expected behavior should be as compared to what is actually happening.

Also, after discussing with @kaxil, there may be a middle ground for a fail-over implementation that could make sense here:

  1. If configs are being retrieved through the secrets backend, then a failure makes sense.
  2. If connections and/or variables cannot be retrieved, then fail-over could be a strategy used by users to ensure DAG/task success and predictable execution.
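The second point could be sketched as a lookup that walks the backend chain (custom backend, environment variables, metastore DB) and skips any backend that is unreachable. This is a hypothetical sketch, not Airflow's actual lookup code; the `get_variable` method name mirrors the real `BaseSecretsBackend` interface:

```python
import logging

log = logging.getLogger(__name__)

def get_variable_with_failover(key, backends):
    """Try each secrets backend in order and fail over on connectivity errors.

    `backends` is a list of objects exposing `get_variable(key)`; a backend
    that raises is logged and skipped instead of failing the whole task.
    Returns None if no backend holds the key.
    """
    for backend in backends:
        try:
            value = backend.get_variable(key)
        except Exception:
            log.warning("Secrets backend %r unreachable, trying the next one", backend)
            continue
        if value is not None:
            return value
    return None
```

Per point 1 above, the same treatment would not be applied to configuration lookups, which would still fail hard.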

potiuk commented 3 years ago

Agree we have a consistency issue here. Interestingly, the AWS Secrets Manager backend crashed originally for @subashcanapathy and @john-jac but did not crash for you, @fhoda. Not sure what the reason for that is (maybe a 1.10 vs 2.* behavioral difference)?

I really like the idea of different behavior for different type of access. I think it answers my concerns perfectly and what it really boils down to is "who" is the "client" - whether it is "airflow" or the "DAG/task writer".

I think the main difference between configuration and variables/connections is that Airflow has default values for most configuration options, and when they are not found it falls back to those defaults - which might alter Airflow's behavior. So the absence of the secrets backend, when it is configured and configuration is retrieved from it, is very dangerous. And since configuration is accessed under the hood by Airflow, without the DAG or task using it, it's Airflow that is the "client" and it's Airflow that should handle it (and crashing is the only reasonable behavior IMHO). Simply put, the DAG writer is not in control to make any decision here.

This is (as you rightfully noticed) far less of a concern for connections and variables - the "clients" for those are DAG writers. Whoever uses them should be prepared for what happens when the secrets backend is missing. Either the writers will prepare fallback values for them in the DB, or they will have to handle the missing value somehow (and it is up to the user what to do in that case). But they are in full control; there is no need to crash Airflow (as long as it is not configuration that Airflow itself reads from the backend).
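On the DAG-writer side, Airflow's real `Variable.get` already accepts a `default_var` for a missing key, but an exception from an unreachable backend still propagates. A writer wanting to survive that too could wrap the lookup - a generic sketch with a hypothetical helper name, not an Airflow API:

```python
def get_with_fallback(fetch, key, default):
    """Fetch a value from a possibly unreachable secrets store.

    `fetch` stands in for a lookup such as Variable.get; any lookup error
    (or a missing key, signalled by None) yields the writer-supplied default
    instead of failing the task.
    """
    try:
        value = fetch(key)
    except Exception:  # e.g. backend unreachable
        return default
    return value if value is not None else default
```

Whether silently using a fallback is acceptable is, per the discussion above, a decision only the DAG writer can make.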

Reopening it as it might actually be an actionable item to do :)

@subashcanapathy , @john-jac - would that be a reasonable approach for you as well ?

fhoda commented 3 years ago

Since we are in agreement that fail-over is appropriate for connections and variables, I have started to work on those. I will address the inconsistency in behavior and the config use case after we determine what the final state for those should be.