apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Airflow does not pass through Celery's support for Redis Sentinel over SSL. #28010

Closed jonathanjuursema closed 1 year ago

jonathanjuursema commented 1 year ago

Apache Airflow version

2.4.3

What happened

When configuring Airflow/Celery to use Redis Sentinel as a broker, the following error pops up:

airflow.exceptions.AirflowException: The broker you configured does not support SSL_ACTIVE to be True. Please use RabbitMQ or Redis if you would like to use SSL for broker.

What you think should happen instead

Celery has supported TLS on Redis Sentinel for a while now.

It looks like this piece of code explicitly prevents a valid Redis Sentinel TLS configuration from being passed through to Celery. (Sentinel broker URLs are prefixed with sentinel:// instead of redis://.)
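
For reference, the guard in question looks roughly like this (a paraphrased sketch of airflow/config_templates/default_celery.py around 2.4.x; the function shape and option dicts are approximations, not the verbatim source):

from airflow.exceptions import AirflowException

def broker_use_ssl(broker_url: str, ssl_active: bool):
    # Paraphrased sketch, not verbatim Airflow source.
    if not ssl_active:
        return None
    if 'amqp://' in broker_url:
        return {'keyfile': '...', 'certfile': '...', 'ca_certs': '...'}
    if 'redis://' in broker_url:
        return {'ssl_keyfile': '...', 'ssl_certfile': '...', 'ssl_ca_certs': '...'}
    # sentinel:// URLs (and anything else) fall through to here and are
    # rejected, even though Celery itself can speak TLS to Sentinel.
    raise AirflowException(
        'The broker you configured does not support SSL_ACTIVE to be True. '
        'Please use RabbitMQ or Redis if you would like to use SSL for broker.'
    )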

How to reproduce

This problem can be reproduced by deploying Airflow using Docker with the following environment variables:

AIRFLOW__CELERY__BROKER_URL=sentinel://sentinel1:26379;sentinel://sentinel2:26379;sentinel://sentinel3:26379
AIRFLOW__CELERY__SSL_ACTIVE=true
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__MASTER_NAME='some-master-name'
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__PASSWORD='some-password'
AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG

Note that I'm not 100% certain of the syntax for the password environment var. I can't get to the point of testing this: without TLS, connections to our internal brokers are refused (they require TLS), and with TLS enabled no connection is even attempted because of the code linked earlier.

I've verified with the reference redis-cli that the master-name we use does return a valid response and that the Sentinel set-up works as expected.

Operating System

Docker (apache/airflow:2.4.3-python3.10)

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

Deployed using Nomad.

Anything else

This is my first issue with this open source project. Please let me know if there's more relevant information I can provide to follow through on this issue.

I will try to make some time available soon to see if a simple code change in the file mentioned earlier would work, but as this is my first issue here I would still have to set up a full development environment.

Are you willing to submit PR?

If this is indeed a simple fix I'd be willing to look into making a PR. I would like some feedback on the problem first, though, if possible!

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 1 year ago

This is just default configuration. You can override it with your own dictionary:

https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#celery-config-options
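
A minimal override module would look something like this (a sketch; the module name is illustrative, and the file must be importable, e.g. placed under $AIRFLOW_HOME/config):

# my_celery_config.py -- illustrative name; point the env var at it:
# AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=my_celery_config.CELERY_CONFIG
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    # override individual Celery settings here, e.g. 'broker_url': ...
}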

jonathanjuursema commented 1 year ago

Hey @potiuk, thanks for getting back!

I have studied the docs you linked and did some more Googling. Based on this Stack Overflow question I've built an implementation for our use case, also using this syntax for keeping the defaults.

However, when I deploy this, the deployment works and both Airflow and Celery seem happy with the config, but it still doesn't work.

When inspecting the actual runtime config in the web interface, I see that it correctly reads the value for AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS, but for AIRFLOW__CELERY__BROKER_URL it falls back to the default. Even though we're specifying our own (Redis Sentinel) broker URL in that place in our own custom dict, the debug logging shows that Celery is trying to connect using the "default" connection string from AIRFLOW__CELERY__BROKER_URL. It seems to ignore the value from AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS.

(I could not find a way in the web interface to dump/debug the actual Celery configuration, as opposed to what Airflow says it's forwarding to Celery. If there's a way to achieve that, please do let me know.)
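
One way that might work to dump what Celery actually received — a sketch, assuming the Airflow 2.4.x layout where the Celery app lives in airflow.executors.celery_executor — is a Python shell inside a container:

# Run inside an Airflow container, e.g. `docker exec -it <worker> python`.
# Assumes Airflow 2.4.x; newer versions moved the Celery app elsewhere.
from airflow.executors.celery_executor import app

print(app.conf.broker_url)                # the broker URL Celery will really use
print(app.conf.broker_transport_options)  # master_name, password, etc.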

potiuk commented 1 year ago

Can you post (anonymised) snippets of your configuration and links (ideally as gists) to Airflow logs with debugging enabled (you will find how to do that in the docs), showing what's going on?

Maybe you've made a typo or misunderstood how to configure it, but sharing the snippets and logs (or even looking closely at them yourself) should help in spotting it.

I understand you say "we've done that", but before anyone attempts to reproduce it we need to see exactly what you've done, so we can reproduce it and help you diagnose it.

jonathanjuursema commented 1 year ago

Oh, I most certainly don't mind sharing more snippets (I should've done so in the first place; I was in a bit of a rush)!

While trying to reproduce the issue the situation has changed. (I'm not sure why. I didn't commit the last configuration because I couldn't get it to work, so in reproducing I've started the process again. I'll make sure to save the config this time so that we can iterate on it if needed.)

The docker containers for the worker, webserver and scheduler have the following environment variables set (all config is done via environment variables):

(airflow)printenv | grep AIRFLOW
AIRFLOW__CORE__HOSTNAME_CALLABLE=socket.gethostname
AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW_INSTALLATION_METHOD=
AIRFLOW_USER_HOME_DIR=/home/airflow
AIRFLOW__SMTP__SMTP_PASSWORD=xxx
AIRFLOW__SMTP__SMTP_HOST=xxx
AIRFLOW_PIP_VERSION=22.3.1
AIRFLOW__SMTP__SMTP_USER=xxx
AIRFLOW__SMTP__SMTP_SSL=false
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=3600
AIRFLOW_HOME=/opt/airflow
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN_CMD=/opt/airflow/airflow_construct_sql_conn_str.sh
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=300
AIRFLOW__SMTP__SMTP_PORT=587
AIRFLOW__SMTP__SMTP_STARTTLS=true
AIRFLOW_UID=50000
AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth, airflow.api.auth.backend.session
AIRFLOW__CORE__ENABLE_XCOM_PICKLING=true
AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=retail_celery_config.CELERY_CONFIG
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__FERNET_KEY=xxx
AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT=false
AIRFLOW__CELERY__RESULT_BACKEND_CMD=/opt/airflow/airflow_construct_dbsql_conn_str.sh
AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG
AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080
AIRFLOW_VERSION=2.4.3
AIRFLOW__SMTP__SMTP_MAIL_FROM=xxx
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=true
(airflow)printenv | grep CELERY
CELERY_SSL_ACTIVE=true
AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=retail_celery_config.CELERY_CONFIG
AIRFLOW__CELERY__RESULT_BACKEND_CMD=/opt/airflow/airflow_construct_dbsql_conn_str.sh
(airflow)printenv | grep REDIS
REDIS_BROKER_MASTER_PASSWORD=xxx
REDIS_BROKER_MASTER_NAME=xxx
REDIS_BROKER_URL=sentinel://xxx:26379;sentinel://xxx:26379;sentinel://xxx:26379

I've also mounted the following file in /opt/airflow/config/retail_celery_config.py:

from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
import os

CELERY_CONFIG = {
    # Start from Airflow's defaults and override only the broker settings.
    **DEFAULT_CELERY_CONFIG,
    # Sentinel URL list from the environment; ssl_cert_reqs=none disables
    # certificate validation (our Sentinels use an internal CA, see below).
    'broker_url': '{broker_url}?ssl_cert_reqs=none'.format(broker_url=os.getenv('REDIS_BROKER_URL')),
    'broker_transport_options': {
        'password': os.getenv('REDIS_BROKER_MASTER_PASSWORD'),
        'master_name': os.getenv('REDIS_BROKER_MASTER_NAME')
    }
}

What I now observe is interesting. I think the scheduler is working: neither the webserver nor the scheduler is throwing relevant errors, and the webserver doesn't show the "scheduler hasn't run in xxx minutes" banner. If there are additional checks I can do, please let me know.

However, the worker still won't start:

[2022-12-12 15:46:10,811: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
Will retry using next failover.

[2022-12-12 15:46:10,828: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
Will retry using next failover.

[2022-12-12 15:46:10,840: ERROR/MainProcess] consumer: Cannot connect to sentinel://xxx:26379//: No master found for 'xxx'.
Trying again in 32.00 seconds... (16/100)

This indicates that it does fetch the right values (or at least the master name and sentinel list). Using the reference redis-cli I can validate that the Redis configuration does work:

➜  src ./redis-cli -p 26379 --tls --insecure
127.0.0.1:26379> sentinel get-master-addr-by-name non-existing-master-name
(nil)
127.0.0.1:26379> sentinel get-master-addr-by-name xxx
1) "xx.xx.xx.xx"
2) "7003"

It should be noted that both the Sentinels and the Redis masters use a non-public CA (to make things even worse). Either the scheduler/webserver accept the ssl_cert_reqs=none from above and the worker doesn't, or the worker actually attempts an SSL connection while the scheduler/webserver either don't attempt one or don't log their attempts.

It should also be noted that we use an internally managed Redis/Sentinel cluster (which I don't control; we're just a user). However, we have various other applications deployed using Redis in the same cluster and from the same application machines (effectively using the same firewall/network path), and those applications do work as intended, so my first hunch is that the problem is not with the Redis/Sentinel cluster itself.

For now I'm ignoring certificate validation, but if I can get this to work I'd like to mount the CA PEM and specify that in the broker URL.
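
If that works, the CA could presumably be supplied via the usual redis-py style query parameters — a sketch, assuming kombu forwards ssl_cert_reqs/ssl_ca_certs on sentinel:// URLs the way it does for rediss://, and with a hypothetical mount path:

import os

# Hypothetical CA mount point; parameters are appended to the env-provided
# URL string exactly like the ssl_cert_reqs=none parameter above.
ca_path = '/opt/airflow/certs/internal-ca.pem'
broker_url = '{url}?ssl_cert_reqs=required&ssl_ca_certs={ca}'.format(
    url=os.getenv('REDIS_BROKER_URL'), ca=ca_path,
)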

Please let me know if you need any additional information, snippets or if you have further troubleshooting ideas. The help is greatly appreciated!

potiuk commented 1 year ago

Can you dump the env values as the same user the Airflow worker runs as (making sure the exact same entrypoint is used), and check that your workers have the same variables and settings mounted as the scheduler/webserver?

I guess the problem is that your workers do not have the same variables set, or the mounted file is not mounted there, or maybe the user Airflow runs as has no permissions. I think you can track it down by also enabling debug logging for Airflow (you can find it in the config/docs). Also, when I debug such issues I usually modify the config in a way that makes absolutely sure it is actually processed - for example, raising an Exception with some meaningful message right after the configuration is parsed is a good way to see that it actually is processed, by the various components.

Raising an exception in your config and seeing it in your logs will confirm you have not made a typo or some other mistake. This is what I'd do, at least, if I had a similar issue.
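
A minimal version of that smoke test (a sketch; the message text is arbitrary) could be:

# Temporarily drop this at the top of retail_celery_config.py. Any component
# that actually imports the module will now crash with this message in its
# logs, proving the file is loaded; remove it after testing.
raise RuntimeError('retail_celery_config.py was imported -- config is being read')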

Can you please try such an exercise?

jonathanjuursema commented 1 year ago

I've spent some time playing with our set-up to tackle some of the question/challenges you set out. I have the following observations:

Is the configuration the same between the worker, webserver and scheduler? Yes. As mentioned, we deploy Airflow in a containerized setting, and all containers (webserver, scheduler and worker) are provided environment variables from (mostly) the same central source. To double-check, I've run the following command in all three containers:

printenv | grep AIRFLOW; printenv | grep REDIS; printenv | grep CELERY

I sorted and compared the output in Excel (not by eye, but by writing a bunch of "if this cell equals that cell" statements), and I am 100% sure all containers run the exact same environment config.

Can you make sure you are actually loading the intended configuration? I did the following: I've updated the /opt/airflow/config/retail_celery_config.py discussed in my previous comment like this (note the broker URL):

from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
import os

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    # Deliberately bogus broker URL: any component that loads this module
    # should now fail loudly when it tries to connect.
    'broker_url': 'banana',
    'broker_transport_options': {
        'password': os.getenv('REDIS_BROKER_MASTER_PASSWORD'),
        'master_name': os.getenv('REDIS_BROKER_MASTER_NAME')
    }
}

If I deploy this way, I'm observing the following:

The webserver and scheduler don't show anything weird in their logging. Their stdout looks fine, the scheduler stderr is empty, and the webserver stderr is below; I don't think it is related.

/home/airflow/.local/lib/python3.10/site-packages/azure/storage/common/_connection.py:82 SyntaxWarning: "is" with a literal. Did you mean "=="?
[2022-12-14 14:09:29 +0000] [30] [INFO] Starting gunicorn 20.1.0
[2022-12-14 14:09:29 +0000] [30] [INFO] Listening at: http://0.0.0.0:8080 (30)
[2022-12-14 14:09:29 +0000] [30] [INFO] Using worker: sync
[2022-12-14 14:09:29 +0000] [46] [INFO] Booting worker with pid: 46
[2022-12-14 14:09:29 +0000] [47] [INFO] Booting worker with pid: 47
[2022-12-14 14:09:29 +0000] [48] [INFO] Booting worker with pid: 48
[2022-12-14 14:09:29 +0000] [49] [INFO] Booting worker with pid: 49

The worker, however, shows the following stdout:

 -------------- celery@f616d2ff89b0 v5.2.7 (dawn-chorus)
--- ***** ----- 
-- ******* ---- Linux-5.18.0-0.deb11.4-amd64-x86_64-with-glibc2.31 2022-12-14 14:08:30
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         airflow.executors.celery_executor:0x7f29e6748ac0
- ** ---------- .> transport:   amqp://guest:**@banaan:5672//
- ** ---------- .> results:     mysql://xxx:**@xxx:3306/xxx
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> default          exchange=default(direct) key=default

And the following in stderr:

[2022-12-14 14:17:24,067: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@banaan:5672//: [Errno -2] Name or service not known.
Trying again in 32.00 seconds... (16/100)

[2022-12-14 14:17:56,098: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@banaan:5672//: [Errno -2] Name or service not known.
Trying again in 32.00 seconds... (16/100)

[2022-12-14 14:18:28,125: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@banaan:5672//: [Errno -2] Name or service not known.
Trying again in 32.00 seconds... (16/100)

[2022-12-14 14:19:00,160: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@banaan:5672//: [Errno -2] Name or service not known.
Trying again in 32.00 seconds... (16/100)

[2022-12-14 14:19:32,189: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@banaan:5672//: [Errno -2] Name or service not known.
Trying again in 32.00 seconds... (16/100)

[2022-12-14 14:20:04,217: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@banaan:5672//: [Errno -2] Name or service not known.
Trying again in 32.00 seconds... (16/100)

This suggests to me that at least the worker is picking up the custom config.

Other observations.

This makes me wonder: if I set the Redis config to something bogus, how come the webserver and scheduler don't complain?

In order to investigate this I set AIRFLOW__WEBSERVER__EXPOSE_CONFIG=true (AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG has been on since the start of this experiment).

Now I can observe the configuration in the Airflow web interface. This page has two sections. /opt/airflow/airflow.cfg shows the Airflow config file. This is just the default file; we don't specify this file, so we're using the one that comes with the upstream Airflow container.

Under Running Configuration we can see the actual running configuration, and here I see something interesting:

Section   Key                      Value                                  Source
celery    broker_url               redis://redis:6379/0                   airflow.cfg
celery    celery_config_options    retail_celery_config.CELERY_CONFIG     env var

It loads our custom celery config dict (as discussed earlier) from the env var. However, it also loads the broker_url from the airflow.cfg config file. Somehow, the worker appears to use the one from our custom config dict (since the logging clearly shows the test string there). The webserver and scheduler, I think, fall back to the default broker URL from airflow.cfg (or at least seem to ignore our custom dict). They don't show any connection errors, however (I've shared the logs above; the stdout logs don't reference the test string anywhere, nor give any indication that something is wrong). According to the docs, I'd expect the environment variable to take priority. I'm not sure why (if redis://redis:6379/0 does not exist) the webserver and scheduler seem to work fine.
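
One way to check which value each component actually resolves — a sketch using Airflow's configuration API, run in each container:

from airflow.configuration import conf

# conf.get() resolves the same precedence chain (env var > airflow.cfg >
# defaults) that the components themselves use.
print(conf.get('celery', 'broker_url'))
print(conf.get('celery', 'celery_config_options'))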

I've also searched in our log aggregator (the container UI is not the best for investigating logs older than a few minutes) for the test string, and for the string redis. The first only shows log lines from the worker container (the ones I shared above); the second shows the following:

Date,Host,Service,Container Name,Message
"2022-12-14T13:49:07.140Z","""vmXXXX""","""airflow""","""airflow-init-5df407da-2388-dc6f-15be-78cec8708021""","[2022-12-14 13:49:07,140] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T13:49:11.403Z","""vmXXXX""","""airflow""","""airflow-init-5df407da-2388-dc6f-15be-78cec8708021""","[2022-12-14 13:49:11,402] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T13:49:37.692Z","""vmXXXX""","""airflow""","""airflow-webserver-5df407da-2388-dc6f-15be-78cec8708021""","[2022-12-14 13:49:37,692] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T13:49:43.940Z","""vmXXXX""","""airflow""","""airflow-webserver-5df407da-2388-dc6f-15be-78cec8708021""","[2022-12-14 13:49:43,940] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T13:49:47.090Z","""vmXXXX""","""airflow""","""airflow-webserver-5df407da-2388-dc6f-15be-78cec8708021""","[2022-12-14 13:49:47,088] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:02:36.727Z","""vmXXXX""","""airflow""","""airflow-worker-8860332e-07cf-20dc-19b1-56c4ba462531""","File ""/home/airflow/.local/lib/python3.10/site-packages/redis/client.py"", line 1378, in ping"
"2022-12-14T14:02:36.727Z","""vmXXXX""","""airflow""","""airflow-worker-8860332e-07cf-20dc-19b1-56c4ba462531""","File ""/home/airflow/.local/lib/python3.10/site-packages/redis/client.py"", line 898, in execute_command"
"2022-12-14T14:02:36.727Z","""vmXXXX""","""airflow""","""airflow-worker-8860332e-07cf-20dc-19b1-56c4ba462531""","File ""/home/airflow/.local/lib/python3.10/site-packages/redis/connection.py"", line 1192, in get_connection"
"2022-12-14T14:02:36.727Z","""vmXXXX""","""airflow""","""airflow-worker-8860332e-07cf-20dc-19b1-56c4ba462531""","File ""/home/airflow/.local/lib/python3.10/site-packages/redis/sentinel.py"", line 44, in connect"
"2022-12-14T14:02:36.727Z","""vmXXXX""","""airflow""","""airflow-worker-8860332e-07cf-20dc-19b1-56c4ba462531""","File ""/home/airflow/.local/lib/python3.10/site-packages/redis/sentinel.py"", line 106, in get_master_address"
"2022-12-14T14:02:36.727Z","""vmXXXX""","""airflow""","""airflow-worker-8860332e-07cf-20dc-19b1-56c4ba462531""","File ""/home/airflow/.local/lib/python3.10/site-packages/redis/sentinel.py"", line 219, in discover_master"
"2022-12-14T14:04:13.228Z","""vmXXXX""","""airflow""","""airflow-init-6686130f-cb74-c46b-2bff-a81c723030ea""","[2022-12-14 14:04:13,227] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:04:17.405Z","""vmXXXX""","""airflow""","""airflow-init-6686130f-cb74-c46b-2bff-a81c723030ea""","[2022-12-14 14:04:17,405] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:04:45.321Z","""vmXXXX""","""airflow""","""airflow-webserver-6686130f-cb74-c46b-2bff-a81c723030ea""","[2022-12-14 14:04:45,320] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:04:52.018Z","""vmXXXX""","""airflow""","""airflow-webserver-6686130f-cb74-c46b-2bff-a81c723030ea""","[2022-12-14 14:04:52,018] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:04:55.988Z","""vmXXXX""","""airflow""","""airflow-webserver-6686130f-cb74-c46b-2bff-a81c723030ea""","[2022-12-14 14:04:55,987] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:08:44.904Z","""vmXXXX""","""airflow""","""airflow-init-b75ff09a-cfc0-dad0-0ece-8d3bdcda9553""","[2022-12-14 14:08:44,904] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:08:49.385Z","""vmXXXX""","""airflow""","""airflow-init-b75ff09a-cfc0-dad0-0ece-8d3bdcda9553""","[2022-12-14 14:08:49,384] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:09:18.278Z","""vmXXXX""","""airflow""","""airflow-webserver-b75ff09a-cfc0-dad0-0ece-8d3bdcda9553""","[2022-12-14 14:09:18,278] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:09:24.433Z","""vmXXXX""","""airflow""","""airflow-webserver-b75ff09a-cfc0-dad0-0ece-8d3bdcda9553""","[2022-12-14 14:09:24,433] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"
"2022-12-14T14:09:29.312Z","""vmXXXX""","""airflow""","""airflow-webserver-b75ff09a-cfc0-dad0-0ece-8d3bdcda9553""","[2022-12-14 14:09:29,311] {providers_manager.py:433} DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.redis.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-redis"

Looking forward to your observations! Do let me know if there's any more information I can provide. :)

potiuk commented 1 year ago

Ok. I think the problem is sentinel.

Put simply, it looks like Celery does not work with sentinel.

I hope someone who knows Celery better will be able to find a solution for this. It seems others have also started to report problems with sentinel usage. I know Celery used to not support Sentinel; it required special registration, and https://celery-redis-sentinel.readthedocs.io/en/latest/ had to be used to support it.

But this is where my knowledge ends. I will direct others who have the same problem here, and maybe they will be able to find some solutions.

potiuk commented 1 year ago

A few more things, just adding to the above - which might be a bad guess.

The fact that neither the webserver nor the scheduler fails is because a) the webserver does not connect to redis at all, and b) the scheduler will only do so when scheduling tasks via the celery executor - while the workers are trying to connect to it as consumers.

I believe your configuration is passed properly - but some configuration of celery/networking/DNS/firewall simply makes the attempts to reach the redis instances fail.

This error is quite clear about it:

[2022-12-14 14:18:28,125: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@banaan:5672//: [Errno -2] Name or service not known.

This indicates that the client cannot resolve the "banaan" name - which might simply mean that your banaan address cannot be resolved by the Airflow worker.

Now - I do not know what your xxx values in the configuration are, but to me this looks like a problem with networking/DNS, not with the client.

So I guess this issue description was wrong. It's likely NOT about SSL parameters to pass; it's most likely a deployment issue you have, @jonathanjuursema

jonathanjuursema commented 1 year ago

Hey! Thanks for checking back.

I think we're confusing two things here.

The test with the banana hostname was in response to the following challenge:

Raising an exception in your config and seeing it in your logs will make sure you have not made any typo or problem

I've set the hostname to something non-existent to validate that it does indeed load the configuration. Because the worker throws the "name or service not known" error, it seems to correctly load the (invalid) config. The scheduler and webserver, however, don't seem to, as they would otherwise also have thrown this or a similar exception.

Just to be sure, I've also validated name resolution with the actual domain names I'm having problems with (internal information redacted with xxx). This works as expected:

(airflow)host xxx
xxx has address 10.120.xxx.xxx

Finally, I'm positive we can rule out firewalling. There are other applications running on the same virtual machines that can access those Redis Sentinel instances just fine. Our Redis firewalls are configured to accept all incoming connections from the virtual machines Airflow (and those other apps) run on.

potiuk commented 1 year ago

I hope, then, that someone using sentinel can debug this further and solve it.

dintorf commented 1 year ago

I have some snippets in Airflow Issue #28655. This worked for me locally, but I have yet to test it elsewhere. If this works for anyone else, I will submit the example to the documentation.