quangvnm opened 3 months ago
Cross-posted: redis/redis-py#3353, celery/kombu#2096
And a very helpful response from discuss.python.org
This has been happening to us for some time, too: the same AttributeError, the same inability to work until you restart the pod.
That's quite a coincidence: there was another bug in kombu itself that affected Redis deployments (https://github.com/celery/kombu/pull/2007). IIRC, it affected a couple of Airflow versions, rendering them unstable.
Now, IMO, the usability problem is the same in both cases: the container doesn't even crash, so k8s can't auto-heal it by restarting the pod.
I have also experienced this in a k8s deployment. It could only be fixed by manual intervention.
According to https://github.com/celery/kombu/pull/2007#issuecomment-2248133561, the issue is fixed in Celery v5.5.0b1, so we might want to bump the minimum Celery version to avoid this issue for everyone.
Can anyone confirm whether upgrading to Celery 5.5.0 solves the problem? (Celery 5.5.0 is not released yet, but rc1 is available, so if anyone can confirm this fixes the issue by checking it in a test environment, that would be great.)
If I'm not mistaken, I tried it when kombu 5.4.0 was released, and it didn't help. As far as I can tell, this issue is specific to Airflow's Celery provider: when I raised the timeout value from 1 to 300, it stopped occurring.
Anyway, it's rather strange to me that this value is so small by default: the scheduler container doesn't only schedule tasks, it also runs the DAG parser, and in a real environment with CPU limits or badly written DAGs, parsing could easily take more than 1s, making send/fetch operations fail every now and then.
UPD: Please don't take my word for it, though. We run a custom build/deploy configuration, so it's quite possible this is a local problem on our side.
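For anyone who wants to try the same mitigation: if I read the provider docs correctly, the relevant knob is `operation_timeout` in the `[celery]` section, which can also be set through the standard `AIRFLOW__SECTION__KEY` environment-variable mapping. A sketch (the value 300 mirrors what worked for us, not a recommended default):

```python
# Sketch: raise the Celery provider's operation timeout before Airflow reads
# its config. Assumes the standard AIRFLOW__CELERY__OPERATION_TIMEOUT mapping
# for [celery] operation_timeout; 300 is just the value that worked for us.
import os

os.environ["AIRFLOW__CELERY__OPERATION_TIMEOUT"] = "300"
```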
Apache Airflow version
2.9.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
Hello,
This is a copy-paste of an issue we also posted on celery/kombu and redis/redis-py. We include an MCVE at the end of this post.
TL;DR: the `with timeout(seconds=OPERATION_TIMEOUT):` block in `airflow.executors.celery_executor.send_task_to_executor` can leave behind a badly broken import of redis, and may affect other packages we haven't discovered yet. This is a race condition and very hard to debug at first; to reproduce it deterministically, we had to inject a delay into the first import of redis (see "How to reproduce" below).

Relates:
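For context, `airflow.utils.timeout` is, to our understanding, a SIGALRM-based context manager. A minimal stand-in sketch of the mechanism (not Airflow's exact code):

```python
# Minimal stand-in for airflow.utils.timeout (a sketch of the mechanism, not
# Airflow's exact code): a SIGALRM-based context manager. The alarm can fire
# while *anything* runs inside the block - including an `import` statement.
import signal

class timeout:
    def __init__(self, seconds=1):
        self.seconds = seconds

    def _handle(self, signum, frame):
        raise TimeoutError("operation timed out")

    def __enter__(self):
        signal.signal(signal.SIGALRM, self._handle)
        signal.alarm(self.seconds)

    def __exit__(self, exc_type, exc_val, exc_tb):
        signal.alarm(0)  # cancel any pending alarm on exit
```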
Our environment:
We've observed this issue for at least several months with our Airflow deployment using the official Helm chart; we see the same symptoms as in the related issues/discussions:
We've verified that we have neither a `redis` folder nor a `redis.py` file of our own that could shadow the package. This is a very sporadic error: most of the time everything works, then it stops working for no apparent reason, and once it happens, the scheduler is broken and can't schedule anything (same error message) until we restart the scheduler process (restart the pod). This happens quite randomly (one in tens or fifty Helm chart deployments), and we couldn't reproduce it reliably for debugging purposes.
What we found out is that once it happens, the bug won't disappear until we restart (kill) the scheduler pod. We could reproduce it randomly with these steps in a test Airflow:
```
kubectl delete po -l release=airflow-XXX,component=scheduler --force --grace-period 0
```
At first, we suspected a race condition in importing the redis package, so we injected debug code just before the line `class PrefixedRedisPipeline(GlobalKeyPrefixMixin, redis.client.Pipeline):` with `print(sys.path)`, `print(redis)`, `print(redis.__version__)`, etc. Everything was okay, except that `print(dir(redis))` gave a different result compared to a Python shell session inside the same container:
We noted that `dir(redis)` inside the troublesome scheduler lacks several attributes, notably `redis.client`.
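For reference, a quick diagnostic one can run in the broken process (our sketch; the `sys.modules` check reflects our assumption that the submodule itself did finish loading):

```python
# Diagnostic sketch for the broken scheduler process: the parent package is
# missing the attribute even though the submodule sits in sys.modules.
import sys
import redis

print(redis, getattr(redis, "__version__", "?"))
print("client" in dir(redis))         # False in the broken process
print("redis.client" in sys.modules)  # our assumption: True - submodule loaded
```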
Another thing we discovered is that in every case there is always a Timeout (as you can see in the log above), and sure enough, the bug always happens when the import of `redis` is interrupted by that timeout (we printed line numbers in `redis/__init__.py`, and the import didn't run to the end). In very rare cases, `airflow.utils.timeout` doesn't work as intended: the timeout error is printed in the middle of `import redis`, but the import still runs to the end, and in those cases the bug can't happen. Most of the time, though, the timeout interrupts the import.

With this idea, we injected a `sleep` at the end of `redis/__init__.py`, and sure enough, we could reproduce this bug every time.

So an interrupted import gives a different result than a normal `import`: it seems the broken import doesn't bind non-public members of the package. `Redis` and `StrictRedis` are exposed explicitly by `redis/__init__.py`, but the `redis.client` submodule attribute is only set implicitly by the import machinery.

I've found one comment from discuss.python.org:
In our case, if we re-import after an interrupted import, this isn't true anymore: the submodule attribute isn't set at all. We didn't dig further into Python internals to find out why this happens.
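To illustrate, here is a self-contained sketch of the failure mode as we understand it, using a hypothetical `slowpkg` package standing in for redis (SIGALRM-based, so POSIX only):

```python
# Self-contained sketch of the failure mode (hypothetical package "slowpkg"
# standing in for redis; SIGALRM-based, so Linux/macOS only).
import os
import signal
import sys
import tempfile

# Build a tiny package: __init__.py imports its submodule, then stalls,
# mimicking redis/__init__.py with our injected sleep.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "slowpkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "client.py"), "w") as f:
    f.write("class Redis:\n    pass\n")
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("from slowpkg.client import Redis\n"
            "import time\n"
            "time.sleep(5)\n")
sys.path.insert(0, root)

def _interrupt(signum, frame):
    raise TimeoutError("interrupted mid-import")

signal.signal(signal.SIGALRM, _interrupt)
signal.alarm(1)  # fires while __init__.py is still sleeping
try:
    import slowpkg
except TimeoutError:
    pass
signal.alarm(0)

# The failed import removed 'slowpkg' from sys.modules, but the already
# imported submodule 'slowpkg.client' stays cached.
print("slowpkg" in sys.modules)         # False
print("slowpkg.client" in sys.modules)  # True

# Re-importing re-runs __init__.py, but the cached submodule is returned
# without re-binding the 'client' attribute on the fresh parent module.
import slowpkg
print("client" in dir(slowpkg))  # False -> slowpkg.client raises AttributeError
```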
We see at least four options to fix this bug, including:

- Add `from redis import client` to `redis/__init__.py`.
- Replace the `import redis` in `kombu/transport/redis.py` with `from redis import client`, and replace every `redis.client` with `client`.

We opted for the second option in our dev environment for the moment (see the sketch below).
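A sketch of what the second option looks like (our local patch idea, not upstream kombu code; `PrefixedPipeline` is a simplified stand-in for kombu's `PrefixedRedisPipeline`):

```python
# Sketch of our local kombu/transport/redis.py patch (not upstream code):
# bind the submodule directly instead of going through the parent package
# attribute that an interrupted import fails to set. As far as we know, the
# `from ... import` form also falls back to sys.modules on Python 3.7+,
# which keeps it working even when the parent attribute is missing.
from redis import client  # instead of: import redis

class PrefixedPipeline(client.Pipeline):  # was: redis.client.Pipeline
    """Simplified stand-in for kombu's PrefixedRedisPipeline."""
```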
We aren't sure whether this bug happens often enough to be worth addressing upstream, but at least other devs won't lose days of debugging like we did ^^
This raises another question: could the timeout (or a similar mechanism) interrupt an import elsewhere and introduce this bug in another package, or another hard-to-catch race condition?
What you think should happen instead?
The Airflow timeout should not leave a broken import behind. We are not sure whether this should instead be addressed in redis/redis-py, though; we posted the same issue in celery/kombu and redis/redis-py.
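One possible defensive direction (our suggestion, not current Airflow behavior) is to make sure the import chain has fully completed before entering the alarm-based timeout, so the signal can never interrupt a first import. A sketch:

```python
# Defensive sketch (our idea, not Airflow's current code): complete the redis
# import chain eagerly, *before* any alarm-based timeout runs, so a timeout
# can no longer leave a half-imported package behind.
import signal

import redis          # fully executes redis/__init__.py up front
import redis.client   # make sure the submodule is imported as well

def _interrupt(signum, frame):
    raise TimeoutError

signal.signal(signal.SIGALRM, _interrupt)
signal.alarm(1)
try:
    import time
    time.sleep(5)  # stand-in for send_task_to_executor work that may time out
except TimeoutError:
    pass
finally:
    signal.alarm(0)

print("client" in dir(redis))  # True: the import finished before the timeout
```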
How to reproduce
This is a race condition that is hard to reproduce on a local machine, so we introduced a delay into the very first import of redis to trigger the bug deterministically.
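Concretely, the delay is the same trick described above: a sleep appended to the installed package's `__init__.py` (reproduction aid only):

```python
# Appended to the end of the installed redis/__init__.py (reproduction aid
# only): stall the first import long enough for the 1s operation timeout
# to fire in the middle of it.
import time
time.sleep(5)  # any value comfortably above the operation timeout works
```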
Please find below a compressed archive of three files, including a patched `__init__.py` for the `redis-py` package that adds the delay to the import: mcve.zip

Running `python -m mock_airflow` gives this result:

The important line is `'client' in dir(redis): False`, even after a re-import of the failed import; and if we uncomment the `print(redis.client)` line, it raises an error:

If we inject `from redis import client` into `redis/__init__.py`, the MCVE gives a different output (changes: `'client' in dir(redis): True` and `getattr(redis, 'client', None)=`):

Operating System
Debian GNU/Linux 12 (bookworm)
Versions of Apache Airflow Providers
apache-airflow-providers-celery==3.7.2
apache-airflow-providers-redis==3.7.1
Deployment
Official Apache Airflow Helm Chart
Deployment details
Helm: 3.11.3
Helm chart: apache/airflow:1.15.0
k8s: 1.28
Anything else?
A race condition that happens very rarely at the start of the scheduler service, but since we deploy to dev/staging many times a day, it occurs several times a day. Once it happens, it won't go away until the scheduler service is restarted. Conditions:
Are you willing to submit PR?
Code of Conduct