Closed luederm closed 7 months ago
We are running 2.8.1 and all is fine in this area on our side. From your description I assume you can reproduce the error with a fresh/clean setup as well? Have you tried one of the example DAGs delivered with Airflow? Or is the problem specific to the DAGs you use (carried over from the previous version)? If it is specific to your DAGs, you may need to post them. Otherwise, can you please post a bit more of the error logs? This is nothing I have seen recently.
Hello @jscheffl, thanks for looking into this.
From your description I assume you can reproduce the error with a fresh/clean setup as well?
Yep, I tested with a fresh conda environment and database.
Have you tried one of the example DAGs delivered by Airflow? Or is the problem specific to your DAGs that you use (from previous version)?
Yes, I tested with both the python operator example DAG and the bash operator example DAG.
Otherwise, can you please post a bit more of the logs of the error? That is nothing I have seen recently.
Here is the log from when I triggered the DAG to when the DAG run finished, using the example bash operator: 2_8_1_standalone_example_error.log
I tried to reproduce it with venv and could not. I guess the problem is the mixing of conda packages and pip packages: it looks like you have the conda Python ABI installed (which is how the compiled google-re2 library talks to Python), and I suspect this interferes with the pip-installed packages. The way Airflow forks processes and uses shared memory to communicate might also interfere with it.
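To check whether a compiled dependency came from conda or from a pip wheel, you can inspect where the interpreter resolves it from. The sketch below is a rough heuristic I am assuming here (conda prefixes usually contain "conda" or "anaconda" in their paths), not an authoritative detection method:

```python
import importlib.util

# Heuristic sketch (assumption: conda-built packages live under a prefix
# whose path contains "conda"/"anaconda"; pip wheels in a venv do not).
def looks_like_conda_install(module_path: str) -> bool:
    lowered = module_path.lower()
    return "conda" in lowered or "anaconda" in lowered

def install_origin(module_name: str) -> str:
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return "not installed"
    return "conda" if looks_like_conda_install(spec.origin) else "pip/venv"

# Inspect a few modules in the current interpreter
# ("re2" is google-re2's import name).
for name in ("re2", "sqlalchemy"):
    print(name, "->", install_origin(name))
```

Running this inside the affected environment should tell you whether the google-re2 that Airflow actually imports is the conda build or a pip wheel.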
You can do two things to verify this hypothesis.
First, run the failing task manually; you will find both values to replace by looking in your log for 'airflow', 'tasks', 'run':
airflow tasks run example_bash_operator this_will_skip PUT_RUN_ID_OF_A_TASK_YOU_RUN_HERE \
--local --subdir PATH_TO_YOUR_EXAMPLE_DAG
This one should run and print something like:
[2024-02-05T18:07:59.543+0100] {task_command.py:423} INFO - Running <TaskInstance: example_bash_operator.this_will_skip manual__2024-02-05T16:44:54+00:00 [skipped]> on host jaroslaws-macbook-pro.local
Second, create a fresh virtual environment with python -m venv, install Airflow there, and repeat the standalone tests. It should work flawlessly (it worked when I created everything from scratch). Can you please send us back the results of those experiments?
I ran the task manually as you suggested. I didn't see much in the log output besides for:
INFO - Running <TaskInstance: example_bash_operator.this_will_skip manual__2024-02-05T14:07:31+00:00 [failed]> on host C********
It also looks like it failed in the UI.
I created a venv using my system's Python (3.9):
/usr/bin/python3 -m venv envs/2.8.1
source envs/2.8.1/bin/activate
pip install "apache-airflow[async,celery,crypto,jdbc,ldap,password,mysql,postgres,redis,s3,sftp,ssh,slack]==2.8.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.9.txt"
One notable thing is that when using venv, google-re2 installed through pip without error. That was not the case with my conda environment, which is why I had installed it through conda instead of pip. However, now I just get a new error when I try to access the UI:
standalone | Airflow Standalone is for development purposes only. Do not use this in production!
webserver | [2024-02-05 14:46:57 -0500] [3621] [ERROR] Worker (pid:3653) was sent SIGKILL! Perhaps out of memory?
webserver | [2024-02-05 14:46:57 -0500] [3678] [INFO] Booting worker with pid: 3678
webserver | [2024-02-05 14:46:58 -0500] [3621] [ERROR] Worker (pid:3655) was sent SIGKILL! Perhaps out of memory?
webserver | [2024-02-05 14:46:58 -0500] [3686] [INFO] Booting worker with pid: 3686
webserver | [2024-02-05 14:46:58 -0500] [3621] [ERROR] Worker (pid:3656) was sent SIGKILL! Perhaps out of memory?
webserver | [2024-02-05 14:46:58 -0500] [3621] [ERROR] Worker (pid:3657) was sent SIGKILL! Perhaps out of memory?
webserver | [2024-02-05 14:46:58 -0500] [3689] [INFO] Booting worker with pid: 3689
webserver | [2024-02-05 14:46:58 -0500] [3690] [INFO] Booting worker with pid: 3690
webserver | [2024-02-05 14:47:03 -0500] [3621] [ERROR] Worker (pid:3690) was sent SIGKILL! Perhaps out of memory?
webserver | [2024-02-05 14:47:03 -0500] [3705] [INFO] Booting worker with pid: 3705
Memory doesn't seem like the problem, as I still have an ample amount when these errors are thrown.
One notable thing is that when using venv, google-re2 installed through pip without error.
Of course. This will also work in conda as soon as the conda maintainers stop giving compilers macOS 10.9 system libraries to build packages against (10.9 reached end of life 7 years ago). This is basically why installing google-re2 fails on conda: it forces the package to be built against system libraries that had their last update 7 years ago. Pip does it properly, using the libraries of the current system you are on. And google-re2 was first released after 10.9 reached EOL.
https://github.com/conda-forge/conda-forge.github.io/issues/1844
If you want, you can even comment in the issue above. We always recommend using pip.
However, now I just get a new error when I try to access the UI:
I also have no such problem.
Having your workers killed indicates that something in your system is triggering these errors. You have not mentioned whether you are on ARM (M1/M2/M3), but I guess so. If so, use at least Python 3.10: a number of libraries have worse ARM support on Python 3.9, so you are better off with a higher Python version, and that might be causing problems on your system. The problem is that when a task is killed with SIGKILL, whatever kills it gives it no chance to write anything to the log, so we will not find out what killed it. This might be related to some specifics of your environment, for example the way you installed Python (was it conda?). Various ways of installing Python can have problems, and I would not be surprised if conda were causing it.
For the webserver, you can also try different gunicorn startup options (see the configuration); some of them might cause problems if system libraries got modified.
Generally, make sure your whole environment is set up outside of conda and you should be good.
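As a quick sanity check before starting Airflow, you can verify that the interpreter on PATH does not resolve into a conda prefix. This is a hypothetical check based on the same path heuristic as above, not an official diagnostic:

```shell
# Hypothetical sanity check: make sure python3 on PATH is not a conda build.
py_path="$(command -v python3)"
case "$py_path" in
  *conda*) echo "WARNING: python3 resolves into a conda prefix: $py_path" ;;
  *)       echo "OK: python3 is outside conda: $py_path" ;;
esac
```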
Oh, interesting. I will keep that in mind next time I have compilation issues in a conda environment.
You have not mentioned if you have ARM/ M1/2/3 but I guess so. Then use Python 3.10 at least.
Actually, my MacBook has an Intel chip.
This might be related to some specifics of your environment - for example the way how you installed Python (was it conda?).
The python I used with the venv was not installed with conda. I think it was installed with a MacOS update.
Since you mentioned using at least python 3.10, I tested airflow 2.8.1 again but this time creating a venv with python 3.11, which was installed with brew. The error I get when running the example bash operator is the same as the original error:
scheduler | [2024-02-06T11:33:33.663-0500] {taskinstance.py:2700} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_0 manual__2024-02-06T16:33:26+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
I also tried other Airflow versions and noticed the error starts at 2.7.0.
Also, I should have checked this earlier, but I found crash reports in the Mac Console:
-------------------------------------
Translated Report (Full Report Below)
-------------------------------------
Process: Python [55375]
Path: /usr/local/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python
Identifier: org.python.python
Version: 3.11.6 (3.11.6)
Code Type: X86-64 (Native)
Parent Process: Python [54935]
Responsible: pycharm [6073]
Date/Time: 2024-02-06 14:34:39.7479 -0500
OS Version: macOS 13.6 (22G120)
Report Version: 12
Bridge OS Version: 8.1 (21P1069)
Time Awake Since Boot: 320000 seconds
Time Since Wake: 24719 seconds
System Integrity Protection: enabled
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Codes: 0x0000000000000001, 0x0000000000000000
Termination Reason: Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process: exc handler [55375]
VM Region Info: 0 is not in any region. Bytes before following region: 4391043072
REGION TYPE START - END [ VSIZE] PRT/MAX SHRMOD REGION DETAIL
UNUSED SPACE AT START
--->
__TEXT 105ba0000-105ba4000 [ 16K] r-x/r-x SM=COW .../MacOS/Python
Application Specific Information:
*** multi-threaded process forked ***
crashed on child side of fork pre-exec
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libresolv.9.dylib 0x7ff81b06bdab dns_res_send + 3655
1 libresolv.9.dylib 0x7ff81b06af53 res_9_nsend_2 + 32
2 libresolv.9.dylib 0x7ff81b06a04c res_nquery_soa_min + 295
3 libresolv.9.dylib 0x7ff81b07304d _pdns_query + 145
4 libresolv.9.dylib 0x7ff81b072b13 _sdns_search + 732
5 libresolv.9.dylib 0x7ff81b0731fb dns_search + 158
6 libkrb5.3.3.dylib 0x108b31ddf krb5int_dns_init + 188
7 libkrb5.3.3.dylib 0x108b321fe k5_make_uri_query + 129
8 libkrb5.3.3.dylib 0x108b36a65 locate_server + 1359
9 libkrb5.3.3.dylib 0x108b364b2 k5_locate_server + 73
10 libkrb5.3.3.dylib 0x108b37c01 krb5_sendto_kdc + 201
11 libkrb5.3.3.dylib 0x108b13a1f krb5_tkt_creds_get + 212
12 libkrb5.3.3.dylib 0x108b14671 krb5_get_credentials + 113
13 libgssapi_krb5.2.2.dylib 0x108a00f10 krb5_gss_init_sec_context_ext + 2663
14 libgssapi_krb5.2.2.dylib 0x108a01559 krb5_gss_init_sec_context + 63
15 libgssapi_krb5.2.2.dylib 0x1089f0560 gss_init_sec_context + 497
16 libpq.5.dylib 0x10888b77d pqsecure_open_gss + 901
17 libpq.5.dylib 0x108877caf PQconnectPoll + 3417
18 libpq.5.dylib 0x1088752b0 connectDBComplete + 284
19 libpq.5.dylib 0x1088753db PQconnectdb + 36
20 _psycopg.cpython-311-darwin.so 0x108821dc4 conn_connect + 244
21 _psycopg.cpython-311-darwin.so 0x1088236a9 connection_init + 361
22 Python 0x106147975 type_call + 128
23 Python 0x1060ee998 _PyObject_MakeTpCall + 126
24 Python 0x1060efc80 _PyObject_CallFunctionVa + 295
25 Python 0x1060eff9a _PyObject_CallFunction_SizeT + 149
26 _psycopg.cpython-311-darwin.so 0x108830405 psyco_connect + 213
27 Python 0x10613240f cfunction_call + 50
28 Python 0x1060ef60a _PyObject_Call + 122
29 Python 0x1061c585e _PyEval_EvalFrameDefault + 54618
30 Python 0x1061c8082 _PyEval_Vector + 92
31 Python 0x1060ef32c _PyVectorcall_Call + 134
32 Python 0x1061c585e _PyEval_EvalFrameDefault + 54618
33 Python 0x1061c8082 _PyEval_Vector + 92
34 Python 0x1060f1dff method_vectorcall + 344
35 Python 0x1060ef32c _PyVectorcall_Call + 134
36 Python 0x1061c585e _PyEval_EvalFrameDefault + 54618
37 Python 0x1061c8082 _PyEval_Vector + 92
38 Python 0x1060eebe3 _PyObject_FastCallDictTstate + 87
39 Python 0x10614f066 slot_tp_init + 185
40 Python 0x106147975 type_call + 128
41 Python 0x1060ee998 _PyObject_MakeTpCall + 126
42 Python 0x1061c3592 _PyEval_EvalFrameDefault + 45710
43 Python 0x1061c8082 _PyEval_Vector + 92
44 Python 0x1060eec63 _PyObject_FastCallDictTstate + 215
45 Python 0x10614f066 slot_tp_init + 185
46 Python 0x106147975 type_call + 128
47 Python 0x1060ee998 _PyObject_MakeTpCall + 126
48 Python 0x1061c3592 _PyEval_EvalFrameDefault + 45710
49 Python 0x1061c8082 _PyEval_Vector + 92
50 Python 0x1060ef32c _PyVectorcall_Call + 134
51 Python 0x1061c585e _PyEval_EvalFrameDefault + 54618
52 Python 0x1061c8082 _PyEval_Vector + 92
53 Python 0x1060ef32c _PyVectorcall_Call + 134
54 Python 0x1061c585e _PyEval_EvalFrameDefault + 54618
55 Python 0x1061c8082 _PyEval_Vector + 92
56 Python 0x1061c585e _PyEval_EvalFrameDefault + 54618
57 Python 0x1061c8082 _PyEval_Vector + 92
58 Python 0x1060f1e52 method_vectorcall + 427
59 Python 0x1061c585e _PyEval_EvalFrameDefault + 54618
60 Python 0x1061b77a5 PyEval_EvalCode + 191
61 Python 0x1062105ec run_eval_code_obj + 72
62 Python 0x10621057c run_mod + 96
63 Python 0x106212527 PyRun_StringFlags + 100
64 Python 0x10621248b PyRun_SimpleStringFlags + 69
65 Python 0x10622a838 pymain_run_command + 134
66 Python 0x10622a305 Py_RunMain + 302
67 Python 0x10622b416 Py_BytesMain + 42
68 dyld 0x7ff80af6741f start + 1903
I am using postgres for the Airflow DB. Does your setup still work using postgres?
Yes, it's a known issue for Python (not for Airflow) to have broken threading behaviour, specifically when you are using proxies: https://github.com/apache/airflow/discussions/24463. You can look for similar issues and the various workarounds that worked for different people and libraries. One such solution was forcing "no proxy" (look in the discussion). You can also try setting this config option: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter - it helped a number of people as well.
But it is very random and only happens for some people, so we cannot do much about it.
Similar issue here: https://bugs.python.org/issue24273 - but there might be other reasons why you get a SIGSEGV; mostly it's because of something in your environment.
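The config workaround mentioned above can be sketched as an airflow.cfg change (option name taken from the linked configuration reference; treat this as a sketch to verify against your Airflow version, not a confirmed fix):

```ini
[core]
# Run each task in a brand-new Python interpreter instead of forking the
# parent process; this sidesteps fork-after-thread crashes like the
# "crashed on child side of fork pre-exec" report above, at the cost of
# slower task startup.
execute_tasks_new_python_interpreter = True
```

The "no proxy" workaround from the linked discussion is typically an environment variable (for example, exporting no_proxy before starting Airflow); check the discussion for the exact form that worked for others.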
Apache Airflow version
2.8.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
After upgrading from airflow version 2.6 to 2.8, all DAGs I trigger through the UI fail immediately with the error:
ERROR - Executor reports task instance <TaskInstance: ... [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
I tested this with python version 3.9 and 3.11, and the issue persisted with both versions.
DAGs run successfully when executed directly with
.test()
What you think should happen instead?
No response
How to reproduce
On MacOS, create a conda environment with python 3.11:
conda create -n airflow-2-8-1 python=3.11
Activate the environment and install google-re2 through conda:
conda activate airflow-2-8-1
conda install google-re2
Install airflow:
pip install "apache-airflow[async,celery,crypto,jdbc,ldap,password,mysql,postgres,redis,s3,sftp,ssh,slack]==2.8.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
Set up airflow config and database.
Run airflow:
airflow standalone
Trigger an example DAG.
Operating System
MacOS Ventura 13.6
Versions of Apache Airflow Providers
No response
Deployment
Other
Deployment details
Airflow was installed in a stand-alone conda environment. The command used to install Airflow (when testing with Python 3.11) is:
pip install "apache-airflow[async,celery,crypto,jdbc,ldap,password,mysql,postgres,redis,s3,sftp,ssh,slack]==2.8.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
The package google-re2 was installed through conda prior to running this command because it fails to install through pip.
I created a new airflow database for testing the deployment. I tried running the airflow server with
airflow standalone
and by running each component separately. I have Airflow configured to use the LocalExecutor. I can successfully trigger tasks with the SequentialExecutor.
Anything else?
Output from conda list:
Are you willing to submit PR?
Code of Conduct