Closed meetri closed 2 years ago
Thanks for opening your first issue here! Be sure to follow the issue template!
Not sure how to, or if it's worth, making a pull request for this, but here is the change I made to get the Airflow scheduler working again.
diff --git a/airflow/callbacks/callback_requests.py b/airflow/callbacks/callback_requests.py
index 8112589cd..ecf0ae1c2 100644
--- a/airflow/callbacks/callback_requests.py
+++ b/airflow/callbacks/callback_requests.py
@@ -16,6 +16,7 @@
# under the License.
import json
+from datetime import date, datetime
from typing import TYPE_CHECKING, Optional
if TYPE_CHECKING:
@@ -76,7 +77,13 @@ class TaskCallbackRequest(CallbackRequest):
def to_json(self) -> str:
dict_obj = self.__dict__.copy()
dict_obj["simple_task_instance"] = dict_obj["simple_task_instance"].__dict__
- return json.dumps(dict_obj)
+
+ def datetime_serializer(obj):
+ if isinstance(obj, (datetime, date)):
+ return obj.isoformat()
+ raise TypeError(f"Type {type(obj)} not serializable")
+
+ return json.dumps(dict_obj, default=datetime_serializer)
@classmethod
def from_json(cls, json_str: str):
Could you please make a PR + test with it so that we could discuss it there @meetri ?
I have the same issue using the CeleryExecutor but not the KubernetesExecutor; the issue still persists after upgrading to 2.3.3.
@nicolamarangoni - It's very hard to comment on your "I have the same issue" without any details (because I can only guess what kind of error you have). You say "the same" but it might very well be a different issue with some similarities. Commenting "I have the same issue" generally adds exactly 0 value if not accompanied by some evidence. It brings no-one any closer to explaining the mistake people make, or to diagnosing and solving it.
It actually serves no purpose whatsoever, except slightly annoying the people looking at it, because there is someone who could help with diagnosing and fixing an issue, and yet the only thing the person does is complain that they have the same issue.
Please - if you want any other action from your comment - provide some logs, and circumstances where it happened for you @nicolamarangoni
@potiuk I have several pods with Airflow 2.3.3. In some of them I set the KubernetesExecutor, in others the CeleryExecutor with 2 workers. Some of the pods with Celery look fine, but they have at most 100 DAGs and very few concurrently running DAGs (maybe 2-3 at most). The pods with the CeleryExecutor and many DAGs (> 150), on the other hand, have the scheduler crashing with the same error message and the same stack trace as @meetri wrote. I cannot tell how many concurrent DAGs/jobs would be running on those pods because the scheduler crashes right after importing the DAGs. What other information would be useful for analysis?
As usual: DAGs that fail, logs, what investigation you have done so far, which DAGs are failing. And most of all - HOW DO YOU KNOW IT'S THE SAME ERROR? We aren't even sure what the root cause of this one is. I have been working on Airflow for 4 years and I would definitely not be able to assess that something is "the same" even if the error message is the same. Did you compare the stack trace? Is it the same, down to the single line, as the one reported here? Are the DAGs identical?
Almost by definition, if you have a different version of Airflow the error cannot be "the same", because the code, and the path the code follows, is very likely different. If you based your assessment (I guess) on seeing "datetime is not serializable", there are about 50 places in the code (I am wild-guessing) where such an error might happen, and you probably have to multiply that by the number of various inputs.
By not providing that evidence (ideally in a separate issue) you seem to "know better" and give the people who know Airflow no chance to assess whether it is the same error, a different error, or maybe your own mistake. The worst thing that can happen is that we mark it as a duplicate. The best thing that happens when you just write "I have the same issue" without any evidence is that you annoy people, but it brings no-one any closer to helping you solve your issue (yes, you have to remember this is YOUR issue, and the people here - often in their free time - are trying to help you solve your problem). This is not a helpdesk. You don't demand answers here or urge people. You provide helpful information that people looking here might use to help you.
Please watch my talk here - you might understand more about Exercising your empathy
Going through the same issue with 2.3.2 and CeleryKubernetesExecutor. Even after removing `from datetime import datetime`
from all our dags, this issue still occurred every now and then.
Same comment here. If you don't provide any evidence, logs, or details, adding a comment adds no real value and does not bring us closer to solving the problem @jensenity. Please, pretty please.
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
Apache Airflow version
2.3.2 (latest released)
What happened
The scheduler crashes with the following exception. Once the scheduler crashes, restarts cause it to immediately crash again. To get the scheduler working again, all dags must be paused and all running tasks must have their state changed to up for retry. This is something we just started noticing after switching to the CeleryKubernetesExecutor.
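The failure mode can be reproduced in isolation: the stock `json.dumps` raises `TypeError` as soon as a raw `datetime` appears anywhere in the payload (minimal sketch, not Airflow code; the key name is illustrative):

```python
import json
from datetime import datetime

# Any dict carrying a raw datetime (e.g. a task's execution date)
# cannot be serialized by the default JSON encoder.
payload = {"execution_date": datetime(2022, 6, 1)}

try:
    json.dumps(payload)
except TypeError as exc:
    # e.g. "Object of type datetime is not JSON serializable"
    print(f"serialization failed: {exc}")
```

Because the scheduler retries the same callback on restart, the unchanged payload fails again immediately, which matches the endless restart loop described above.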
What you think should happen instead
The error itself seems like a minor issue; it should not happen and should be easy to fix. But what seems like a bigger issue is that the scheduler was not able to recover on its own and was stuck in an endless restart loop.
How to reproduce
I'm not sure of the simplest step-by-step way to reproduce this. But the conditions of my Airflow workload were about 4 active dags chugging through with about 50 max active runs and 50 concurrent tasks each, with one dag set to 150 max active runs and 50 concurrent (not really that much).
The dag with the 150 max active runs is using the KubernetesExecutor to create a pod in the local Kubernetes environment. This, I think, is the reason we're seeing this issue all of a sudden.
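A DAG matching the described setup might look like the following config sketch (the dag id and schedule are hypothetical; only the concurrency numbers come from the report, and `max_active_tasks` is the Airflow 2.2+ name for per-DAG task concurrency):

```python
from datetime import datetime

from airflow import DAG

# Hypothetical DAG mirroring the reported setup:
# 150 max active runs, 50 concurrent tasks.
with DAG(
    dag_id="high_concurrency_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    max_active_runs=150,
    max_active_tasks=50,
) as dag:
    ...
```
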
Hopefully this helps in potentially reproducing it.
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sqlite==2.1.3
Deployment
Other Docker-based deployment
Deployment details
We create our own Airflow base images using the instructions provided on your site; here is a snippet of the code we use to install.
We then use this Docker image for all of our Airflow workers, scheduler, dag processor, and Airflow web. This is managed through a custom Helm chart. We have also incorporated pgbouncer to manage DB connections, similar to the publicly available Helm charts.
Anything else
The problem seems to occur quite frequently. It makes the system completely unusable.
Are you willing to submit PR?
Code of Conduct