Open sun2everyone opened 1 year ago
Thanks for opening your first issue here! Be sure to follow the issue template!
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Please recheck the report against the latest Airflow version and let us know if the issue is still reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
I have had the exact same problem. This issue should not be marked as stale. @sun2everyone Did you solve this?
@rightx2 Happy to reopen it, but it would help if you could provide more information for anyone who might look at the issue (or maybe even look into it yourself and attempt a PR to fix it).
@potiuk I think what sun2everyone wrote is enough to describe the problem... Do you think it needs some other info?
No idea. But by providing more info specific to your case, you increase the chances that someone will have an easier time diagnosing it and maybe fixing it. This is an open-source project. Maybe someone will take a look at the issue, maybe not. Maybe they will decide to spend their time trying to reproduce, diagnose, and fix it, or maybe not. Maybe by gathering evidence you will find more information and show differences from the original report that make it easier to diagnose. Or maybe by gathering evidence you will find the cause yourself, or discover that it is a different issue. Finally, maybe no one will look at it at all. It seems no one has looked at or diagnosed it since December 2022.
All of these are maybes, all a matter of probability. By providing more information you shift the probability in favour of the problem being diagnosed and fixed.
That's all I am saying.
Simply speaking: the easier you make it for someone to diagnose and fix the problem, the more likely it is to get diagnosed and fixed. So if you want it fixed, spending time on making that easier (which mostly means gathering evidence, spotting patterns, and doing more diagnosis on your side) is a good way to increase the chances of it happening.
Sounds reasonable. I'll run some experiments to make it easier to reproduce, and update the issue.
Hello. I think the ideal solution to this issue is to rewrite the code that displays logs on the grid view page to work the way the logs page does (continuous batch loading). Here is a simpler workaround I implemented (Airflow 2.4.2):
```diff
diff --git a/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py b/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
index bda0af3..364a48f 100644
--- a/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
+++ b/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
@@ -33,7 +33,7 @@
 from airflow.utils.airflow_flask_app import get_airflow_app
 from airflow.utils.log.log_reader import TaskLogReader
 from airflow.utils.session import NEW_SESSION, provide_session
-
+from airflow.utils.state import State

 @security.requires_access(
     [
@@ -108,6 +108,8 @@
         token = URLSafeSerializer(key).dumps(metadata)  # type: ignore[assignment]
         return logs_schema.dump(LogResponseObject(continuation_token=token, content=logs))
     # text/plain. Stream
-    logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
-
+    if ti.state not in State.running:
+        logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
+    else:
+        logs = {"Task is still running, logs in grid view not available."}
     return Response(logs, headers={"Content-Type": return_type})
```
It simply disables displaying logs in the grid view while the task state is 'running'. It has worked in production for more than a year without any problems. I guess it could still cause trouble with a really huge task log, but I haven't run into such a case.
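To show the guard logic in isolation, here is a minimal stand-alone sketch of the idea behind the patch above. Airflow's `TaskInstance` and `TaskLogReader` are replaced with hypothetical stand-ins (`task_state`, `read_log_stream`), so this is an illustration of the approach, not the actual endpoint code:

```python
# Stand-in for the set of "still running" states (State.running in the patch).
RUNNING_STATES = {"running"}


def grid_view_logs(task_state, read_log_stream):
    """Stream real logs only for tasks that are no longer running.

    read_log_stream is a callable standing in for
    TaskLogReader.read_log_stream(ti, try_number, metadata).
    """
    if task_state not in RUNNING_STATES:
        return read_log_stream()
    # Returning a short static message means the request finishes immediately,
    # freeing the gunicorn worker instead of polling the log backend forever.
    return ["Task is still running, logs in grid view not available."]


# Usage: a finished task streams its logs; a running one gets the notice.
print(grid_view_logs("success", lambda: ["log line 1", "log line 2"]))
print(grid_view_logs("running", lambda: ["log line 1"]))
```

The trade-off is deliberate: users lose live log tailing in the grid view, but no request can block a worker indefinitely.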
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Airflow version: 2.4.2 with the Elasticsearch log provider (ELK 7.17.0). Python version: 3.8.13.
Short description: the Airflow webserver stops responding due to an infinite log-loading loop.
Details: if several users open the grid view page and try to view the logs of a running task, they see no logs, but the gunicorn workers become busy processing the log-loading requests. After some time all gunicorn workers are busy waiting for the end of the log, and the webserver stops responding to further client requests.
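The hang described above can be sketched as a polling loop that only exits when the log backend signals completion. This is an illustrative model, not Airflow's actual code; `read_chunk` and `max_polls` are hypothetical names introduced for the demo:

```python
def stream_task_logs(read_chunk, max_polls=None):
    """Yield log chunks until the backend signals the end of the log.

    read_chunk stands in for the log handler's read(); it returns
    (text, metadata) where metadata["end_of_log"] is a bool. While a task
    is still running, an Elasticsearch-style backend keeps answering
    end_of_log=False, so this loop -- and the gunicorn worker serving the
    request -- never finishes. max_polls is a safety cap for this demo only.
    """
    metadata = {"end_of_log": False}
    polls = 0
    while not metadata["end_of_log"]:
        text, metadata = read_chunk(metadata)
        yield text
        polls += 1
        if max_polls is not None and polls >= max_polls:
            break  # without a cap, a running task would block the worker forever


# Demo backend: pretends the log is complete after three chunks.
def finite_backend(metadata):
    n = metadata.get("n", 0) + 1
    return f"chunk {n}", {"end_of_log": n >= 3, "n": n}


print(list(stream_task_logs(finite_backend)))  # the loop exits after chunk 3
```

With a backend that never sets `end_of_log`, each open grid view request pins one synchronous worker, which matches the observed exhaustion of the gunicorn pool.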
What you can see in the webserver logs: constantly repeated requests to Elasticsearch.
What you think should happen instead
When trying to view the logs of a running task in the grid view, you should receive one of the following:
How to reproduce
Operating System
Ubuntu 20.04.2 LTS
Versions of Apache Airflow Providers
apache-airflow-providers-elasticsearch 4.2.1
Deployment
Virtualenv installation
Deployment details
No response
Anything else
I guess the problem is in these parts:
The log reading stream keeps a gunicorn worker busy the whole time the task is in the running state.
Using async workers like tornado makes it a little better, but doesn't solve the problem. For now, I use this workaround:
Are you willing to submit PR?
Code of Conduct