apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Viewing logs of running task in grid view hangs airflow-webserver [elasticsearch] #28326

Open sun2everyone opened 1 year ago

sun2everyone commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow version: 2.4.2 with the Elasticsearch log provider (ELK 7.17.0)
Python version: 3.8.13

Short description: The Airflow webserver stops responding due to an infinite log-loading loop.

Details: If several users open the grid view page and try to view logs of a running task, they see no logs, but gunicorn workers become busy processing log-loading requests. After some time all gunicorn workers are busy waiting for the end of the log, and the webserver stops responding to further client requests.

What you can see in the webserver logs: constantly repeating requests to Elasticsearch

{base.py:270} INFO - POST http://127.0.0.1:9202/_count [status:200 request:0.006s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_search [status:200 request:0.015s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_search [status:200 request:0.015s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_count [status:200 request:0.010s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_search [status:200 request:0.014s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_count [status:200 request:0.011s]

What you think should happen instead

When trying to view logs of a running task in grid view, you should get one of the following:

  1. A message that running task logs can't be shown
  2. Partially loaded logs that exist at the moment the page is loaded
  3. Continuously appearing logs, as on the main task log page

How to reproduce

  1. Start the airflow webserver with ELK logging and sync gunicorn workers
  2. Create a DAG with a task that runs for a long time and constantly writes logs (see the sketch after this list)
  3. Run that DAG
  4. Open that DAG in grid view and try to view logs of the running task
  5. Open more pages viewing logs of that running task in grid view (5-6 or more)
  6. The webserver starts to work slowly, skips healthchecks, and does not respond to client requests
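A minimal sketch of the DAG from step 2, assuming Airflow 2.x with the standard PythonOperator; the DAG and task names are made up for this repro:

import logging
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

logger = logging.getLogger(__name__)


def chatter(duration_seconds: int = 3600) -> None:
    # Run for a long time, emitting a log line every second so the
    # Elasticsearch log keeps growing while the task is in the Running state.
    for i in range(duration_seconds):
        logger.info("still running, tick %d", i)
        time.sleep(1)


with DAG(
    dag_id="long_running_log_writer",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="chatter", python_callable=chatter)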

Operating System

Ubuntu 20.04.2 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-elasticsearch 4.2.1

Deployment

Virtualenv installation

Deployment details

No response

Anything else

I guess the problem is in this part: the log-reading stream keeps a gunicorn worker busy for the whole time the task is in the Running state.
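A simplified sketch of that pattern (not the actual Airflow source): the text/plain branch of the log endpoint returns a generator that keeps polling the log backend until it sees an end-of-log marker, so for a running task a sync worker serving the response never gets free:

import time
from typing import Callable, Iterator, Tuple


def read_log_stream_sketch(read_chunk: Callable[[], Tuple[str, bool]]) -> Iterator[str]:
    # read_chunk stands in for one Elasticsearch fetch; it returns
    # (log_text, end_of_log). A running task never reports end_of_log,
    # so this generator - and the worker driving it - loops indefinitely.
    end_of_log = False
    while not end_of_log:
        text, end_of_log = read_chunk()
        yield text
        if not end_of_log:
            time.sleep(1)  # wait before polling the backend again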

Using async workers like tornado makes it a little better, but doesn't solve the problem. For now, I use this workaround:

--- log_endpoint_py.default    2022-12-12 19:16:30.280526903 +0300
+++ log_endpoint.py    2022-12-12 19:03:29.128970011 +0300
@@ -33,7 +33,7 @@
 from airflow.utils.airflow_flask_app import get_airflow_app
 from airflow.utils.log.log_reader import TaskLogReader
 from airflow.utils.session import NEW_SESSION, provide_session
-
+from airflow.utils.state import State 

@security.requires_access(
     [
@@ -108,6 +108,8 @@
         token = URLSafeSerializer(key).dumps(metadata)  # type: ignore[assignment]
         return logs_schema.dump(LogResponseObject(continuation_token=token, content=logs))
     # text/plain. Stream
-    logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
-
+    if ti.state not in State.running:
+        logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
+    else:
+        logs = {"Task is still running, logs in grid view not available."}
     return Response(logs, headers={"Content-Type": return_type}) 
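Note: if I read the patch right, the one-element set works because Flask streams any iterable of str/bytes chunks; passing the message as a plain string would be equivalent and a bit clearer:

logs = "Task is still running, logs in grid view not available."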

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template!

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

github-actions[bot] commented 7 months ago

This issue has been closed because it has not received a response from the issue author.

rightx2 commented 6 months ago

I have had the exact same problem. This issue should not be marked as stale. @sun2everyone Did you solve this?

potiuk commented 6 months ago

@rightx2 happy to reopen it, but it would help if you could provide more information that can help someone who might look at the issue (or maybe even look at it yourself and attempt to make a PR to fix it).

rightx2 commented 6 months ago

@potiuk I think what sun2everyone wrote is enough to describe the problem... Do you think it needs some other info?

potiuk commented 6 months ago

> @potiuk I think what sun2everyone wrote is enough to describe the problem... Do you think it needs some other info?

No idea. But by providing more info specific to your case, you increase the chances that someone will have an easier time diagnosing it and maybe fixing it. This is an open-source project. Maybe someone will take a look at the issue, maybe not. Maybe they will decide to spend their time trying to reproduce, diagnose and fix it, or maybe not. Maybe if you gather some evidence you will find more information and show differences vs. the original report that will make it easier to diagnose. Or maybe by gathering evidence you will find it yourself, or maybe you will find that this is a different issue. Finally - maybe no-one will look at it. It seems like no-one has looked at it or diagnosed it since December 2022.

All of these are maybes. All on a scale of probability. And by providing more information you likely move the probability in favour of the problem being diagnosed and fixed.

That's all I am saying.

potiuk commented 6 months ago

Simply speaking - the easier you make it for someone to diagnose and fix it, the more likely it is to get diagnosed and fixed. So if you want to get it fixed, spending time on making that easier (which mostly means gathering evidence, spotting patterns, and doing more diagnosis on your side) is a good way to increase the chances of that happening.

rightx2 commented 6 months ago

Sounds reasonable. I'll do some experiments to reproduce it more easily and update the issue.

sun2everyone commented 6 months ago

Hello. I think the ideal solution to this issue is to rewrite the code for displaying logs on the grid view page the way it is done on the logs page (continuous batch loading; a sketch of that pattern follows at the end of this comment). This is the simpler solution I implemented (Airflow 2.4.2):

diff --git a/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py b/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
index bda0af3..364a48f 100644
--- a/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
+++ b/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
@@ -33,7 +33,7 @@
 from airflow.utils.airflow_flask_app import get_airflow_app
 from airflow.utils.log.log_reader import TaskLogReader
 from airflow.utils.session import NEW_SESSION, provide_session
-
+from airflow.utils.state import State

 @security.requires_access(
     [
@@ -108,6 +108,8 @@
         token = URLSafeSerializer(key).dumps(metadata)  # type: ignore[assignment]
         return logs_schema.dump(LogResponseObject(continuation_token=token, content=logs))
     # text/plain. Stream
-    logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
-
+    if ti.state not in State.running:
+        logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
+    else:
+        logs = {"Task is still running, logs in grid view not available."}
     return Response(logs, headers={"Content-Type": return_type})

It just disables displaying logs in the grid view if the task state is 'running'. It has worked in production for more than a year without any problems. I guess it could still cause some trouble with a really huge task log, but I haven't met such cases.
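For illustration, a hedged sketch of the "continuous batch loading" pattern suggested above, using the same continuation-token JSON branch that the patched endpoint already exposes; the base URL, credentials, and DAG/task identifiers below are placeholders:

import time

import requests

BASE = "http://localhost:8080/api/v1"  # placeholder webserver URL
url = (
    f"{BASE}/dags/long_running_log_writer/dagRuns/manual__2023-01-01"
    "/taskInstances/chatter/logs/1"
)

token = None
for _ in range(10):  # poll a bounded number of batches in this sketch
    params = {"token": token} if token else {}
    resp = requests.get(
        url,
        params=params,
        headers={"Accept": "application/json"},
        auth=("admin", "admin"),  # placeholder credentials
    )
    resp.raise_for_status()
    body = resp.json()
    print(body["content"], end="")
    token = body["continuation_token"]
    # The client decides when to fetch the next batch, so no gunicorn
    # worker is held open between batches.
    time.sleep(5)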