ivanp221 opened 7 months ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
Is it reproduced in the latest apache-airflow-providers-elasticsearch? It may also be worthwhile to check on the latest Airflow (2.9.0 at the moment).
Will test and report, thank you.
It's also worth checking your browser's dev tools to see what happens when loading the page.
Sorry for the delay - I was having trouble updating to 2.9.0.
Tested with the latest provider on Airflow 2.6.2 (apache-airflow-providers-elasticsearch==5.3.4) - same.
Tested with Airflow 2.9.0 and apache-airflow-providers-elasticsearch==5.3.4 - same.
Also checked dev tools. While pending, it shows:
Request URL: https://url/api/v1/dags/Redis_Populate_Cities/dagRuns/manual__2024-05-03T15:35:16.283558+00:00/taskInstances/populate_redis_cities/logs/1?full_content=false
Referrer Policy: strict-origin-when-cross-origin
When there's no log, it shows:
Request URL: https://url/api/v1/dags/Redis_Populate_Cities/dagRuns/manual__2024-04-04T07:01:51.667463+00:00/taskInstances/populate_redis_cities/logs/1?full_content=false
Request Method: GET
Status Code: 200 OK
Remote Address: address:443
Referrer Policy: strict-origin-when-cross-origin
The same happens to me. I am using image 2.6.3-python3.11.
Grid menu logs call this API:
pod logs:
[2024-06-11T01:31:13.259+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:13.774+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.220+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.289+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.736+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.805+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:15.252+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
....
[2024-06-11 01:31:22 +0900] [109] [CRITICAL] WORKER TIMEOUT (pid:161)
[2024-06-11 01:31:22 +0900] [161] [INFO] Worker exiting (pid: 161)
[2024-06-11 01:31:23 +0900] [109] [WARNING] Worker with pid 161 was terminated due to signal 9
[2024-06-11 01:31:23 +0900] [168] [INFO] Booting worker with pid: 168
I'm seeing the same issue with the following versions:
Apache Airflow version: 2.8.4
apache-airflow-providers-elasticsearch: 5.3.3
It only appears to be happening on a single DAG and task run, not all of them.
It looks like there's an indefinite loop when attempting to get the logs from Elasticsearch, which happens when reading the log stream: https://github.com/apache/airflow/blob/2.8.4/airflow/utils/log/log_reader.py#L86-L99
This, along with the retry every second, is causing the instance to eventually crash and restart. I'm not sure what's causing the logs to never finish being polled, but this is having a significant impact on the health of Airflow.
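For context, the referenced loop behaves roughly like this (a paraphrased sketch of read_log_stream in log_reader.py, with read_log_chunks standing in for the Elasticsearch-backed handler call; not the verbatim source):

import time

def read_log_stream(ti, try_number, metadata):
    # The webserver keeps re-reading log chunks until the task handler
    # reports end_of_log in the returned metadata.
    while True:
        logs, metadata = read_log_chunks(ti, try_number, metadata)  # queries ES
        yield from logs
        if metadata.get("end_of_log"):
            break
        # If the handler never sets end_of_log (e.g. the marker document is
        # missing from Elasticsearch), this loop keeps hitting ES roughly
        # once per second, forever.
        time.sleep(1)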
@ryandutton @sjnov11 @ivanp221 Is it possible the logs are too large for the webserver to render, resulting in the webserver getting OOMKilled?
I'm not having it OOM-killed, just many webserver logs as it sends requests to Elastic - logs like in my original post.
I don't believe so; the logs as viewed in Kibana look consistent with every other run we have where the logs load normally. Also, it doesn't appear to be OOM-killed; there is an increase in memory at the time this is happening, but nothing significant.
Checking again: can any of the reporters reproduce the issue with Airflow 2.9 or (better) the main branch?
I was wondering if it's somehow connected to the fact that I don't have an "end_of_log" line in my logs?
I solved this issue by adding a new message into Elasticsearch with message: "end_of_log":
curl -X POST "http://elasticsearch-logs:9200/airflow-logs/_doc/" -H "Content-Type: application/json" -d '{ "log_id": "for_test_start_parse_2024_08_04T06_15_00_000000_1", "message": "end_of_log", "offset": 1 }'
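The same marker document can also be added with the Python client, if that's more convenient (a sketch assuming elasticsearch-py 8.x and the same index and log_id as the curl above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-logs:9200")
# log_id must match what the webserver queries for the stuck task instance;
# the offset only needs to sort after the last real log line.
es.index(
    index="airflow-logs",
    document={
        "log_id": "for_test_start_parse_2024_08_04T06_15_00_000000_1",
        "message": "end_of_log",
        "offset": 1,
    },
)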
Airflow 2.9.0, apache-airflow-providers-elasticsearch 5.3.3 - same problem. There is no "end_of_log" in Elastic. The /log? page shows the log and reloads infinitely (it never gets end_of_log); dags/{dag_id}/grid? shows a failed status for the task instance. After some time the webserver hits an OOM error.
cc @Owen-CH-Leung did you encounter this problem in your latest work on the elastic provider?
@eladkal No, the Logs tab is rendering fine on the main branch. I've enabled Airflow remote logging with ES and I'm able to render logs from ES:
Here are some log outputs from webserver:
[2024-08-20T06:55:34.714+0000] {_transport.py:349} INFO - POST http://es01:9200/_all/_count [status:200 duration:0.008s]
[2024-08-20T06:55:34.719+0000] {_transport.py:349} INFO - POST http://es01:9200/_all/_search [status:200 duration:0.004s]
192.168.65.1 - - [20/Aug/2024:06:55:34 +0000] "GET /api/v1/dags/elasticsearch_sql_dag/dagRuns/manual__2024-08-20T06:52:41.314339+00:00/taskInstances/run_es_query/logs/1?full_content=false HTTP/1.1" 200 4887 "http://localhost:28080/dags/elasticsearch_sql_dag/grid?dag_run_id=manual__2024-08-20T06%3A52%3A41.314339%2B00%3A00&task_id=run_es_query&tab=logs" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
@Owen-CH-Leung can you try a long-running task with 5k+ log lines?
Also no problem. I wrote a simple DAG that executes a SHOW TABLES query:
from airflow.providers.elasticsearch.hooks.elasticsearch import ElasticsearchSQLHook

def query_elasticsearch():
    es = ElasticsearchSQLHook(elasticsearch_conn_id='elasticsearch_default')
    es_connection = es.get_conn()
    response = es_connection.execute_sql("SHOW TABLES")
    # Repeat the rows 5000 times to generate a large log file
    for i in range(5000):
        for row in response["rows"]:
            print(f"row: {row}")
This results in a log file which consists of around 35k log lines:
airflow git:(main) ✗ cat logs/dag_id=elasticsearch_sql_dag/run_id=manual__2024-08-20T07:19:51.268943+00:00/task_id=run_es_query/attempt=1.log | wc -l
35028
And the UI is still able to render it:
In case it helps: I'm using a simple curl command to upload the raw JSON logs to Elasticsearch (instead of Filebeat):
#!/bin/bash
# Upload each JSON log line as its own document
while IFS= read -r line
do
  curl -X POST "http://localhost:9200/[your index name]/_doc/" -H 'Content-Type: application/json' -d "$line"
done < [your DAG task log file]
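For larger log files, the line-by-line curl can be slow; here is a minimal Python alternative using elasticsearch-py's bulk helper (index name and file path are placeholders, as in the shell version above):

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
with open("attempt=1.log") as f:  # your DAG task log file
    actions = (
        {"_index": "your-index-name", "_source": json.loads(line)}
        for line in f
        if line.strip()
    )
    bulk(es, actions)  # one bulk request instead of one HTTP call per line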
Some important configs in airflow.cfg:
[logging]
remote_logging = True
remote_log_conn_id = elasticsearch_default

[elasticsearch]
# Elasticsearch host
host = [your ES host]
json_format = True
offset_field = log
The raw task log file should look like the following:
{"asctime": "2024-08-20T06:52:42.617+0000", "filename": "local_task_job_runner.py", "lineno": 123, "levelname": "INFO", "message": "::group::Pre task execution logs", "log": 1724136762617922560, "dag_id": "elasticsearch_sql_dag", "task_id": "run_es_query", "execution_date": "2024_08_20T06_52_41_314339", "try_number": "1", "log_id": "elasticsearch_sql_dag-run_es_query-manual__2024-08-20T06:52:41.314339+00:00--1-1"}
If you are using tools like Filebeat, be aware that not all keys in your raw log file will necessarily be attached as fields on the document inside ES. No matter what tool you use to write logs to ES, you need to make sure that a single document in ES looks like this:
{
  "_index": "airflow-dag",
  "_id": "o_ysbpEBEcTOdHp1B7dC",
  "_score": 1,
  "_source": {
    "asctime": "2024-08-20T07:19:54.376+0000",
    "filename": "logging_mixin.py",
    "lineno": 190,
    "levelname": "INFO",
    "message": "row: ['docker-cluster', '.alerts-observability.slo.alerts-default', 'VIEW', 'ALIAS']",
    "log": 1724138394376478200,
    "dag_id": "elasticsearch_sql_dag",
    "task_id": "run_es_query",
    "execution_date": "2024_08_20T07_19_51_268943",
    "try_number": "1",
    "log_id": "elasticsearch_sql_dag-run_es_query-manual__2024-08-20T07:19:51.268943+00:00--1-1"
  }
}
Make sure the two fields log and log_id are present (otherwise it won't work). To sample-check your document in ES, try running the following in Kibana:
GET /[your index name]/_search
{
  "size": 1
}
Or run curl -X GET "http://[your ES host]/[your index name]/_search?size=1" -H 'Content-Type: application/json' to get a sample document back.
Checked more information from devtools:
[2024-08-20T07:31:35.390+0000] {_transport.py:349} INFO - POST https://*/_count [status:200 duration:0.033s]
[2024-08-20T07:31:35.425+0000] {_transport.py:349} INFO - POST https://*/_search [status:200 duration:0.034s]
Hi! I have the same issue with:
Apache Airflow version: 2.9.3
apache-airflow-providers-elasticsearch: 5.4.1
Noticed this issue for a BashOperator task. I'll edit this comment and add more details later today.
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.6.2
What happened?
When opening a task log from the grid menu, a loading animation plays and the logs won't show. The webserver logs show:
webserver [2024-04-30T09:51:42.825+0000] {base.py:274} DEBUG - < {"took":8,"timed_out":false,"_shards":{"total":53,"successful":53,"skipped":51,"failed":0},"hits":{"total":{"value":17,"relation":"eq"},"max_score":null,"hits":[{"_index": .......
Here I can see all the actual logs that are in Elastic, and then the webserver just bombards Elastic with:
webserver [2024-04-30T09:48:34.688+0000] {base.py:270} INFO - POST https://elastic:9200/_all/_count [status:200 request:0.009s]
webserver [2024-04-30T09:48:34.688+0000] {base.py:273} DEBUG - > {"query":{"bool":{"filter":[{"range":{"offset":{"gt":1714470190001012992}}}],"must":[{"match_phrase":{"log_id":"Redis_Populate_Cities_populate_redis_cities_2024_04_30T09_43_06_661556_1"}}]}}}
webserver [2024-04-30T09:48:34.689+0000] {base.py:274} DEBUG - < {"count":0,"_shards":{"total":53,"successful":53,"skipped":0,"failed":0}}
Here is my config from values.yaml regarding elastic:
What you think should happen instead?
No response
How to reproduce
Can be reproduced by clicking a DAG name in the UI, then selecting a task from the grid and opening the task's logs:
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow==2.6.2
apache-airflow-providers-amazon==8.1.0
apache-airflow-providers-celery==3.2.0
apache-airflow-providers-cncf-kubernetes==7.0.0
apache-airflow-providers-common-sql==1.5.1
apache-airflow-providers-docker==3.7.0
apache-airflow-providers-elasticsearch==4.5.0
apache-airflow-providers-ftp==3.4.1
apache-airflow-providers-google==10.1.1
apache-airflow-providers-grpc==3.2.0
apache-airflow-providers-hashicorp==3.4.0
apache-airflow-providers-http==4.4.1
apache-airflow-providers-imap==3.2.1
apache-airflow-providers-microsoft-azure==6.1.1
apache-airflow-providers-microsoft-mssql==3.4.1
apache-airflow-providers-mongo==3.2.1
apache-airflow-providers-mysql==5.1.0
apache-airflow-providers-odbc==3.3.0
apache-airflow-providers-postgres==5.5.0
apache-airflow-providers-redis==3.2.0
apache-airflow-providers-sendgrid==3.2.0
apache-airflow-providers-sftp==4.3.1
apache-airflow-providers-slack==7.3.0
apache-airflow-providers-snowflake==4.1.0
apache-airflow-providers-sqlite==3.4.1
apache-airflow-providers-ssh==3.7.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow is deployed with the official Helm chart, with changes in values.yaml. The Dockerfile is customized with pip installs. Please let me know if any additional info is required.
Anything else?
A couple of caveats:
1) When selecting "See More", the log is shown.
2) The log can also be viewed from Graph - Task - Log. (But here I have another bug: my logs don't contain end_of_log, so it keeps polling Elastic.)
3) When the DAG is in a running state, I can see its log in the grid view.
4) The log view is only stuck pending when there is a log in Elastic (and I can see it with the right log_id); if there is no log in Elastic, a "Log not found in Elasticsearch..." message is shown.
Are you willing to submit PR?
Code of Conduct