apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37.43k stars 14.36k forks source link

Logs from elastic not visible in Grid menu #39323

Open ivanp221 opened 7 months ago

ivanp221 commented 7 months ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.6.2

What happened?

When opening task log from grid menu - loading animation is played and logs won't show. Webserver logs show: webserver [2024-04-30T09:51:42.825+0000] {base.py:274} DEBUG - < {"took":8,"timed_out":false,"_shards":{"total":53,"successful":53,"skipped":51,"failed":0},"hits":{"total":{"value":17,"relation":"eq"},"max_score":null,"hits":[{"_index": ....... Here I can see all the actual logs that are in elastic an then webserver just bombards elastic with: webserver [2024-04-30T09:48:34.688+0000] {base.py:270} INFO - POST https://elastic:9200/_all/_count [status:200 request:0.009s] webserver [2024-04-30T09:48:34.688+0000] {base.py:273} DEBUG - > {"query":{"bool":{"filter":[{"range":{"offset":{"gt":1714470190001012992}}}],"must":[{"match_phrase":{"log_id":"Redis_Populate_Cities_populate_redis_cities_2024_04_30T09_43_06_661556_1"}}]}}} webserver [2024-04-30T09:48:34.689+0000] {base.py:274} DEBUG - < {"count":0,"_shards":{"total":53,"successful":53,"skipped":0,"failed":0}}

Here is my config from values.yaml regarding elastic:

elasticsearch:
  enabled: true
  secretName: airflow-elastic-secret
config:
  elasticsearch:
    end_of_log_mark: end_of_log
    json_fields: asctime, filename, lineno, levelname, message
    json_format: "True"
    log_id_template: "{dag_id}_{task_id}_{execution_date}_{try_number}"
    write_stdout: "True"
  elasticsearch_configs:
    max_retries: 2
    retry_timeout: "True"
    timeout: 10
    use_ssl: "True"
    verify_certs: "False"

What you think should happen instead?

No response

How to reproduce

Can be reproduced by clicking dag name in ui and then from grid selecting task and selecting task's logs: image

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow==2.6.2 apache-airflow-providers-amazon==8.1.0 apache-airflow-providers-celery==3.2.0 apache-airflow-providers-cncf-kubernetes==7.0.0 apache-airflow-providers-common-sql==1.5.1 apache-airflow-providers-docker==3.7.0 apache-airflow-providers-elasticsearch==4.5.0 apache-airflow-providers-ftp==3.4.1 apache-airflow-providers-google==10.1.1 apache-airflow-providers-grpc==3.2.0 apache-airflow-providers-hashicorp==3.4.0 apache-airflow-providers-http==4.4.1 apache-airflow-providers-imap==3.2.1 apache-airflow-providers-microsoft-azure==6.1.1 apache-airflow-providers-microsoft-mssql==3.4.1 apache-airflow-providers-mongo==3.2.1 apache-airflow-providers-mysql==5.1.0 apache-airflow-providers-odbc==3.3.0 apache-airflow-providers-postgres==5.5.0 apache-airflow-providers-redis==3.2.0 apache-airflow-providers-sendgrid==3.2.0 apache-airflow-providers-sftp==4.3.1 apache-airflow-providers-slack==7.3.0 apache-airflow-providers-snowflake==4.1.0 apache-airflow-providers-sqlite==3.4.1 apache-airflow-providers-ssh==3.7.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Airflow is deployed with official helm chart with changes in values.yaml Dockerfile is customized with pip installs Please let me know if any additional info is required

Anything else?

Couple of caveats: 1) When selecting "See More" log is shown. 2) Log also can be viewed from Graph - Task - Log. (But here i have another bug - my logs doesnt contain end_of_log so it keeps polling elastic.) 3) When dag is in running state - i can see its log in grid view. 4) It's only pending when there is a log in elastic (and i can see it with right log_id ) if there is no log in elastic - message will show Log not found in Elasticsearch... etc

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 7 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

Taragolis commented 7 months ago

Is it reproduced in the latest apache-airflow-providers-elasticsearch? And maybe also worthwhile to check on the latest Airflow (2.9.0 at that moment)

ivanp221 commented 7 months ago

will test and report thank you

RNHTTR commented 7 months ago

It's also worth checking your browser's dev tools to see what happens when loading the page.

ivanp221 commented 7 months ago

Sorry for the delay - was having trouble with updating to 2.9.0.

Tested it with latest provider on 2.6.2 apache-airflow-providers-elasticsearch==5.3.4 - same Tested with airflow 2.9.0 with apache-airflow-providers-elasticsearch==5.3.4 - same

Also checked dev-tools it says while pending: Request URL: https://url/api/v1/dags/Redis_Populate_Cities/dagRuns/manual__2024-05-03T15:35:16.283558+00:00/taskInstances/populate_redis_cities/logs/1?full_content=false Referrer Policy: strict-origin-when-cross-origin

It says when there's no log: Request URL: https://url/api/v1/dags/Redis_Populate_Cities/dagRuns/manual__2024-04-04T07:01:51.667463+00:00/taskInstances/populate_redis_cities/logs/1?full_content=false Request Method: GET Status Code: 200 OK Remote Address: address:443 Referrer Policy: strict-origin-when-cross-origin

sjnov11 commented 5 months ago

Same happens to me. i am using images 2.6.3-python3.11

Grid menu logs call this API:

pod logs:

[2024-06-11T01:31:13.259+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:13.774+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.220+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.289+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.736+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:14.805+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
[2024-06-11T01:31:15.252+0900] {base.py:270} INFO - POST {host:port}/_all/_count [status:200 request:0.003s]
....

[2024-06-11 01:31:22 +0900] [109] [CRITICAL] WORKER TIMEOUT (pid:161)
[2024-06-11 01:31:22 +0900] [161] [INFO] Worker exiting (pid: 161)
[2024-06-11 01:31:23 +0900] [109] [WARNING] Worker with pid 161 was terminated due to signal 9
[2024-06-11 01:31:23 +0900] [168] [INFO] Booting worker with pid: 168
ryandutton commented 5 months ago

I'm seeing the same issue with the following versions:

Apache Airflow version | 2.8.4

Providers info apache-airflow-providers-elasticsearch | 5.3.3

It only appears to be happening on a single DAG and task run, not all of them.

ryandutton commented 5 months ago

It looks like there's an indefinite loop when attempting to get the logs from Elasticsearch happening when reading the log stream https://github.com/apache/airflow/blob/2.8.4/airflow/utils/log/log_reader.py#L86-L99

This along with the retry every second is causing the instance to eventually crash and restart. I'm not sure what the cause of the logs not not being polled is but this is having a significant impact on the health of Airflow.

RNHTTR commented 5 months ago

@ryandutton @sjnov11 @ivanp221 is it possible the logs are too large for the webserver to render, resulting in the webserver getting OOMKilled?

ivanp221 commented 5 months ago

Im not having it OOM killed just many webserver logs as is sends requests to elastic. Logs like in my original post

ryandutton commented 5 months ago

I don't believe so, the logs as viewed in Kibana look consistent with every other run we have were the logs are loading normally. Also, it doesn't appear to be OOM killed, there is an increase in memory at the time this is happening but not anything significant

eladkal commented 5 months ago

Checking again if anyone of the reporters can reproduce the issue with Airflow 2.9 or (better) main branch?

ivanp221 commented 5 months ago

I was wondering maybe its somehow connected to the fact that i dont have "end of log" line in my logs?

YaAnuar commented 3 months ago

I solved this issue by adding new message into elasticsearch: message: "end_of_log"

curl -X POST "http://elasticsearch-logs:9200/airflow-logs/_doc/" -H "Content-Type: application/json" -d '{ "log_id": "for_test_start_parse_2024_08_04T06_15_00_000000_1", "message": "end_of_log", "offset": 1 }'

Mant1ss commented 3 months ago

airflow 2.9.0 apache-airflow-providers-elasticsearch 5.3.3 same problem. There is no "end_of_log" in elastic. /log? page show log and infinite reload (can't get end_of_log) dags/{dag_id}/grid? show failed status for task intance. After some time webserver OOM error

eladkal commented 3 months ago

cc @Owen-CH-Leung did you encounter this problem with your latest work with elastic provider?

Owen-CH-Leung commented 3 months ago

@eladkal No. the Logs tab are rendering fine. On main branch. I've enabled airflow remote logging with ES & I'm able to render logs from ES:

image

Here are some log outputs from webserver:

[2024-08-20T06:55:34.714+0000] {_transport.py:349} INFO - POST http://es01:9200/_all/_count [status:200 duration:0.008s]
[2024-08-20T06:55:34.719+0000] {_transport.py:349} INFO - POST http://es01:9200/_all/_search [status:200 duration:0.004s]
192.168.65.1 - - [20/Aug/2024:06:55:34 +0000] "GET /api/v1/dags/elasticsearch_sql_dag/dagRuns/manual__2024-08-20T06:52:41.314339+00:00/taskInstances/run_es_query/logs/1?full_content=false HTTP/1.1" 200 4887 "http://localhost:28080/dags/elasticsearch_sql_dag/grid?dag_run_id=manual__2024-08-20T06%3A52%3A41.314339%2B00%3A00&task_id=run_es_query&tab=logs" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
Mant1ss commented 3 months ago

@Owen-CH-Leung can you try long running task with 5k+ log lines?

Owen-CH-Leung commented 3 months ago

Also no problem. I wrote a simple DAG that executes a SHOW TABLE query :

def query_elasticsearch():
    es = ElasticsearchSQLHook(elasticsearch_conn_id='elasticsearch_default')

    es_connection = es.get_conn()
    response = es_connection.execute_sql("SHOW TABLES")
    for i in range(5000):
        for row in response["rows"]:
            print(f"row: {row}")

This results in a log file which consists of around 35k log lines:

airflow git:(main) ✗ cat logs/dag_id=elasticsearch_sql_dag/run_id=manual__2024-08-20T07:19:51.268943+00:00/task_id=run_es_query/attempt=1.log | wc -l
   35028

And the UI is still able to render it :

image

Owen-CH-Leung commented 3 months ago

In case it may help: I'm using simple curl command to upload the raw json logs to Elasticsearch (instead of filebeat) :

#/bin/bash

while IFS= read -r line
do
  curl -X POST "http://localhost:9200/[your index name]/_doc/" -H 'Content-Type: application/json' -d "$line"
done < [your DAG task log file]

Some important configs in airflow.cfg :

remote_logging = True
remote_log_conn_id = elasticsearch_default

[elasticsearch] 
# Elasticsearch host  
host = [your ES host]
json_format = True
offset_field = log

the raw task log file should look like the following:

{"asctime": "2024-08-20T06:52:42.617+0000", "filename": "local_task_job_runner.py", "lineno": 123, "levelname": "INFO", "message": "::group::Pre task execution logs", "log": 1724136762617922560, "dag_id": "elasticsearch_sql_dag", "task_id": "run_es_query", "execution_date": "2024_08_20T06_52_41_314339", "try_number": "1", "log_id": "elasticsearch_sql_dag-run_es_query-manual__2024-08-20T06:52:41.314339+00:00--1-1"}

If you are using tools like filebeat, be aware that not all keys in your raw log file will be attached as a field in a document inside ES. No matter what tool you use to write logs to ES, you need to make sure that a single document in ES looks like this:

{
        "_index": "airflow-dag",
        "_id": "o_ysbpEBEcTOdHp1B7dC",
        "_score": 1,
        "_source": {
          "asctime": "2024-08-20T07:19:54.376+0000",
          "filename": "logging_mixin.py",
          "lineno": 190,
          "levelname": "INFO",
          "message": "row: ['docker-cluster', '.alerts-observability.slo.alerts-default', 'VIEW', 'ALIAS']",
          "log": 1724138394376478200,
          "dag_id": "elasticsearch_sql_dag",
          "task_id": "run_es_query",
          "execution_date": "2024_08_20T07_19_51_268943",
          "try_number": "1",
          "log_id": "elasticsearch_sql_dag-run_es_query-manual__2024-08-20T07:19:51.268943+00:00--1-1"
        }
      }

Make sure the 2 fields log and log_id are present (otherwise it won't work). To sample check your document in ES, try to run the following in Kibana:

GET /[your index name]/_search
{
  "size": 1
}

Or curl -X GET "http://[your ES host]/[your index name[/_search?size=1" -H 'Content-Type: application/json' to get a sample document back

Mant1ss commented 3 months ago

Checked more information from devtools:

Mant1ss commented 3 months ago

same issue https://github.com/apache/airflow/issues/28326

mrybas commented 1 month ago

Hi! I have the same issue with

Apache Airflow
version                | 2.9.3
Providers info
apache-airflow-providers-elasticsearch   | 5.4.1
rawwar commented 1 month ago

Noticed this issue for a BashOperator task. I'll edit this comment and more details later today.