apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Apache License 2.0

[FEATURE] Impl Spark DSv2 YARN Connector that supports reading YARN aggregation logs #6832

Open pan3793 opened 5 hours ago

pan3793 commented 5 hours ago

Describe the feature

Leverage the Spark DSv2 API to implement a connector that provides a SQL interface to the YARN aggregated logs, and possibly other YARN resources in the future.


In large-scale Spark on YARN deployments, tens or even hundreds of thousands of Spark applications are submitted to a cluster per day, and YARN collects and aggregates the application logs and stores them on HDFS. Sometimes we want to analyze these logs to identify cluster-level issues; for example, a machine with hardware problems may frequently produce disk or network exceptions. It is straightforward to leverage Spark to analyze those logs in parallel.

Describe the solution

The usage might look like this:

$ spark-sql --conf spark.sql.catalog.yarn=org.apache.kyuubi.spark.connector.yarn.YarnCatalog
> SELECT
    app_id, app_attempt_id,
    app_start_time, app_end_time,
    container_id, host,
    file_name, line_num, message
  FROM yarn.agg_logs
  WHERE app_id = 'application_1234'
    AND container_id = 'container_12345'
    AND host = 'hadoop123.example.com';
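
For the cluster-level analysis described above, a query along these lines could rank hosts by how many error lines they produce. This is only a sketch: it assumes the `yarn.agg_logs` table and columns proposed in this issue, and the `LIKE` patterns are illustrative guesses at what disk or network failures look like in the logs.

```sql
-- Hypothetical query against the proposed yarn.agg_logs table.
-- The message patterns below are assumptions, not a fixed list of errors.
SELECT
    host,
    count(*) AS error_lines,
    count(DISTINCT app_id) AS affected_apps
  FROM yarn.agg_logs
  WHERE message LIKE '%IOException%'
     OR message LIKE '%No space left on device%'
  GROUP BY host
  ORDER BY error_lines DESC
  LIMIT 20;
```

Hosts that appear near the top across many distinct applications would be candidates for a hardware check.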

Additional context

No response

naive-zhang commented 4 hours ago

@pan3793 I'd like to try implementing this. Please assign it to me, thanks!