apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Improvement] Improve the performance of ranger access requests #6754

Closed (wankunde closed this issue 1 month ago)

wankunde commented 1 month ago

What would you like to be improved?

Currently, RuleAuthorization collects access requests in an ArrayBuffer. This is very slow for large plans because each new PrivilegeObject has to be compared with every access request already collected, so the total cost grows quadratically with the number of objects.

How should we improve?

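We can use a HashMap instead, so that checking whether a request has already been collected is a hash lookup rather than a scan over all collected requests.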
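To illustrate the idea, here is a minimal sketch of the two deduplication strategies. It is not the actual Kyuubi code: `Request`, `DedupSketch`, and both method names are hypothetical stand-ins, and a `LinkedHashMap` is assumed here only so that insertion order is preserved.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an access request; NOT Kyuubi's actual class.
final case class Request(database: String, table: String, accessType: String)

object DedupSketch {
  // Linear-scan deduplication: every insert is checked against all elements
  // collected so far, so n inserts cost O(n^2) comparisons in the worst case.
  def dedupWithBuffer(requests: Seq[Request]): List[Request] = {
    val buffer = mutable.ArrayBuffer.empty[Request]
    requests.foreach { r =>
      if (!buffer.contains(r)) buffer += r
    }
    buffer.toList
  }

  // Hash-based deduplication: the membership check is an average O(1) lookup.
  // The key tuple is an assumption about what identifies a request.
  def dedupWithMap(requests: Seq[Request]): List[Request] = {
    val seen = mutable.LinkedHashMap.empty[(String, String, String), Request]
    requests.foreach { r =>
      seen.getOrElseUpdate((r.database, r.table, r.accessType), r)
    }
    seen.values.toList
  }

  def main(args: Array[String]): Unit = {
    // 20,000 requests over 5,000 distinct tables, so most are duplicates.
    val requests = (0 until 20000).map(i => Request("db", s"t${i % 5000}", "select"))
    println(s"buffer: ${dedupWithBuffer(requests).size} unique requests")
    println(s"map:    ${dedupWithMap(requests).size} unique requests")
  }
}
```

Both methods return the same requests in the same order; only the cost of the duplicate check differs.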

github-actions[bot] commented 1 month ago

Hello @wankunde, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.

wankunde commented 1 month ago

Tested locally with 50,000 files:

test("KYUUBI #6754: improve the performance of ranger access requests") {
  // Local directory that holds the generated parquet files.
  val outputPath = "/private/var/folders/tr/scn8dgl13_l6_sh17bghtln1b35kn1/T/kyuubi-test-5492934124608743789/"
  println(s"output path: $outputPath")

  // Stub the Ranger verification call; the intent is to time request collection only.
  val plugin = mock[SparkRangerAdminPlugin.type]
  when(plugin.verify(Seq(any[RangerAccessRequest]), any[SparkRangerAuditHandler]))
    .thenAnswer(_ => ())

  val df = spark.read.parquet(outputPath + "/*/*.parquet")
  val plan = df.queryExecution.optimizedPlan
  // Measure only the privilege check on the optimized plan.
  val start = System.currentTimeMillis()
  RuleAuthorization(spark).checkPrivileges(spark, plan)
  val end = System.currentTimeMillis()
  println(s"Time elapsed : ${end - start} ms")
}

[Images: "Before" and "After"]