[Improvement] Improve the performance of ranger access requests

wankunde commented 1 month ago

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Search before asking

[X] I have searched in the issues and found no similar issues.

What would you like to be improved?

Currently, RuleAuthorization collects access requests in an ArrayBuffer. This is very slow because each new PrivilegeObject has to be compared with every access request already collected, so the total number of comparisons grows quadratically with the number of objects.

How should we improve?

We can use a HashMap so that checking for an existing access request becomes a hash lookup instead of a scan over the whole buffer.
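To illustrate the idea, here is a minimal sketch of the two collection strategies. It is not the actual Kyuubi code: `Request`, `DedupSketch`, `collectWithBuffer`, `collectWithMap`, and the `(resource, accessType)` key are all hypothetical names chosen for the example. It also uses a `LinkedHashMap` rather than a plain `HashMap`, purely so that the sketch keeps insertion order.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an access request; not Kyuubi's real type.
final case class Request(resource: String, accessType: String)

object DedupSketch {
  // Linear-scan approach: each candidate is compared with every request
  // collected so far, so n candidates cost O(n^2) comparisons overall.
  def collectWithBuffer(incoming: Seq[Request]): Seq[Request] = {
    val buf = mutable.ArrayBuffer.empty[Request]
    incoming.foreach { r =>
      val alreadyPresent =
        buf.exists(o => o.resource == r.resource && o.accessType == r.accessType)
      if (!alreadyPresent) buf += r
    }
    buf.toSeq
  }

  // Hash-based approach: the duplicate check is a hash lookup on an
  // assumed (resource, accessType) key. LinkedHashMap preserves the
  // order in which keys were first inserted.
  def collectWithMap(incoming: Seq[Request]): Seq[Request] = {
    val byKey = mutable.LinkedHashMap.empty[(String, String), Request]
    incoming.foreach { r =>
      byKey.getOrElseUpdate((r.resource, r.accessType), r)
    }
    byKey.values.toSeq
  }
}
```

Both functions should return the same de-duplicated requests in the same order; only the cost of the duplicate check differs.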

Are you willing to submit PR?

[X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
[ ] No. I cannot submit a PR at this time.

github-actions[bot] commented 1 month ago

Hello @wankunde, thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.

wankunde commented 1 month ago

Test with 50,000 local files:

test("KYUUBI #6754: improve the performance of ranger access requests") {
  // Local directory that holds the test Parquet files.
  val outputPath = "/private/var/folders/tr/scn8dgl13_l6_sh17bghtln1b35kn1/T/kyuubi-test-5492934124608743789/"
  println(s"output path: $outputPath")

  // Stub out the Ranger verification call.
  val plugin = mock[SparkRangerAdminPlugin.type]
  when(plugin.verify(Seq(any[RangerAccessRequest]), any[SparkRangerAuditHandler]))
    .thenAnswer(_ => ())

  val df = spark.read.parquet(outputPath + "/*/*.parquet")
  val plan = df.queryExecution.optimizedPlan
  // Time only the privilege check on the optimized plan.
  val start = System.currentTimeMillis()
  RuleAuthorization(spark).checkPrivileges(spark, plan)
  val end = System.currentTimeMillis()
  println(s"Time elapsed: ${end - start} ms")
}

[Images: results before and after the change]
