
Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

perf: Cache jstrings during metrics collection #1029

Closed mbutrovich closed 3 weeks ago

mbutrovich commented 1 month ago

Which issue does this PR close?

Partially addresses #1024.

Rationale for this change

Comet uses JNI jstrings as keys when updating metric values on the Spark side during execution. As described in #1024, Comet currently allocates a new jstring for every metric on every metrics-update invocation. The calls to jni_NewStringUTF accounted for over 1% of on-CPU time in my TPC-H SF10 runs.

What changes are included in this PR?

Added a HashMap that maps each native metric-name string to a jstring for use in JNI calls. The mapping is many-to-one, so multiple nodes that share a metric name benefit from the same cached jstring. The cache is populated on demand: if an entry isn't present, we allocate a jstring and insert it into the cache.
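The on-demand, many-to-one cache described above can be sketched in plain Rust. This is a hypothetical illustration, not the PR's actual code: `CachedJString` and the allocation counter stand in for a real `jni::objects::GlobalRef` around a `jstring`, so the example runs without a JVM.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a JNI global string reference. In the real
/// implementation this would wrap a `jstring` via a `GlobalRef`; here it
/// is just an owned String so the example is self-contained.
struct CachedJString(String);

struct MetricNameCache {
    cache: HashMap<String, CachedJString>,
    allocations: usize, // counts how many "jstrings" were created
}

impl MetricNameCache {
    fn new() -> Self {
        Self { cache: HashMap::new(), allocations: 0 }
    }

    /// Return the cached handle for `name`, allocating it on first use.
    /// Because the map is keyed by the native string, multiple plan nodes
    /// that share a metric name (e.g. "elapsed_compute") hit the same entry.
    fn get_or_create(&mut self, name: &str) -> &CachedJString {
        if !self.cache.contains_key(name) {
            // In Comet this is where the jstring would be allocated,
            // exactly once per distinct metric name.
            self.allocations += 1;
            self.cache
                .insert(name.to_string(), CachedJString(name.to_string()));
        }
        &self.cache[name]
    }
}

fn main() {
    let mut cache = MetricNameCache::new();
    // Two rounds of metric updates over two nodes sharing metric names:
    // only one allocation happens per distinct name.
    for _ in 0..2 {
        for name in ["output_rows", "elapsed_compute"] {
            let _ = cache.get_or_create(name);
        }
    }
    println!("allocations = {}", cache.allocations);
}
```

The key property is that repeated metric updates amortize to zero allocations once the cache is warm, which is what removes the jni_NewStringUTF time from the profile.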

I have some thoughts about this approach that I would love for reviewers to comment on:

  1. I wanted to populate the cache in advance, perhaps with a plan traversal when the plan is generated in ExecutePlan. However, DataFusion's metrics are Options and don't appear to be populated until the plan starts executing.
  2. What is the thread safety of this approach? It's unclear to me whether multiple threads could share this call stack and try to write new values into the cache at the same time. I could wrap the HashMap in a lock at some performance cost, but I'd like to understand whether this is even possible.
  3. I think I understood the jni crate's docs with respect to GlobalRef, but a sanity check on whether this approach could hold references longer than we want (and leak) would be helpful.
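For point 2, the locking fallback could look like the sketch below. This is a hypothetical illustration (not what the PR implements): plain Strings stand in for JNI global references, and a Mutex guards the map so concurrent lookups each allocate at most once per distinct name.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical lock-guarded cache, the fallback discussed in point 2.
/// Values are plain Strings standing in for JNI global references.
struct SharedNameCache {
    inner: Mutex<HashMap<String, String>>,
}

impl SharedNameCache {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    /// Each lookup holds the lock briefly and returns an owned clone,
    /// so no reference into the map outlives the guard.
    fn get_or_create(&self, name: &str) -> String {
        let mut map = self.inner.lock().unwrap();
        map.entry(name.to_string())
            .or_insert_with(|| name.to_string()) // "allocate" once
            .clone()
    }
}

fn main() {
    let cache = Arc::new(SharedNameCache::new());
    // Several threads race on the same metric name; the entry API
    // guarantees only one insertion happens.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&cache);
            thread::spawn(move || c.get_or_create("output_rows"))
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), "output_rows");
    }
    assert_eq!(cache.inner.lock().unwrap().len(), 1);
}
```

The trade-off is exactly the one raised above: every metric update pays a lock acquisition, which is only worth it if multiple threads can actually reach the cache concurrently.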

How are these changes tested?

Existing tests on the Java side that exercise metrics.

mbutrovich commented 1 month ago

I will update with some benchmark results tomorrow, but initial results look promising.

andygrove commented 4 weeks ago

I ran some benchmarks locally and confirmed a speedup:

[chart: tpch_allqueries]

[chart: tpch_queries_speedup]

andygrove commented 4 weeks ago

The speedup on q4 is pretty impressive!

Here are the raw JSON benchmark result files:

baseline.json

pr1029.json

mbutrovich commented 4 weeks ago

Thanks for running the benchmarks for me. I was struggling to get reproducible results locally.

andygrove commented 4 weeks ago

~I wonder why there is such a large regression with q72 though~

edit: posted the wrong pngs from the wrong benchmark - updated now

andygrove commented 4 weeks ago
>   3. I think I understood the jni crate's docs with respect to GlobalRef, but a sanity check on if this approach could hold references longer than we want (and leak) would be helpful.

This seems correct to me

andygrove commented 4 weeks ago

I had earlier posted fresh benchmarks that showed a big improvement with the latest commit, but I had inadvertently enabled the new replaceSortMergeJoin feature. I ran again without it and see essentially the same results as the original run (367 seconds versus the earlier 365, which is likely just noise).

andygrove commented 4 weeks ago
>   2. What is the thread safety of this approach? It's unclear to me if multiple threads could be sharing this call stack and trying to write new values into the cache at the same time. I could wrap the HashMap in a latch in exchange for a performance hit, but would like to understand if this is even possible.

Spark has a single thread calling CometExecIterator, which in turn calls createPlan, executePlan, and releasePlan, so I think the current approach is safe.