An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
I specifically identified (a) the 50 slowest test suites and (b) the average suite duration excluding those top 50 (it was 0.71 minutes)
I used this information to update TestParallelization to do smarter test suite assignment. The logic is as follows:
For the top 50 slowest test suites, we assign them deterministically by, in sorted descending order, assigning the suites to the shard + group (group means thread) with the lowest duration so far.
For the remaining tests that are not in the top 50, we assign them to a random shard, and within that shard we assign it to the group with the lowest duration so far, too
We also update the hash function used to me MurmurHash3 which is known to create balanced assignments in scenarios where the input strings (test names) might have similar prefixes or patterns
Note that purely adding another shard and using a better hash function does NOT yield any better results. That was attempted here: https://github.com/delta-io/delta/pull/3715.
Which Delta project/connector is this regarding?
Description
This PR reduces delta-spark CI test runtime by 33 mins. Previously the max shard duration was 1h 46 mins, and now it is 1h 13 mins.
This PR does so by the following
TestParallelization
to do smarter test suite assignment. The logic is as follows:Note that purely adding another shard and using a better hash function does NOT yield any better results. That was attempted here: https://github.com/delta-io/delta/pull/3715.
How was this patch tested?
GitHub Ci tests.
https://github.com/delta-io/delta/actions/runs/11004181545?pr=3712
Does this PR introduce any user-facing changes?
No.