delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Infra] [Spark] Reduce delta-spark CI test runtime by 33 mins (1h46m to 1h13m) #3712

Closed scottsand-db closed 1 month ago

scottsand-db commented 2 months ago

Which Delta project/connector is this regarding?

Description

This PR reduces delta-spark CI test runtime by 33 mins. Previously the max shard duration was 1h 46 mins, and now it is 1h 13 mins.

This PR does so as follows:

  1. We add an extra shard
  2. I used https://github.com/delta-io/delta/pull/3694 to collect some metrics about delta-spark test runtime execution.
  3. I specifically identified (a) the 50 slowest test suites and (b) the average suite duration excluding those top 50 (it was 0.71 minutes)
  4. I used this information to update TestParallelization to do smarter test suite assignment. The logic is as follows:
    • For the top 50 slowest test suites, we assign them deterministically: iterating in descending order of duration, we assign each suite to the shard + group (a group is a thread) with the lowest total duration so far.
    • For the remaining test suites that are not in the top 50, we assign each one to a random shard, and within that shard to the group with the lowest total duration so far.
  5. We also update the hash function to be MurmurHash3, which is known to create balanced assignments in scenarios where the input strings (test names) might have similar prefixes or patterns.
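The assignment logic above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual TestParallelization code. The names `assign_suites` and `stable_hash` are hypothetical. It assumes the "random" shard for a non-top-50 suite is derived from a hash of the suite name, that each such suite is charged the average duration (0.71 minutes), and it uses MD5 from the standard library as a stand-in for MurmurHash3.

```python
import hashlib
from typing import Dict, List, Tuple


def stable_hash(name: str) -> int:
    # Stand-in for MurmurHash3 (not in the Python stdlib). Assumption: any
    # stable, well-mixed hash is good enough for this illustration.
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_suites(
    slow_suites: Dict[str, float],   # suite name -> measured minutes (the "top 50")
    other_suites: List[str],         # all remaining suite names
    avg_other_minutes: float,        # average duration of the remaining suites
    num_shards: int,
    groups_per_shard: int,
) -> Tuple[Dict[str, Tuple[int, int]], List[List[float]]]:
    # load[shard][group] is the estimated total minutes assigned so far.
    load = [[0.0] * groups_per_shard for _ in range(num_shards)]
    assignment: Dict[str, Tuple[int, int]] = {}

    # Slow suites: longest first, each to the least-loaded (shard, group).
    # Ties are broken by name and by iteration order, so the result is deterministic.
    for name, minutes in sorted(slow_suites.items(), key=lambda kv: (-kv[1], kv[0])):
        shard, group = min(
            ((s, g) for s in range(num_shards) for g in range(groups_per_shard)),
            key=lambda sg: load[sg[0]][sg[1]],
        )
        load[shard][group] += minutes
        assignment[name] = (shard, group)

    # Remaining suites: shard chosen by hash, then the least-loaded group
    # within that shard, charged at the average duration.
    for name in sorted(other_suites):
        shard = stable_hash(name) % num_shards
        group = min(range(groups_per_shard), key=lambda g: load[shard][g])
        load[shard][group] += avg_other_minutes
        assignment[name] = (shard, group)

    return assignment, load


# Example with made-up suite names and durations.
example_assignment, example_load = assign_suites(
    {"SlowSuiteA": 10.0, "SlowSuiteB": 8.0, "SlowSuiteC": 6.0},
    ["FastSuite1", "FastSuite2", "FastSuite3"],
    avg_other_minutes=0.71,
    num_shards=2,
    groups_per_shard=2,
)
print(example_assignment)
```

Placing the slowest suites first is the usual greedy approach to balancing load: large items are spread out early, and the many small suites then fill in the gaps.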

Note that merely adding another shard and using a better hash function does NOT yield better results. That was attempted here: https://github.com/delta-io/delta/pull/3715.

How was this patch tested?

GitHub CI tests.

https://github.com/delta-io/delta/actions/runs/11004181545?pr=3712


Does this PR introduce any user-facing changes?

No.