delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.51k stars 1.69k forks

[Feature Request] Improve unit test execution time #1707

Open felipepessoto opened 1 year ago

felipepessoto commented 1 year ago

Feature request

Overview

The unit tests take longer with every new version. For reference, the build in this PR from a year ago took 61-77 minutes: https://github.com/delta-io/delta/pull/887


Motivation

I think we need to improve this before it gets out of control.

Further details

Parallel tests are disabled:

    // Don't execute in parallel since we can't have multiple Sparks in the same JVM
    Test / parallelExecution := false,

Do we have any alternatives?

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

felipepessoto commented 1 year ago

Since I opened the issue, the test run has gone from ~230 minutes to more than 360 minutes (the runs are now timing out).

felipepessoto commented 1 year ago

@edmondop, @tdas, @scottsand-db, @allisonport-db (from #1249)

We could add tags (Group1, Group2, ..., Group10) to the unit tests and change run-tests.py to accept a testOnly argument and a runPythonTests one.

It would work independently of the infra being used. You could start several VMs/agents, each one calling run-tests.py with different tags. What do you think? I can send a PR, but I would like to confirm that somebody will be able to review it.
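To make the idea concrete, here is a minimal sketch of how the script could turn a group name into an sbt invocation. It is only an illustration: the `--group` flag, the helper names, and the `build/sbt` launcher path are assumptions, not existing run-tests.py options. The only parts relied on as given are sbt's `testOnly` task and ScalaTest's `-n` (include tag) runner argument.

```python
import argparse

# Assumed package for the tag classes; adjust to wherever the tags live.
TAG_PACKAGE = "org.apache.spark.sql.delta.testtags"


def build_sbt_command(group=None):
    """Return the sbt invocation for one agent.

    With no group, run the whole suite. With a group, pass ScalaTest's
    -n (include tag) argument through sbt's testOnly task.
    """
    if group is None:
        return ["build/sbt", "test"]
    return ["build/sbt", f"testOnly * -- -n {TAG_PACKAGE}.{group}"]


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Run one slice of the unit tests")
    # --group is a hypothetical flag name, not an existing run-tests.py option.
    parser.add_argument("--group", default=None, help="tag name, e.g. Group1")
    return parser.parse_args(argv)


def main(argv):
    args = parse_args(argv)
    return build_sbt_command(args.group)
```

Each VM/agent would then call the script with a different `--group` value, and a run without the flag would behave as it does today.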

scottsand-db commented 1 year ago

Hi @felipepessoto

> We could add tags (Group1, Group2....Group 10) to unit tests

How would this work? Is this a manual process? Would we have to enforce this on all existing code and all new PRs?

felipepessoto commented 1 year ago

It is manual. My suggestion is to add dedicated groups for the big tests, like Merge and CDC, and split the remaining tests into Group1, Group2, and so on. How to run them is up to the pipeline. Typically, we would start a new VM for each group, plus one for the tests without groups and one for Java (for some reason the filter for tests without groups doesn't work):

* -- -n org.apache.spark.sql.delta.testtags.DeltaTestsMergeTag
* -- -n org.apache.spark.sql.delta.testtags.DeltaTestsCDCTag
* -- -l org.apache.spark.sql.delta.testtags.DeltaTestsMergeTag -l org.apache.spark.sql.delta.testtags.DeltaTestsCDCTag
io.delta.sql.JavaDeltaSparkSessionExtensionSuite io.delta.tables.JavaDeltaTableBuilderSuite...

New tests that are not tagged would land in the "other" category.
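The per-VM filters above can be derived from the list of dedicated tags, which is also why an untagged test needs no extra work: it falls through to the exclusion filter. A rough sketch, assuming the leading `*` in those filters is sbt's `testOnly` wildcard; `shard_filters` is a hypothetical helper name, while `-n` and `-l` are ScalaTest's include-tag and exclude-tag arguments.

```python
def shard_filters(tags):
    """Build one testOnly filter per dedicated tag, plus a catch-all.

    The catch-all excludes every dedicated tag, so any test without a tag
    (including newly added ones) runs there.
    """
    # Key each dedicated shard by the tag's simple class name.
    filters = {tag.rsplit(".", 1)[-1]: f"* -- -n {tag}" for tag in tags}
    filters["other"] = "* -- " + " ".join(f"-l {tag}" for tag in tags)
    return filters


TAGS = [
    "org.apache.spark.sql.delta.testtags.DeltaTestsMergeTag",
    "org.apache.spark.sql.delta.testtags.DeltaTestsCDCTag",
]

for name, test_filter in shard_filters(TAGS).items():
    print(f"{name}: testOnly {test_filter}")
```

Adding another dedicated tag to `TAGS` creates a new shard and automatically excludes it from "other". The Java suites listed by name would remain a separate invocation.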