awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Fix performance of building row-level results #577

Closed: marcantony closed 3 months ago

marcantony commented 3 months ago

Fixes #576

Iteratively calling withColumn (singular) on a DataFrame causes performance issues when the sequence of columns is large (see #576 for details). The code was adding each (name, column) pair from a map to the DataFrame one at a time, so we can use withColumns, which takes a Map[String, Column] as its parameter, as a drop-in replacement that adds them all in a single call.
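For readers who have not seen the two patterns side by side, here is a minimal sketch of the difference. It is not the actual Deequ code: the object and method names (RowLevelColumns, addOneByOne, addAllAtOnce) are made up for illustration, and it assumes Spark 3.3.0 or later on the classpath.

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical helper, for illustration only (not Deequ's implementation).
object RowLevelColumns {

  // Old pattern: one withColumn call per entry. Each call introduces a new
  // projection, so the plan grows with the number of columns.
  def addOneByOne(df: DataFrame, cols: Map[String, Column]): DataFrame =
    cols.foldLeft(df) { case (acc, (name, column)) =>
      acc.withColumn(name, column)
    }

  // New pattern: a single withColumns call adds every entry of the map
  // in one projection. Requires Spark 3.3.0+.
  def addAllAtOnce(df: DataFrame, cols: Map[String, Column]): DataFrame =
    df.withColumns(cols)
}
```

Both methods should yield the same set of columns; only the shape of the resulting query plan differs.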

After running the performance test in the bug ticket, the performance is much better:

Gathering row-level results
51 columns in row level results
Duration was 67ms

Gathering row-level results
101 columns in row level results
Duration was 53ms

Gathering row-level results
151 columns in row level results
Duration was 49ms

Gathering row-level results
201 columns in row level results
Duration was 45ms

Gathering row-level results
251 columns in row level results
Duration was 48ms

Gathering row-level results
301 columns in row level results
Duration was 48ms

Gathering row-level results
351 columns in row level results
Duration was 41ms

Gathering row-level results
401 columns in row level results
Duration was 41ms

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

marcantony commented 3 months ago

Thanks @mentekid! By the way, any idea how soon a version with this can get published? We're trying to support some users with huge numbers of checks (hundreds to thousands) so I'm hoping to incorporate this in our application soon.

mentekid commented 2 months ago

We don't have a set cadence but I think we can kick off a release some time this week for this and some other changes that have accumulated since our last one.

marcantony commented 2 months ago

Hey @mentekid, just wanted to ask about the release again. Could we maybe get something out early next week?

mentekid commented 2 months ago

Hey - what version of spark are you interested in? I think I can kick off the release for 3.5 today, the rest take time as they need separate branches and testing.

marcantony commented 2 months ago

Ah, we're actually using the 3.4 branch. No rush to kick things off today anyway because we're not quite ready on our end to fully take advantage of the performance fix yet.

marcantony commented 2 months ago

Hey @mentekid, I just realized that withColumns was added in Spark 3.3.0 so this is going to cause problems for the lower version builds. I'll make a PR today to replace it with a select instead.
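A rough sketch of what a select-based replacement could look like on Spark versions that lack withColumns. The follow-up PR is not shown in this thread, so the object and method names here (RowLevelColumnsCompat, addAllWithSelect) are hypothetical. It assumes none of the new names collide with an existing column: withColumns would replace a same-named column, whereas this select would produce a duplicate.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper, for illustration only (not the code from the follow-up PR).
object RowLevelColumnsCompat {

  // Keeps all existing columns ("*") and appends each map entry as an
  // aliased column, all within a single select (one projection).
  // Assumption: no key in `cols` matches an existing column name.
  def addAllWithSelect(df: DataFrame, cols: Map[String, Column]): DataFrame = {
    val added: Seq[Column] = cols.map { case (name, column) => column.as(name) }.toSeq
    df.select(col("*") +: added: _*)
  }
}
```

Like withColumns, this builds one projection rather than one per column, so it should avoid the plan growth described in the PR while only using API that predates Spark 3.3.0.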