Closed marcantony closed 3 months ago
Thanks @mentekid! By the way, any idea how soon a version with this can get published? We're trying to support some users with huge numbers of checks (hundreds to thousands) so I'm hoping to incorporate this in our application soon.
We don't have a set cadence but I think we can kick off a release some time this week for this and some other changes that have accumulated since our last one.
Hey @mentekid, just wanted to ask about the release again. Could we maybe get something out early next week?
Hey - what version of spark are you interested in? I think I can kick off the release for 3.5 today, the rest take time as they need separate branches and testing.
Ah, we're actually using the 3.4 branch. No rush to kick things off today anyway because we're not quite ready on our end to fully take advantage of the performance fix yet.
Hey @mentekid, I just realized that `withColumns` was added in Spark 3.3.0, so this is going to cause problems for the lower version builds. I'll make a PR today to replace it with a `select` instead.
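For anyone following along, here is a rough sketch of what a `select`-based replacement could look like. It is not the code from the follow-up PR; `addViaSelect` and the column names are hypothetical and only illustrate the idea. One caveat: unlike `withColumns`, this simple form does not replace an existing column that has the same name, so it assumes the new names do not collide with existing ones.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length}

object SelectSketch {
  // Hypothetical helper: add every (name, column) pair in a single select,
  // which also works on Spark versions that predate withColumns (3.3.0).
  // Assumes none of the new names already exist in df.
  def addViaSelect(df: DataFrame, cols: Map[String, Column]): DataFrame = {
    val aliased: Seq[Column] = cols.toSeq.map { case (name, c) => c.as(name) }
    // Keep all existing columns ("*") and append the new ones in one projection.
    df.select(col("*") +: aliased: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("select-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "bb").toDF("item")
    val extra = Map("item_len" -> length(col("item")))

    addViaSelect(df, extra).show()
    spark.stop()
  }
}
```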
Fixes #576
Iteratively calling `withColumn` (singular) on a DataFrame causes performance issues when iterating over a large sequence of columns. (See issue for more details.) The code was iterating to add each `(name, column)` pair in a map to the DataFrame, so we can use `withColumns`, which takes a `Map[String, Column]` as its parameter, as a drop-in replacement.

After running the performance test in the bug ticket, the performance is much better:
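To make the change concrete, here is a minimal sketch contrasting the two approaches. This is not the actual Deequ code; `addIteratively`, `addAtOnce`, and the column names are hypothetical stand-ins. The Spark documentation notes that each `withColumn` call introduces a projection internally, so a loop over many columns can generate a very large plan, whereas `withColumns` (Spark 3.3.0+) adds the whole map in one call.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length}

object WithColumnsSketch {
  // Hypothetical "before": one withColumn call per map entry.
  // Each call adds another projection to the logical plan.
  def addIteratively(df: DataFrame, cols: Map[String, Column]): DataFrame =
    cols.foldLeft(df) { case (acc, (name, c)) => acc.withColumn(name, c) }

  // Hypothetical "after": a single withColumns call (requires Spark 3.3.0+).
  def addAtOnce(df: DataFrame, cols: Map[String, Column]): DataFrame =
    df.withColumns(cols)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("withColumns-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "bb").toDF("item")
    val extra = Map(
      "item_len"  -> length(col("item")),
      "item_copy" -> col("item")
    )

    // Both produce the same rows; they differ in how the plan is built.
    addIteratively(df, extra).show()
    addAtOnce(df, extra).show()
    spark.stop()
  }
}
```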
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.