capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0

[Discussion] Deprecate the native Spark implementation in favour of Fugue or Pandas on Spark #274

Closed: fdosani closed this 3 months ago

fdosani commented 3 months ago

@jdawang @ak-gupta @NikhilJArora Want your opinions on the above.

Right now the native Spark implementation is a bit different from the Pandas, Polars, and Fugue versions.

I'm thinking we can "deprecate" it while leaving it around for backwards compatibility for the next little while. If folks want to continue to use it, they can explicitly import it. Maybe we can rename it `LegacySparkCompare` or something similar. My main goal is to consolidate and clean up the package.
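A rough sketch of what the rename-and-warn approach could look like, just to make the idea concrete. Everything here is an assumption for illustration: the constructor arguments, the warning text, and the class body are hypothetical and not the actual datacompy code. The only point is that the class stays usable but emits a `DeprecationWarning` when instantiated.

```python
import warnings


class LegacySparkCompare:
    """Hypothetical stand-in for the renamed native Spark comparison class."""

    def __init__(self, base_df, compare_df, join_columns):
        # Warn on construction so existing code keeps working but users
        # are nudged toward the other implementations.
        warnings.warn(
            "LegacySparkCompare is deprecated and may be removed in a future "
            "release; consider the Pandas on Spark or Fugue implementation.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        # Assumed attributes, kept only so the sketch is self-contained.
        self.base_df = base_df
        self.compare_df = compare_df
        self.join_columns = join_columns
```

With something like this, anyone who still wants the old behaviour has to import the legacy name explicitly and sees the warning at the call site, which should keep backwards compatibility while making the deprecation visible.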

We have a Pandas on Spark implementation which mimics the Pandas logic much more closely, but it is obviously also doing a lot more than the native Spark version. Because of those differences, its performance lags a bit.

fdosani commented 3 months ago

Reference for the Pandas on Spark implementation: https://github.com/capitalone/datacompy/pull/195