capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0

datacompy v0.12 spark sample with 5 rows only takes more than a minute to execute on databricks #300

Closed · satniks closed this 4 months ago

satniks commented 4 months ago

I executed the default Spark usage sample in a Databricks notebook (compute cluster running Apache Spark 3.5.0). Surprisingly, it took more than a minute for these sample dataframes of only 5 rows each. The legacy Spark compare works nicely and returns results in a few seconds.

Sample code: I just removed the Spark session creation, since one already exists in a Databricks notebook. https://capitalone.github.io/datacompy/spark_usage.html
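For reference, the shape of that sample is roughly the following (a minimal sketch with made-up columns and values, assuming v0.12's SparkCompare mirrors the pandas Compare call signature and takes pyspark.pandas frames):

import datacompy
import pandas as pd
import pyspark.pandas as ps  # pandas-on-Spark, requires Spark >= 3.2

# Two tiny illustrative frames; the docs sample is similarly small (5 rows each).
df1 = ps.from_pandas(pd.DataFrame({"acct_id": [1, 2, 3, 4, 5],
                                   "dollar_amt": [123.45, 456.78, 0.0, 9.99, 5.0]}))
df2 = ps.from_pandas(pd.DataFrame({"acct_id": [1, 2, 3, 4, 5],
                                   "dollar_amt": [123.45, 456.78, 0.01, 9.99, 5.0]}))

compare = datacompy.SparkCompare(df1, df2, join_columns="acct_id",
                                 abs_tol=0.0001, rel_tol=0)
print(compare.report())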

Has anyone verified datacompy 0.12 with Databricks Spark? Does it work as expected, with reasonable performance?

fdosani commented 4 months ago

Interesting. It shouldn't take minutes for 5 rows. I'm wondering if something isn't set up right with the new version. It's using the pandas-on-Spark API under the hood. I can try and test it later today.
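For context, the pandas-on-Spark layer is pyspark.pandas; a plain Spark DataFrame crosses into it like this (a minimal sketch of the conversion, not datacompy internals):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 123.45), (2, 456.78)], ["acct_id", "dollar_amt"])

# Spark DataFrame -> pandas-on-Spark DataFrame (distributed, pandas-like API)
psdf = sdf.pandas_api()  # available in Spark >= 3.2
print(psdf.head())       # pandas-style call, executed as Spark jobs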

fdosani commented 4 months ago

I just ran the example code you pointed to above, and it took maybe 2–3 seconds to return the results. I haven't tested it in Databricks. For context, this is running on my home desktop with just the default Spark settings.

%%timeit
print(compare.report())
...
...
3.66 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Are you able to share your cluster settings?

satniks commented 4 months ago

Thanks @fdosani for the quick check.

I am using a single-node cluster on AWS Databricks with the following configuration. There is nothing else running on this cluster.

Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Node Type: m5d.xlarge (16 GB memory, 4 cores)
Spark Config:
  spark.master local[*, 4]
  spark.databricks.cluster.profile singleNode

Nothing in the init script.
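For a like-for-like check outside Databricks, that setup can be approximated locally (a sketch; the Databricks cluster profile has no local equivalent, and the app name is an arbitrary placeholder):

from pyspark.sql import SparkSession

# Roughly mirrors the single-node cluster above: local master, 4 cores, 16 GB.
spark = (
    SparkSession.builder
    .master("local[4]")
    .config("spark.driver.memory", "16g")
    .appName("datacompy-repro")  # hypothetical name
    .getOrCreate()
)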

fdosani commented 4 months ago

Not really sure how to help here. Can you try running on your local (non-cloud) machine and see if it runs faster? I'm using Spark 3.5.1 on Linux.

satniks commented 4 months ago

> Not really sure how to help here. Can you try running on your local (non-cloud) machine and see if it runs faster? I'm using Spark 3.5.1 on Linux.

Sure, I will check locally. It would be great if someone from the community could test on Databricks as well.

fdosani commented 4 months ago

@satniks I was able to recreate this on Databricks. It is odd, but I don't think this is a datacompy issue but rather something under the hood with Databricks. One thing to consider: for data this small, SparkCompare will perform poorly in general. You really need to get into the millions to billions of records, since that is what it is intended for. If you have smallish data, Polars and pandas will be better in all cases.
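For that small-data path, the core pandas Compare has the same call shape (a minimal sketch with illustrative frames):

import datacompy
import pandas as pd

df1 = pd.DataFrame({"acct_id": [1, 2, 3], "dollar_amt": [123.45, 456.78, 0.0]})
df2 = pd.DataFrame({"acct_id": [1, 2, 4], "dollar_amt": [123.45, 456.80, 9.99]})

# Same idea as SparkCompare, but runs entirely in memory with pandas.
compare = datacompy.Compare(df1, df2, join_columns="acct_id",
                            abs_tol=0, rel_tol=0.001)
print(compare.report())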

satniks commented 4 months ago

> @satniks I was able to recreate this on Databricks. It is odd, but I don't think this is a datacompy issue but rather something under the hood with Databricks. One thing to consider: for data this small, SparkCompare will perform poorly in general. You really need to get into the millions to billions of records, since that is what it is intended for. If you have smallish data, Polars and pandas will be better in all cases.

Thanks @fdosani for confirming this with a small dataset on Databricks. I also tried with a few thousand records and it ran for several minutes, so I cancelled the job assuming it would not succeed. I will check again next week with thousands of records or more and see how long it takes.

fdosani commented 4 months ago

> Thanks @fdosani for confirming this with a small dataset on Databricks. I also tried with a few thousand records and it ran for several minutes, so I cancelled the job assuming it would not succeed. I will check again next week with thousands of records or more and see how long it takes.

Just to confirm, I ran it with various sizes and never had issues with it completing. I just did 100M rows and it took about 3-4 minutes. I'm going to close this issue for now, but if you have other issues feel free to reopen it.
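For anyone reproducing that timing, inputs of that size can be generated directly in Spark rather than loaded (a hypothetical benchmark sketch; `spark` is the active session, and the column name and row count are arbitrary):

import pyspark.sql.functions as F

n = 100_000_000  # ~100M rows, as in the run above

# Two identical frames: a deterministic key plus one seeded random column.
sdf1 = spark.range(n).withColumn("value", F.rand(seed=1))
sdf2 = spark.range(n).withColumn("value", F.rand(seed=1))

# Hand off to the compare via the pandas-on-Spark layer, as in the docs example.
df1, df2 = sdf1.pandas_api(), sdf2.pandas_api()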

satniks commented 3 months ago

@fdosani, is it possible to keep the legacy Spark compare support for a while, as it works seamlessly for our data size?

fdosani commented 3 months ago

@satniks of course. I've been working on a vanilla Spark version which doesn't use the pandas-on-Spark API. It's much faster if you want to try it out. It's on this branch: https://github.com/capitalone/datacompy/tree/vanilla-spark
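To try that branch in a notebook, installing straight from the git ref should work (standard pip VCS syntax; in Databricks, restart the Python process after installing):

%pip install git+https://github.com/capitalone/datacompy.git@vanilla-spark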

The legacy version will stick around for a while; I just don't plan any enhancements.
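For anyone else staying on the legacy comparison for now, its call shape differs from the new one: it takes the Spark session and two Spark DataFrames directly (a sketch of the legacy interface; `base_sdf` and `compare_sdf` are placeholders, and the exact import path has moved between releases, so check your installed version):

import datacompy

# Legacy interface: session first, then the two Spark DataFrames.
comparison = datacompy.legacy.LegacySparkCompare(  # import path varies by version
    spark,
    base_df=base_sdf,
    compare_df=compare_sdf,
    join_columns=["acct_id"],
)
comparison.report()  # the legacy report prints rather than returning a string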