Closed: satniks closed this issue 4 months ago
Interesting. It shouldn't take minutes for 5 rows. I'm wondering if something isn't set up right with the new version. It's using the pandas-on-Spark API under the hood. I can try to test it later today.
I just ran the example code you pointed to above, and it took maybe 2–3 seconds to return the results. I haven't tested it in Databricks. For context, this is running on my home desktop with just the default Spark settings.
%%timeit
print(compare.report())
...
...
3.66 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
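Outside a notebook, the same 7-runs-of-1-loop timing that `%%timeit` reports can be reproduced with the stdlib `timeit` module. A minimal sketch; `run_report` is a placeholder workload standing in for the `compare.report()` call, which isn't shown in full here:

```python
import timeit

def run_report():
    # Placeholder workload; substitute the real compare.report() call.
    return sum(i * i for i in range(100_000))

# Mirror %%timeit's default display: 7 runs, 1 loop each.
times = timeit.repeat(run_report, repeat=7, number=1)
mean = sum(times) / len(times)
std = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
print(f"{mean:.3f} s ± {std:.3f} s per loop (mean ± std. dev. of 7 runs, 1 loop each)")
```

This is handy for comparing the same call across environments (local vs. Databricks) without notebook magics.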
Are you able to share your cluster settings?
Thanks @fdosani for the quick check.
I am using a single-node cluster on AWS Databricks with the following configuration. Nothing else is running on this cluster.
Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Node Type: m5d.xlarge (16 GB memory, 4 cores)
Spark Config: spark.master local[*, 4]; spark.databricks.cluster.profile singleNode
Nothing in the init script.
Not really sure how to help here. Can you try running on your local (non-cloud) machine and see if it runs faster? I'm using Spark 3.5.1 on Linux.
Sure, I will check locally. It would be great if someone from the community could test on Databricks as well.
@satniks I was able to recreate this on Databricks. It is odd, but I don't think this is a datacompy issue but rather something under the hood with Databricks. One thing to consider is that for such small data SparkCompare will perform poorly in general. You really need to get into the millions to billions of records, since that is what it is intended for. If you have smallish data, Polars and Pandas will be better in all cases.
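For small tables, the pandas path is indeed much cheaper. A minimal sketch of the kind of row-level comparison involved, using a plain pandas outer merge with `indicator=True` (this is not datacompy's actual implementation; the frames and column names are illustrative):

```python
import pandas as pd

# Two small illustrative frames keyed on "acct_id".
df1 = pd.DataFrame({"acct_id": [1, 2, 3, 4, 5],
                    "amount": [10.0, 20.0, 30.0, 40.0, 50.0]})
df2 = pd.DataFrame({"acct_id": [1, 2, 3, 4, 6],
                    "amount": [10.0, 20.0, 31.0, 40.0, 60.0]})

# indicator=True labels each row: "both", "left_only", or "right_only".
merged = df1.merge(df2, on="acct_id", how="outer",
                   indicator=True, suffixes=("_df1", "_df2"))

rows_in_both = merged[merged["_merge"] == "both"]
mismatches = rows_in_both[rows_in_both["amount_df1"] != rows_in_both["amount_df2"]]
print(len(rows_in_both), len(mismatches))  # 4 rows joined, 1 value mismatch
```

At this scale the whole comparison is a single in-memory merge, which is why a distributed engine's scheduling overhead dominates for tiny inputs.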
Thanks @fdosani for confirming the small-dataset behaviour on Databricks. I also tried with a few thousand records and it ran for several minutes, so I cancelled the job assuming it would not succeed. I will check again with thousands or more records next week and see how long it takes.
Just to confirm, I ran it with various sizes and never had issues with it completing. I just did 100M rows and it took about 3–4 minutes. I'm going to close this issue for now, but if you have other issues feel free to reopen it.
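One way to sanity-check scaling like this locally is to generate synthetic frames of increasing size and time the comparison at each step. A sketch using pandas for portability (swap in your Spark DataFrames and the real compare call for an actual test; `make_frames` and its column names are made up for illustration):

```python
import time
import numpy as np
import pandas as pd

def make_frames(n, seed=0):
    # Build two frames that agree on ~99% of rows.
    rng = np.random.default_rng(seed)
    base = pd.DataFrame({"id": np.arange(n), "val": rng.random(n)})
    other = base.copy()
    flip = rng.choice(n, size=max(1, n // 100), replace=False)
    other.loc[flip, "val"] += 1.0  # introduce mismatches on ~1% of rows
    return base, other

for n in (1_000, 10_000, 100_000):
    df1, df2 = make_frames(n)
    start = time.perf_counter()
    merged = df1.merge(df2, on="id", suffixes=("_a", "_b"))
    mismatches = int((merged["val_a"] != merged["val_b"]).sum())
    elapsed = time.perf_counter() - start
    print(f"n={n}: {mismatches} mismatches in {elapsed:.3f}s")
```

Plotting elapsed time against `n` makes it easy to see whether a slowdown is fixed overhead (flat at small sizes) or genuinely scaling with the data.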
@fdosani, is it possible to keep legacy SparkCompare support for a while, as it works seamlessly for our data size?
@satniks Of course. I've been working on a vanilla Spark version which doesn't use the pandas-on-Spark API. It's much faster if you want to try it out, on this branch: https://github.com/capitalone/datacompy/tree/vanilla-spark
The legacy version will stick around for a while; I just don't plan any enhancements.
I executed the default Spark usage sample in a Databricks notebook (compute cluster running Apache Spark 3.5.0). Surprisingly, it took more than a minute for these sample dataframes with only 5 rows each. The legacy SparkCompare works nicely and gives results in a few seconds.
Sample code: I just removed the Spark session creation, as one already exists in a Databricks notebook. https://capitalone.github.io/datacompy/spark_usage.html
Has anyone verified datacompy 0.12 with Databricks Spark? Does it work as expected with reasonable performance?