capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
https://capitalone.github.io/datacompy/
Apache License 2.0

DataFrame is highly fragmented warning #188

Closed jpvillemalard closed 1 year ago

jpvillemalard commented 1 year ago

The following warning is emitted when comparing two DataFrames:

/usr/local/lib/python3.9/site-packages/datacompy/core.py:342: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling 'frame.insert' many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use 'newframe = frame.copy()'

We acknowledge this is only a warning and does not impact the overall process, but could using pd.concat improve performance? And can we suppress this message in the console?

fdosani commented 1 year ago

Sorry for the delay @jpvillemalard, just recovering from my vacation :joy: . Yeah, it isn't an error, just a warning, as you noted. I'm happy to accept changes if you would be interested in making a PR. Taking a quick glance, I don't see any calls to insert in core.py, so it seems to be caused by something else.

Do you have some minimal code which could reproduce this warning so I could take a look?
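For reference, the warning itself can be reproduced with synthetic data alone: pandas emits it once a DataFrame accumulates many internal blocks from repeated column assignment (which is what `frame.insert` does under the hood). A sketch with made-up column names, no datacompy required:

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(10))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Each assignment adds a new internal block; once the frame
    # holds many blocks, pandas warns that it is fragmented.
    for i in range(150):
        df[f"col_{i}"] = np.arange(10)

fragmented = [w for w in caught
              if issubclass(w.category, pd.errors.PerformanceWarning)]

# Building the same frame in one shot with pd.concat avoids the
# repeated inserts, which is what the warning message suggests.
cols = {f"col_{i}": np.arange(10) for i in range(150)}
df2 = pd.concat([pd.Series(v, name=k) for k, v in cols.items()], axis=1)
```

Whether the warning actually fires depends on the pandas version and how wide the frames are, which may explain why it only shows up on some data sets.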

jpvillemalard commented 1 year ago

We are not able to share the data for confidentiality reasons. Essentially, we grab data from MemSQL/SingleStore and Databricks databases. For scalability we introduced a batching feature in our code that breaks the full data sets into batches of roughly equal size. We are not doing any inserts ourselves. Could it be that processing multiple batches, and hence multiple DataFrames, leaves those frames fragmented?
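The batching described above might be sketched as follows (this is an illustration under assumed names, not the actual code). Copying each slice follows the hint in the warning message, since `.copy()` returns a de-fragmented frame:

```python
import numpy as np
import pandas as pd

def split_into_batches(df: pd.DataFrame, n_batches: int) -> list[pd.DataFrame]:
    """Split a DataFrame into n_batches of roughly equal size.

    Each batch is copied so that downstream comparisons operate
    on de-fragmented frames, per the pandas warning's suggestion.
    """
    bounds = np.linspace(0, len(df), n_batches + 1, dtype=int)
    return [df.iloc[start:stop].copy()
            for start, stop in zip(bounds[:-1], bounds[1:])]

# Hypothetical data standing in for the MemSQL/Databricks extracts.
df = pd.DataFrame({"id": range(1000), "value": np.random.rand(1000)})
batches = split_into_batches(df, 4)
```

Each batch could then be fed to a separate datacompy comparison.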

fdosani commented 1 year ago

It can be synthetic data, just enough to recreate the actual issue you are having.

jpvillemalard commented 1 year ago

We have decided to use the SparkCompare class instead and leverage our Databricks compute layer.