Closed jpvillemalard closed 1 year ago
Sorry for the delay @jpvillemalard, I'm just recovering from my vacation :joy:. Yeah, so it isn't an error, just a warning, as you noted. I'm happy to accept changes if you would be interested in making a PR. Taking a quick glance, I don't see any calls to `insert` in `core.py`. So is it being caused by something else?
Do you have some minimal code which could reproduce this warning so I could take a look?
We are not able to share data for confidentiality reasons. Basically, what we are doing is grabbing data from MemSQL/SingleStore and Databricks databases. For scalability, we introduced a batching feature in our code that breaks the full data sets into batches of "equal" size. We are not doing any inserts. Maybe it's the fact that we are processing multiple batches, hence multiple data frames which are not defragmented?
It can be synthetic data, just enough to recreate the actual issue you are having.
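For anyone who wants a starting point, here is a minimal synthetic sketch that should trigger the same pandas `PerformanceWarning` on recent pandas versions. It is an assumption about one common cause (adding many columns one at a time), not necessarily what happens at `core.py:342`. The helper names and the sizes are made up for illustration.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

N_ROWS, N_COLS = 1_000, 150  # arbitrary synthetic sizes, not taken from the issue


def build_by_assignment() -> pd.DataFrame:
    """Add columns one at a time; each new column is stored as its own block."""
    df = pd.DataFrame(index=range(N_ROWS))
    for i in range(N_COLS):
        df[f"col_{i}"] = np.arange(N_ROWS, dtype="float64") + i
    return df


def build_by_concat() -> pd.DataFrame:
    """Create every column first, then join them with one pd.concat call."""
    columns = {
        f"col_{i}": pd.Series(np.arange(N_ROWS, dtype="float64") + i)
        for i in range(N_COLS)
    }
    return pd.concat(columns, axis=1)


def count_fragmentation_warnings(builder) -> int:
    """Run builder() and count the PerformanceWarnings it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        builder()
    return sum(issubclass(w.category, PerformanceWarning) for w in caught)


if __name__ == "__main__":
    print("assignment:", count_fragmentation_warnings(build_by_assignment))
    print("concat:", count_fragmentation_warnings(build_by_concat))
```

Both builders produce the same data; only the column-by-column version is expected to warn, which is the situation the warning text describes.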
We have decided to use the SparkCompare class instead and leverage our Databricks compute layer.
The following warning is emitted when comparing two dataframes:
/usr/local/lib/python3.9/site-packages/datacompy/core.py:342: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling 'frame.insert' many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use 'newframe = frame.copy()'
We acknowledge this is a warning and does not impact the overall process, but could `concat` improve performance? Can we suppress this message from the console?
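On the suppression question, one possible approach is a scoped filter from Python's standard `warnings` module that matches the start of the message text. This is only a sketch: `quiet_fragmentation_warning` and `noisy_step` are hypothetical names, and `noisy_step` is a stand-in that simulates the warning so the snippet runs without any data. In real code the comparison call would go inside the `with` block.

```python
import warnings
from contextlib import contextmanager

# Start of the message quoted in the traceback above.
FRAGMENTED_PREFIX = "DataFrame is highly fragmented"


@contextmanager
def quiet_fragmentation_warning():
    """Ignore only the 'highly fragmented' warning inside the with block."""
    with warnings.catch_warnings():
        # 'message' is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message=FRAGMENTED_PREFIX)
        yield


def noisy_step() -> str:
    """Stand-in for the real comparison; it only simulates the warning."""
    warnings.warn(FRAGMENTED_PREFIX + ". (simulated)", UserWarning)
    return "done"


if __name__ == "__main__":
    with quiet_fragmentation_warning():
        print(noisy_step())  # the simulated warning is not shown here
```

This only hides the message and does not change performance. As the warning text itself suggests, `newframe = frame.copy()` returns a de-fragmented frame if the fragmentation originates in your own code.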