astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License

PerformanceWarning on catalog import #251

Open delucchi-cmu opened 5 months ago

delucchi-cmu commented 5 months ago

Bug report

I saw a new warning this morning while trying out a DP0 import:

/astro/users/mmd11/git/hipscat-import/src/hipscat_import/catalog/map_reduce.py:289: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  dataframe["Norder"] = np.full(rows_written, fill_value=healpix_pixel.order, dtype=np.uint8)
/astro/users/mmd11/git/hipscat-import/src/hipscat_import/catalog/map_reduce.py:290: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  dataframe["Dir"] = np.full(rows_written, fill_value=healpix_pixel.dir, dtype=np.uint64)
/astro/users/mmd11/git/hipscat-import/src/hipscat_import/catalog/map_reduce.py:291: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  dataframe["Npix"] = np.full(rows_written, fill_value=healpix_pixel.pixel, dtype=np.uint64)

We should try out the suggested pd.concat(axis=1) approach.
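As a rough sketch of what that could look like: build the three partitioning columns in a separate small frame and attach them with a single `pd.concat(axis=1)` call, instead of assigning them one at a time. The helper name `add_partition_columns` and its parameters are hypothetical and are not taken from `map_reduce.py`. Only the column names and dtypes come from the warning output above.

```python
import numpy as np
import pandas as pd


def add_partition_columns(dataframe, order, directory, pixel):
    """Return a new frame with Norder/Dir/Npix appended in one concat.

    Hypothetical helper for illustration; not the hipscat-import API.
    """
    rows = len(dataframe)
    # Build all new columns together, sharing the original index so that
    # concat aligns row-for-row instead of introducing NaNs.
    extra = pd.DataFrame(
        {
            "Norder": np.full(rows, fill_value=order, dtype=np.uint8),
            "Dir": np.full(rows, fill_value=directory, dtype=np.uint64),
            "Npix": np.full(rows, fill_value=pixel, dtype=np.uint64),
        },
        index=dataframe.index,
    )
    # One concat instead of three separate column insertions.
    return pd.concat([dataframe, extra], axis=1)


# Small stand-in for a catalog chunk (the real DP0 table is far wider).
wide = pd.DataFrame(np.zeros((4, 3)), columns=["a", "b", "c"])
result = add_partition_columns(wide, order=1, directory=0, pixel=44)
```

Note that this returns a new frame rather than mutating `dataframe` in place, so the calling code would need to use the returned value.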


delucchi-cmu commented 5 months ago

I suspect this problem is coming up now because the DP0 table is super-duper-gooper wide (900+ columns).
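One way to probe that suspicion outside the import pipeline is a toy reproduction, sketched below. It assumes the warning fires once a frame holds more than roughly a hundred internal blocks, which is a pandas implementation detail that may differ between versions. The function name `count_fragmentation_warnings` is hypothetical, and adding columns one at a time is just one way to fragment a frame. It is not necessarily how the DP0 table ends up fragmented.

```python
import warnings

import numpy as np
import pandas as pd


def count_fragmentation_warnings(n_columns, rows=4):
    """Add n_columns one at a time and count the PerformanceWarnings raised."""
    frame = pd.DataFrame({"start": np.zeros(rows)})
    with warnings.catch_warnings(record=True) as caught:
        # Record every occurrence rather than only the first per location.
        warnings.simplefilter("always")
        for i in range(n_columns):
            # Each new-column assignment adds a separate internal block.
            frame[f"col_{i}"] = np.full(rows, fill_value=i, dtype=np.uint64)
    return sum(
        issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
    )


narrow_count = count_fragmentation_warnings(10)
wide_count = count_fragmentation_warnings(150)
print(f"10 columns: {narrow_count} warnings, 150 columns: {wide_count} warnings")
```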