astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License

Cross-matching fails with ArrowInvalid error #335

Closed: hombit closed this issue 5 months ago

hombit commented 5 months ago

Bug report

import lsdb

ztf_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/ztf/ztf_dr14/'
gaia_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/gaia_dr3/gaia/'
gaia_margin_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/gaia_dr3/gaia_10arcs/'

cone_search = dict(ra=283.44639, dec=33.06623, radius_arcsec=3600)

ztf_catalog = lsdb.read_hipscat(ztf_path, columns=["ra", "dec"]).cone_search(**cone_search)
gaia_margin_catalog = lsdb.read_hipscat(gaia_margin_path, columns=["ra", "dec"])
gaia_catalog = lsdb.read_hipscat(
    gaia_path,
    columns=["ra", "dec"],
    margin_cache=gaia_margin_catalog,
).cone_search(**cone_search)

lsdb_result = ztf_catalog.crossmatch(gaia_catalog).compute()
```
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 17
     10 gaia_margin_catalog = lsdb.read_hipscat(gaia_margin_path, columns=["ra", "dec"])
     11 gaia_catalog = lsdb.read_hipscat(
     12     gaia_path,
     13     columns=["ra", "dec"],
     14     margin_cache=gaia_margin_catalog,
     15 ).cone_search(**cone_search)
---> 17 lsdb_result = ztf_catalog.crossmatch(gaia_catalog).compute()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/lsdb/catalog/dataset/dataset.py:39, in Dataset.compute(self)
     37 def compute(self) -> pd.DataFrame:
     38     """Compute dask distributed dataframe to pandas dataframe"""
---> 39     return self._ddf.compute()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/dask/base.py:375, in DaskMethodsMixin.compute(self, **kwargs)
    351 def compute(self, **kwargs):
    352     """Compute this dask collection
    353
    354     This turns a lazy Dask collection into its in-memory equivalent.
    (...)
    373     dask.compute
    374     """
--> 375 (result,) = compute(self, traverse=False, **kwargs)
    376 return result

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/dask/base.py:661, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    658     postcomputes.append(x.__dask_postcompute__())
    660 with shorten_traceback():
--> 661     results = schedule(dsk, keys, **kwargs)
    663 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/lsdb/dask/crossmatch_catalog_data.py:59, in perform_crossmatch(left_df, right_df, right_margin_df, left_pix, right_pix, right_margin_pix, left_catalog_info, right_catalog_info, right_margin_catalog_info, algorithm, suffixes, right_columns, meta_df, **kwargs)
     56 if len(left_df) == 0:
     57     return meta_df
---> 59 right_joined_df = concat_partition_and_margin(right_df, right_margin_df, right_columns)
     61 return algorithm(
     62     left_df,
     63     right_joined_df,
    (...)
     71     suffixes,
     72 ).crossmatch(**kwargs)

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/lsdb/dask/merge_catalog_functions.py:55, in concat_partition_and_margin(partition, margin, right_columns)
     53 margin_renamed = margin[margin_columns_no_hive].rename(columns=rename_columns)
     54 margin_filtered = margin_renamed[right_columns]
---> 55 joined_df = pd.concat([partition, margin_filtered]) if margin_filtered is not None else partition
     56 return joined_df

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/table.pxi:574, in pyarrow.lib.ChunkedArray.cast()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/compute.py:404, in cast(arr, target_type, safe, options, memory_pool)
    402 else:
    403     options = CastOptions.safe(target_type)
--> 404 return call_function("cast", [arr], options, memory_pool)

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/_compute.pyx:590, in pyarrow._compute.call_function()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Integer value 4180559770064781312 not in range: 0 to 9007199254740992
```

It is the same error I get when trying to cast a large uint64 to double:

import pyarrow as pa

pa.scalar(4180559770064781312, type=pa.uint64()).cast(pa.float64())
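
For context, a small sketch of my own (not part of the original report): float64 can represent integers exactly only up to 2**53 = 9007199254740992, which is the upper bound quoted in the error, so PyArrow's default "safe" cast rejects anything larger, while an unsafe cast goes through but loses precision.

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([4180559770064781312], type=pa.uint64())

# The default "safe" cast refuses to lose precision and raises ArrowInvalid,
# since 4180559770064781312 > 2**53 (= 9007199254740992):
try:
    pc.cast(arr, pa.float64())
except pa.ArrowInvalid as exc:
    print(exc)

# An unsafe cast succeeds but silently rounds to the nearest representable float:
print(pc.cast(arr, pa.float64(), safe=False))
```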


hombit commented 5 months ago

I created a pandas issue for this, but I'm not 100% sure that this is actually the issue we're hitting... https://github.com/pandas-dev/pandas/issues/58819

hombit commented 5 months ago

It looks like the root of the issue is that the margin catalog has the wrong index dtype:

import lsdb

gaia_margin_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/gaia_dr3/gaia_10arcs/'
gaia_margin_catalog = lsdb.read_hipscat(gaia_margin_path, columns=["ra", "dec"])
print(gaia_margin_catalog.index.dtype)

prints int64[pyarrow]
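
If that's the problem, here is a guess at a minimal reproduction of what concat_partition_and_margin then hits (the index name and the uint64[pyarrow] dtype of the main catalog index are my assumptions, not verified against the real catalogs): concatenating partitions whose indexes have mismatched Arrow dtypes makes pandas look for a common dtype, which promotes through float64, and the safe cast of the large uint64 spatial index fails.

```python
import pandas as pd

# Assumed setup: the main catalog partition is indexed by uint64[pyarrow]
# spatial-index values, while the margin partition ended up with int64[pyarrow].
partition = pd.DataFrame(
    {"ra": [283.4], "dec": [33.1]},
    index=pd.Index([4180559770064781312], dtype="uint64[pyarrow]", name="_hipscat_index"),
)
margin = pd.DataFrame(
    {"ra": [283.5], "dec": [33.0]},
    index=pd.Index([42], dtype="int64[pyarrow]", name="_hipscat_index"),
)

# Reconciling the mismatched index dtypes should promote through float64;
# the safe Arrow cast of the large uint64 value then fails with the
# same ArrowInvalid as in the traceback above.
pd.concat([partition, margin])
```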

hombit commented 5 months ago

It turned out to be an issue with the margin catalog itself, fixed by https://github.com/astronomy-commons/hipscat-import/pull/325