astronomy-commons / lsdb

Large Survey DataBase
https://lsdb.io
BSD 3-Clause "New" or "Revised" License

Inconsistency in crossmatch column data types #273

Closed camposandro closed 2 months ago

camposandro commented 7 months ago

Bug report

If we decide to keep the non-matches, the crossmatch dataframe can contain NaN values. Every point in the left partitions gets a row holding the left point's information plus the information of its match on the right; when no match exists, the right-hand values are set to NaN.

When a row with NaN values is assigned to a dataframe, pandas automatically casts the whole column to "float" if it was an integer column. Columns such as Norder_{}_xmatch, Dir_{}_xmatch and Npix_{}_xmatch therefore end up with an incorrect type.
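A minimal sketch of the upcast, assuming plain NumPy-backed pandas frames and a left join as a stand-in for a crossmatch that keeps non-matches. The frames and the column names `id_left`, `id_right` and `Norder_right` are made up for illustration and are not taken from lsdb:

```python
import pandas as pd

# Hypothetical stand-ins for one left and one right partition.
left = pd.DataFrame({"id_left": [1, 2, 3]})
right = pd.DataFrame({"id_right": [1, 2], "Norder_right": [1, 1]})

# The left join keeps the unmatched left row (id 3); its right-hand
# columns have no value, so pandas fills them with NaN.
joined = left.merge(right, how="left", left_on="id_left", right_on="id_right")

original_dtype = right["Norder_right"].dtype  # integer before the join
joined_dtype = joined["Norder_right"].dtype   # float once NaN is present

print(f"before: {original_dtype}, after: {joined_dtype}")
```

The integer column cannot hold NaN, so the whole column changes type even though only one row is missing.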


We should create an end-to-end test to verify that the column data types of the original catalogs remain unchanged.
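One possible shape for such a check, sketched with plain pandas only. The helper `assert_dtypes_preserved` and the `_left` suffix are hypothetical; a real test would run an actual crossmatch and compare against the real catalog columns:

```python
import pandas as pd


def assert_dtypes_preserved(original: pd.DataFrame, result: pd.DataFrame, suffix: str) -> None:
    """Hypothetical helper: every original column must appear in the
    result (with the given suffix) and keep its original dtype."""
    for name, dtype in original.dtypes.items():
        result_name = f"{name}{suffix}"
        assert result_name in result.columns, f"missing column {result_name}"
        actual = result[result_name].dtype
        assert actual == dtype, f"{result_name}: expected {dtype}, got {actual}"


# Toy data standing in for a catalog and a crossmatch result.
catalog = pd.DataFrame({"Norder": [1, 1], "ra": [10.0, 20.0]})
good_result = catalog.add_suffix("_left")
bad_result = good_result.astype({"Norder_left": "float64"})

assert_dtypes_preserved(catalog, good_result, "_left")  # passes silently

try:
    assert_dtypes_preserved(catalog, bad_result, "_left")
    upcast_detected = False
except AssertionError:
    upcast_detected = True  # the float upcast is caught
```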


delucchi-cmu commented 5 months ago

Has this been addressed (or a little bit improved) by the pyarrow dtype changes?

camposandro commented 5 months ago

@delucchi-cmu yes, supporting None values by default using pyarrow should fix the column types. We're holding off on the merge of #271 this week but I might try to build some end-to-end tests in the meantime to make sure the output columns of the crossmatch indeed remain the same!

delucchi-cmu commented 2 months ago

This has been addressed by the recent switch to pyarrow types and by holding on to the pyarrow schema throughout operations.

camposandro commented 2 months ago

We should make sure the Dask DataFrame meta and the pyarrow schema are consistent whenever we address https://github.com/astronomy-commons/lsdb/issues/390.