Inconsistency in crossmatch column data types

camposandro commented 7 months ago

Bug report

If we decide to keep non-matches, the crossmatch dataframe can contain NaN values. Every point in the left partitions gets a row with the left point's information plus the information of its match on the right; when no match exists, the right-hand values are set to NaN.

When a row with NaN values is assigned to a dataframe, pandas automatically casts the whole column to "float", because its default integer dtype cannot represent missing values. Columns such as Norder_{}_xmatch, Dir_{}_xmatch and Npix_{}_xmatch therefore end up with an incorrect type.
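The upcast can be reproduced with plain pandas, independent of lsdb. This is a minimal sketch: the frames and the column names `Norder_left`, `Norder_right` and `Npix_right` are made up for illustration, and a left merge stands in for a crossmatch that keeps non-matches.

```python
import pandas as pd

# Hypothetical stand-ins for the left and right catalogs.
left = pd.DataFrame({"id": [1, 2, 3], "Norder_left": [5, 5, 5]})
right = pd.DataFrame(
    {"id": [1, 3], "Norder_right": [6, 6], "Npix_right": [100, 102]}
)

# A left merge keeps id=2 even though it has no match on the right.
merged = left.merge(right, on="id", how="left")

# The right-hand integer columns now hold NaN for id=2 and were upcast.
print(right["Norder_right"].dtype)   # int64
print(merged["Norder_right"].dtype)  # float64
print(merged["Norder_left"].dtype)   # int64 (no missing values, unchanged)
```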

We should create an end-to-end test to verify that the column data types of the original catalogs remain unchanged.
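One possible shape for the dtype check in such a test is sketched below. The helper `changed_dtypes` is hypothetical and not part of lsdb, and a pandas left merge again stands in for the real crossmatch. An actual end-to-end test would also have to map the suffixed output column names back to the original ones.

```python
import pandas as pd


def changed_dtypes(original: pd.DataFrame, result: pd.DataFrame) -> dict:
    """Map each shared column whose dtype differs to (original, result) dtypes."""
    return {
        col: (original[col].dtype, result[col].dtype)
        for col in original.columns
        if col in result.columns and original[col].dtype != result[col].dtype
    }


# Hypothetical catalogs; id=2 has no match on the right.
left = pd.DataFrame({"id": [1, 2, 3], "ra": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"id": [1, 3], "Norder": [6, 6], "Npix": [100, 102]})
result = left.merge(right, on="id", how="left")

# A test would assert that both dictionaries are empty.
print(changed_dtypes(left, result))
print(changed_dtypes(right, result))
```

With the default numpy-backed dtypes the second dictionary is not empty, which is the failure this issue describes.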

Before submitting, please check the following:

[X] I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
[X] I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
[X] If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.

delucchi-cmu commented 5 months ago

Has this been addressed (or a little bit improved) by the pyarrow dtype changes?

camposandro commented 5 months ago

@delucchi-cmu yes, supporting None values by default using pyarrow should fix the column types. We're holding off on the merge of #271 this week but I might try to build some end-to-end tests in the meantime to make sure the output columns of the crossmatch indeed remain the same!

delucchi-cmu commented 2 months ago

This has been addressed by the recent switch to pyarrow types and by holding on to the pyarrow schema throughout operations.

camposandro commented 2 months ago

We should make sure the Dask DataFrame meta and the pyarrow schema are consistent whenever we address https://github.com/astronomy-commons/lsdb/issues/390.

astronomy-commons / lsdb

Inconsistency in crossmatch column data types #273