casangi / xradio

Xarray Radio Astronomy Data IO
https://xradio.readthedocs.io/en/latest/
Other
9 stars 5 forks source link

Nan Int Problem #219

Closed Jan-Willem closed 1 week ago

Jan-Willem commented 1 month ago

In src/xradio/vis/_vis_utils/_utils/xds_helper.py::flatten_xds the nan_int value is used. When this is a very large or small number then things are okay. This value is dependent on the OS (see https://github.com/numpy/numpy/issues/21166). A recent problem is that some versions of Mac convert it to 0 which is a valid value.

    with np.errstate(invalid="ignore"):
        nan_int = np.array([np.nan]).astype("int32")[0]
    txds = xds.copy()
kettenis commented 1 month ago

Well, I'd say NaN doesn't make sense for integers, and you're getting what you're asking for with np.errstate(invalid="ignore"). How do you end up with STATE_IDs and FIELD_IDs like that in the first place?

Jan-Willem commented 1 month ago

The nan_int occurs when we reshape from row->timexbasline and there are missing baselines. The solution is probably just filling these values with some large negative value since they are all ids that are assumed to be positive.

FedeMPouzols commented 4 weeks ago

Yes, this is a tricky issue that we need to keep an eye on. It already came up with some tests that check these int missing values / NaNs: #177. I didn't / wouldn't worry much about flatten_xds(), as it is deprecated/to be deprecated. It allows to experiment going back and forth between the "expanded" (time x baseline) and "flat" (row) arrangements, but I do not think there is any chance that we go back to the flat structure?

But besides flatten_xds() there are a few other places where this behavior could bite us, if we end up having int data variables. For example, when we load MSv2 subtables by row and then "redimension" missing places / NaNs or fill values will appear. We could try to handle this issue by using our own custom missing/fill int values. The fill values could be those "NaN" fill values (-2147483648 / -9223372036854775808) to avoid ambiguities (not ideal but might be the only way)?

I think that the platform that generates 0s from converting float32/64 NaN is mac on M1, but short integer types also seem troublesome. numpy had for a while masked this C undefined behavior with a kind of consistent behavior but now the issue is fully exposed an causing inconsistent behavior between platforms. On the other hand numpy is now a bit more consistent in raising warnings/errors for such conversions. From #177:

This discrepancy should not be a big deal but could bite us if we end up having integer data variables in the final schema.
For example missing "time-baseline" rows in the MS, become "np.nan" values in xdss, with the caveat that for integer data variables they would get whatever special value an np.nan is converted to. It turns out that:
- on linux, nan converted to int32 => -2147483648
- on linux, nan converted to int64 => -9223372036854775808
- but on mac (M1 it seems), nan converted to int => 0

Those 0s could cause ambiguities, but I'd hope that the changes in the xradio schema with respect to the cngi-io schema will remove most if not all those potential ambiguities (especially with ID variables).

The np.errstate(invalid="ignore") takes care of suppressing warnings but the ambiguity is still there. Note that those FIELD_ID and STATE_ID data_vars are no longer there in the recent xradio schema in that same way and should no longer be an issue in the current schema.

This reminded me that with bool variables (flags) the missing/NaN ambiguities can be even more irritating, and one needs to check a companion float/int variable to mask out missing values.

FedeMPouzols commented 1 week ago

This branch has now the fixes needed to prevent issues with risky casts when converting float variables with NaNs to integer. As discussed above, for int32/64 we use custom fill values -2147483648/-9223372036854775808. This also removes most of the np.errstate(invalid="ignore"), which should no longer be needed. So far I don't see any warnings related to this in the tests that I've run locally, and the CI also seems to be happy about it.