casangi / xradio

Xarray Radio Astronomy Data IO
https://xradio.readthedocs.io/en/latest/

Failures to convert MSs from casatestdata, related to latest field_source_info: fix or double-check #196

Closed FedeMPouzols closed 1 month ago

FedeMPouzols commented 1 month ago

So far the following issues:

A few remaining issues being investigated (20240725), with partition_scheme=["FIELD_ID"]:

Without "FIELD_ID" partitioning, we have the same 3 as above and:

FedeMPouzols commented 1 month ago

After the VLASS OTF PR, several ALMASD datasets produce an error in extract_source_info(): ValueError('different number of dimensions on data and dims: 3 vs 2')

SOURCE/DIRECTION is normally a [1, 2] array, which takes up only one dimension for the coordinates when loaded into an xds. In some datasets, however, it appears to be transposed and is given as a [2, 1] array. That produces an additional dimension in the DIRECTION variable, DIRECTION (SOURCE_ID, TIME, SPECTRAL_WINDOW_ID, dim_1, dim_2), which after the selection isel(TIME=0, SPECTRAL_WINDOW_ID=0, drop=True) in extract_source_info() remains as DIRECTION (SOURCE_ID, dim_1, dim_2). That is the issue. This branch now has a fix that drops the unexpected 1-sized dimension if it is present.
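To illustrate the idea behind that fix, here is a minimal sketch using plain NumPy arrays rather than the actual xds. The function name drop_unit_axes and the example shapes are assumptions made for illustration; this is not the code in the branch.

```python
import numpy as np


def drop_unit_axes(direction: np.ndarray) -> np.ndarray:
    """Return DIRECTION with shape (n_sources, 2).

    Hypothetical helper: any 1-sized axis after the leading SOURCE_ID axis
    (as produced by a transposed [2, 1] cell) is squeezed away.
    """
    # Only look at axes after the first one, so a table with a single
    # source (leading axis of size 1) keeps its SOURCE_ID axis.
    unit_axes = tuple(
        ax for ax in range(1, direction.ndim) if direction.shape[ax] == 1
    )
    if unit_axes:
        direction = np.squeeze(direction, axis=unit_axes)
    if direction.ndim != 2 or direction.shape[-1] != 2:
        raise ValueError(f"unexpected DIRECTION shape: {direction.shape}")
    return direction


usual = np.zeros((4, 2))          # the usual layout after the isel() selection
transposed = np.zeros((4, 2, 1))  # the layout with the extra 1-sized dimension

usual_fixed = drop_unit_axes(usual)
transposed_fixed = drop_unit_axes(transposed)
```

On an xarray DataArray the same effect can be had with DataArray.squeeze(dim=..., drop=True), applied only to the unexpected dimension.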

Another amusing point is that these example SOURCE subtables (sdimaging_flagtest.ms, selection_intent.ms, selection_misc.m, etc.) also have the column PULSAR_ID, set to 0.

All these example MSs are produced by the CASA simulator, which is probably the source of the issue. Similar failures must happen in the other groups of test MSs (ALMA, VLA, Other, etc.), but those are currently masked behind more common errors that trigger some of the early asserts in extract_source_info().

FedeMPouzols commented 1 month ago

After the last (second) commit the common AssertionError('Can only process source table with a single time entry for a source_id and spectral_window_id.') issue seems fixed. I think we should still turn these asserts into exceptions, and the check could be stricter, ensuring that for every source_id the time is unique (perhaps by looping over selections of each unique source_id).
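A rough sketch of what such a stricter check could look like, raising an exception instead of asserting. The names SourceTableError and check_single_time_per_source are hypothetical, and the rows are plain tuples here rather than the SOURCE subtable as loaded by xradio.

```python
from collections import defaultdict


class SourceTableError(ValueError):
    """Hypothetical exception for SOURCE subtables that cannot be processed."""


def check_single_time_per_source(rows):
    """Raise if any (source_id, spectral_window_id) pair has more than one time.

    rows: iterable of (source_id, spectral_window_id, time) tuples.
    """
    times = defaultdict(set)
    for source_id, spw_id, time in rows:
        times[(source_id, spw_id)].add(time)

    # Collect every offending pair so the error message names all of them.
    offending = sorted(key for key, seen in times.items() if len(seen) > 1)
    if offending:
        raise SourceTableError(
            "Can only process source table with a single time entry for a "
            f"source_id and spectral_window_id. Offending pairs: {offending}"
        )


good_rows = [(0, 0, 10.0), (0, 1, 10.0), (1, 0, 12.5)]
bad_rows = good_rows + [(1, 0, 99.0)]  # second time entry for the pair (1, 0)

check_single_time_per_source(good_rows)  # passes silently
```

Unlike an assert, the exception is not stripped when Python runs with -O, and callers can catch it specifically.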

The count of errors is down from ~66 to ~8, at least for now.

FedeMPouzols commented 1 month ago

With the latest commits (which also bring in, via main, some of 168-review-ms_xdsattrsantenna_xds-schema-and-xradio-interface), the issues in ALMASD and EVLA datasets all seem fixed. We are down to:
- VLA: 1 failure (some dimensionality mismatch)
- ALMA: 5 failures (2 problems with SOURCE_ID, 2 with EPHEMERIS_ID, plus crazySourceTable.ms, which is probably an acceptable failure)
- Others: 2 failures related to dimension sizes

FedeMPouzols commented 1 month ago

After the last commits above this comment the remaining issues seem to be a handful of specific MSs:

With partition_scheme=["FIELD_ID"]:

Without "FIELD_ID" partitioning, we have the same 3 as above and:

FedeMPouzols commented 1 month ago

After the last few commits I see only one failure left, with crazySourceTable.ms, which produces a legitimate: "Can only process source table with a single time entry for a source_id and spectral_window_id." (see reasons in the issue description).