casangi / xradio

Xarray Radio Astronomy Data IO

processing_set.summary broken. #159

Closed vsuorant closed 2 months ago

vsuorant commented 2 months ago

ps.summary() fails with KeyError: 'field_info'

import pandas as pd

pd.options.display.max_colwidth = 100
ps_store = "s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr"

from xradio.vis.read_processing_set import read_processing_set

intents = ["OBSERVE_TARGET#ON_SOURCE"]
fields = None
ps = read_processing_set(
    ps_store="s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr",
    intents=intents,
    fields=fields,
)
display(ps.summary())


KeyError                                  Traceback (most recent call last)
Cell In[35], line 15
      9 fields = None
     10 ps = read_processing_set(
     11     ps_store="s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr",
     12     intents=intents,
     13     fields=fields,
     14 )
---> 15 display(ps.summary())

File /opt/conda/lib/python3.10/site-packages/xradio/vis/_processing_set.py:19, in (self, data_group)
     17     return self.meta["summary"][data_group]
     18 else:
---> 19     self.meta["summary"][data_group] = self._summary(data_group)
     20     return self.meta["summary"][data_group]

File /opt/conda/lib/python3.10/site-packages/xradio/vis/_processing_set.py:64, in processing_set._summary(self, data_group)
     59 data_name = value.attrs["data_groups"][data_group]["spectrum"]
     61 summary_data["shape"].append(value[data_name].shape)
     63 summary_data["field_id"].append(
---> 64     value[data_name].attrs["field_info"]["field_id"]
     65 )
     66 summary_data["field_name"].append(
     67     value[data_name].attrs["field_info"]["name"]
     68 )
     69 summary_data["start_frequency"].append(value["frequency"].values[0])

KeyError: 'field_info'

amcnicho commented 2 months ago

It looks like there is a difference between the dataset produced by

graphviper.utils.data.download(file="Antennae_North.cal.lsrk.split.ms")
xradio.vis.convert_msv2_to_processing_set(
    in_file="Antennae_North.cal.lsrk.split.ms",
    out_file="Antennae_North.cal.lsrk.split.vis.zarr",
    partition_scheme="ddi_intent_field",
)
local_ps = read_processing_set(
    "Antennae_North.cal.lsrk.split.vis.zarr",
    intents=["OBSERVE_TARGET#ON_SOURCE"],
)

and the dataset returned by

cloud_ps = xradio.vis.read_processing_set(
    "s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr", 
    intents = ["OBSERVE_TARGET#ON_SOURCE"]
)

This can be verified by running the following command which returns False for every MSv4 key:

[local_ps[key].identical(cloud_ps[key]) for key in local_ps.keys()]

This does not necessarily mean that the function is broken; the dataset uploaded to S3 might instead be incomplete or incorrect. From my inspection, the only difference between local_ps and cloud_ps is the presence of three attribute keys: weather_xds and pointing_xds are unique to local_ps, and field_info is unique to cloud_ps. This is true for all three of the MSv4 objects inside the processing sets.

Someone who knows more about this dataset should comment on whether the function isn't working properly (I agree with Ville that raising an exception upon encountering unexpected attributes does not seem great) and whether the dataset uploaded to S3 seems incomplete or incorrect.
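
The attribute-key comparison described above can be sketched roughly as follows. Plain dictionaries stand in for the attrs of one MSv4 from each processing set; the helper name diff_attr_keys and the sample values are hypothetical and not part of xradio.

```python
def diff_attr_keys(local_attrs, cloud_attrs):
    """Return the attribute keys unique to each side as two sorted lists."""
    local_keys, cloud_keys = set(local_attrs), set(cloud_attrs)
    return sorted(local_keys - cloud_keys), sorted(cloud_keys - local_keys)


# Stand-in attrs (assumed values); in practice these would come from
# local_ps[key].attrs and cloud_ps[key].attrs for each MSv4 key.
local_attrs = {"data_groups": {}, "weather_xds": None, "pointing_xds": None}
cloud_attrs = {"data_groups": {}, "field_info": {"field_id": 0}}

only_local, only_cloud = diff_attr_keys(local_attrs, cloud_attrs)
print("only in local:", only_local)
print("only in cloud:", only_cloud)
```

Comparing key sets first is a cheap way to narrow down why identical() returns False before comparing any values.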

vsuorant commented 2 months ago

I don't know what this comment refers to: "(I agree with Ville that raising an exception upon encountering unexpected attributes does not seem great)". I assumed that field_info was required and saw that it had been reshuffled in the schema recently, so I assumed that this was likely limited to this particular function. I didn't think to check the dataset for anomalies.

FedeMPouzols commented 2 months ago

The S3 version shows traits that would imply/require a rather old version of xradio/convert_msv2_to_processing_set, pre-dating these PRs (from Feb): https://github.com/casangi/xradio/pull/138, https://github.com/casangi/xradio/pull/127, and https://github.com/casangi/xradio/pull/126. Note that the S3 PS does have 'field_info', but in the top-level MSv4 attrs.

In those Feb PRs the field_info was moved from the general MSv4 attrs to inside the vis data variables attrs, and the weather_xds and pointing_xds were added. When was the S3 version produced? Is it easy to change it to a newly converted MS (with current xradio)?
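
To make the two layouts concrete, here is a minimal sketch that reports where field_info lives for a given MSv4. Plain dictionaries stand in for the MSv4 attrs and the vis data variable attrs; the helper find_field_info and the sample values are hypothetical, not xradio API.

```python
def find_field_info(msv4_attrs, data_var_attrs):
    """Return (location, field_info), or (None, None) if absent from both."""
    # Layout after the Feb PRs: field_info sits in the vis data variable attrs.
    if "field_info" in data_var_attrs:
        return "data variable attrs", data_var_attrs["field_info"]
    # Older layout: field_info sits in the top-level MSv4 attrs.
    if "field_info" in msv4_attrs:
        return "top-level MSv4 attrs", msv4_attrs["field_info"]
    return None, None


# Assumed sample content, for illustration only.
sample = {"field_id": 0, "name": "example"}
old_layout = find_field_info({"field_info": sample}, {})
new_layout = find_field_info({}, {"field_info": sample})
print(old_layout[0])
print(new_layout[0])
```

A check like this on one MSv4 from the S3 store would show which converter layout produced it, without triggering the KeyError in summary().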

amcnicho commented 2 months ago

Producing and uploading the dataset to S3 took place at the end of March, well after those PRs had merged. It's possible that an old release was used to run the conversion, but I've been using an editable local installation, so it should have been recent enough to contain those changes. It is not hard to change the contents of the bucket to a freshly converted MS. I'll try that and see if it fixes the problem.
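
One way to check which release an environment would use for the conversion is to query the installed package metadata. This is a sketch using the standard library only; the helper name installed_version is hypothetical. For an editable installation, the reported version comes from the package metadata and may not reflect the checked-out commit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("xradio:", installed_version("xradio"))
```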

vsuorant commented 2 months ago

@amcnicho fixed the data set, and ps.summary() now works again.