Closed (vsuorant closed this issue 2 months ago)
It looks like there is a difference between the dataset produced by
graphviper.utils.data.download(file="Antennae_North.cal.lsrk.split.ms")
xradio.vis.convert_msv2_to_processing_set(
    in_file="Antennae_North.cal.lsrk.split.ms",
    out_file="Antennae_North.cal.lsrk.split.vis.zarr",
    partition_scheme="ddi_intent_field",
)
local_ps = xradio.vis.read_processing_set(
    "Antennae_North.cal.lsrk.split.vis.zarr",
    intents=["OBSERVE_TARGET#ON_SOURCE"],
)
and the dataset returned by
cloud_ps = xradio.vis.read_processing_set(
    "s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr",
    intents=["OBSERVE_TARGET#ON_SOURCE"],
)
This can be verified by running the following expression, which returns False for every MSv4 key:
[local_ps[key].identical(cloud_ps[key]) for key in local_ps.keys()]
This does not necessarily mean that the function is broken; the dataset uploaded to S3 might instead be incomplete or incorrect. From my inspection, the only difference between local_ps and cloud_ps is the presence of three attribute keys:
weather_xds and pointing_xds are unique to local_ps.
field_info is unique to cloud_ps.
This is true for all three of the MSv4 objects inside the processing sets. Someone who knows more about this dataset should comment on whether this looks like the function isn't working properly (I agree with Ville that raising an exception upon encountering unexpected attributes does not seem great) and whether the dataset uploaded to S3 seems incomplete or incorrect.
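For anyone who wants to repeat the attribute comparison described above, here is a minimal sketch. It assumes only that both processing sets behave like dictionaries of objects exposing an `.attrs` dict (as xarray datasets do). The helper name `diff_attr_keys` and the stand-in objects are hypothetical and are not part of xradio.

```python
from types import SimpleNamespace


def diff_attr_keys(local_ps, cloud_ps):
    """For each key present in both sets, list attribute names unique to each side."""
    report = {}
    for key in sorted(set(local_ps.keys()) & set(cloud_ps.keys())):
        local_keys = set(local_ps[key].attrs)
        cloud_keys = set(cloud_ps[key].attrs)
        report[key] = {
            "only_local": sorted(local_keys - cloud_keys),
            "only_cloud": sorted(cloud_keys - local_keys),
        }
    return report


# Stand-ins for MSv4 objects (hypothetical key and attribute values);
# a real xarray.Dataset also exposes `.attrs`.
local_ps = {
    "msv4_0": SimpleNamespace(
        attrs={"data_groups": {}, "weather_xds": None, "pointing_xds": None}
    )
}
cloud_ps = {
    "msv4_0": SimpleNamespace(attrs={"data_groups": {}, "field_info": {}})
}

print(diff_attr_keys(local_ps, cloud_ps))
```

This only compares the top-level attribute names, so it would not catch differences in attribute values or in the attrs of individual data variables.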
I don't know what this comment refers to: "(I agree with Ville that raising an exception upon encountering unexpected attributes does not seem great)". I assumed that field_info was required and saw that it had been reshuffled in the schema recently, so I assumed that this was likely limited to this particular function. I didn't think to check the dataset for anomalies.
The S3 version shows traits that imply a rather old version of xradio/convert_msv2_to_processing_set, pre-dating these PRs (from Feb): https://github.com/casangi/xradio/pull/138, https://github.com/casangi/xradio/pull/127, https://github.com/casangi/xradio/pull/126. Note that the S3 PS does have 'field_info', but in the top-level MSv4 attrs.
In those Feb PRs, field_info was moved from the general MSv4 attrs to the attrs of the vis data variables, and weather_xds and pointing_xds were added. When was the S3 version produced? Is it easy to replace it with a newly converted MS (with current xradio)?
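The two layouts described above could be told apart with a small check like the one below. This is only a sketch: `locate_field_info`, the stand-in classes, and the data variable name `VISIBILITY` are hypothetical, and the only assumption about a real MSv4 is that it supports `msv4[name].attrs` and `msv4.attrs`.

```python
def locate_field_info(msv4, data_name):
    """Report where field_info is stored: on the data variable, at top level, or nowhere."""
    data_attrs = msv4[data_name].attrs
    if "field_info" in data_attrs:
        return "data_variable", data_attrs["field_info"]
    if "field_info" in msv4.attrs:
        return "top_level", msv4.attrs["field_info"]
    return "missing", None


class FakeVariable:
    """Stand-in for a data variable with an attrs dict."""

    def __init__(self, attrs):
        self.attrs = attrs


class FakeMSv4:
    """Stand-in for an MSv4 dataset: top-level attrs plus named variables."""

    def __init__(self, attrs, variables):
        self.attrs = attrs
        self._variables = variables

    def __getitem__(self, name):
        return self._variables[name]


# Layout before the Feb PRs: field_info in the top-level attrs.
old_layout = FakeMSv4({"field_info": {"field_id": 0}}, {"VISIBILITY": FakeVariable({})})
# Layout after the Feb PRs: field_info in the data variable attrs.
new_layout = FakeMSv4({}, {"VISIBILITY": FakeVariable({"field_info": {"field_id": 0}})})

print(locate_field_info(old_layout, "VISIBILITY")[0])
print(locate_field_info(new_layout, "VISIBILITY")[0])
```

A result of "top_level" would suggest the store was written by a converter that pre-dates those PRs.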
The dataset was produced and uploaded to S3 at the end of March, well after those PRs had merged. It's possible that an old release was used to run the conversion, but I've been using an editable local installation, so it should have been recent enough to contain those changes. It is not hard to replace the contents of the bucket with a freshly converted MS. I'll try that and see if it fixes the problem.
@amcnicho fixed the dataset, and ps.summary() now works again.
ps.summary() fails with KeyError: 'field_info'
import pandas as pd

pd.options.display.max_colwidth = 100
ps_store = "s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr"

from xradio.vis.read_processing_set import read_processing_set

intents = ["OBSERVE_TARGET#ON_SOURCE"]
fields = None
ps = read_processing_set(
    ps_store="s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr",
    intents=intents,
    fields=fields,
)
display(ps.summary())
KeyError                                  Traceback (most recent call last)
Cell In[35], line 15
      9 fields = None
     10 ps = read_processing_set(
     11     ps_store="s3://viper-test-data/Antennae_North.cal.lsrk.split.vis.zarr",
     12     intents=intents,
     13     fields=fields,
     14 )
---> 15 display(ps.summary())

File /opt/conda/lib/python3.10/site-packages/xradio/vis/_processing_set.py:19, in (self, data_group)
     17     return self.meta["summary"][data_group]
     18 else:
---> 19     self.meta["summary"][data_group] = self._summary(data_group)
     20 return self.meta["summary"][data_group]

File /opt/conda/lib/python3.10/site-packages/xradio/vis/_processing_set.py:64, in processing_set._summary(self, data_group)
     59 data_name = value.attrs["data_groups"][data_group]["spectrum"]
     61 summary_data["shape"].append(value[data_name].shape)
     63 summary_data["field_id"].append(
---> 64     value[data_name].attrs["field_info"]["field_id"]
     65 )
     66 summary_data["field_name"].append(
     67     value[data_name].attrs["field_info"]["name"]
     68 )
     69 summary_data["start_frequency"].append(value["frequency"].values[0])

KeyError: 'field_info'