casangi / xradio

Xarray Radio Astronomy Data IO
https://xradio.readthedocs.io/en/latest/
Other
9 stars 5 forks source link

Review PS.summary and Fix PS selection #213

Closed Jan-Willem closed 3 weeks ago

Jan-Willem commented 1 month ago

Review Instructions

Please review the MSv4 processing_set class https://github.com/casangi/xradio/blob/main/src/xradio/vis/_processing_set.py and go over the review tutorial reviews/review_ps.ipynb .

The processing set is a loose collection of MSv4 which might come from multiple MSv2 (or ASDMS). Consequently, arbitrary ids are avoided in favor of descriptive strings.

Run the notebook using:

Key Questions to Answer

1) Is there additional information to display in the summary table? 2) Are the docstrings sufficient? 3) Are there missing data selection use cases? 4) ...

Environment instructions

It is recommended to use the conda environment manager to create a clean, self-contained runtime where xradio and all its dependencies can be installed:

conda create --name xradio python=3.11 --no-default-packages
conda activate xradio

Clone the repository, checkout the review branch and do a local install:

git clone https://github.com/casangi/xradio.git
git checkout 213-fix-ps-selection
cd xradio
pip install -e .

On macOS it is required to pre-install python-casacore using bash conda install -c conda-forge python-casacore.

pford commented 1 month ago
min_freq = min(ps.summary()['start_frequency'])
ps.sel(start_frequency=min_freq).summary()
Traceback:
...
  File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 208, in <lambda>
    return df[df[col].apply(lambda x: any(i in x for i in input_strings))]
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 208, in <genexpr>
    return df[df[col].apply(lambda x: any(i in x for i in input_strings))]
                                          ^^^^^^
TypeError: argument of type 'float' is not iterable

But it works with a slice: ps.sel(start_frequency=slice(min_freq, min_freq)).summary()

ps.sel(name='ALMA_uid___A002_X1003af4_X75a3.split.avg_1').summary()['name']
0     ALMA_uid___A002_X1003af4_X75a3.split.avg_13
1     ALMA_uid___A002_X1003af4_X75a3.split.avg_19
2      ALMA_uid___A002_X1003af4_X75a3.split.avg_1
3     ALMA_uid___A002_X1003af4_X75a3.split.avg_14
4     ALMA_uid___A002_X1003af4_X75a3.split.avg_10
5     ALMA_uid___A002_X1003af4_X75a3.split.avg_17
6     ALMA_uid___A002_X1003af4_X75a3.split.avg_15
7     ALMA_uid___A002_X1003af4_X75a3.split.avg_11
8     ALMA_uid___A002_X1003af4_X75a3.split.avg_18
9     ALMA_uid___A002_X1003af4_X75a3.split.avg_16
10    ALMA_uid___A002_X1003af4_X75a3.split.avg_12

but could be useful in other ways:

ps.sel(obs_mode='OBSERVE_TARGET').summary()['obs_mode']
0      OBSERVE_TARGET#ON_SOURCE
1     OBSERVE_TARGET#OFF_SOURCE
etc. (75 matches)

But a similar field_name selection fails (there is no Sun_10_1, but based on the previous examples would have expected fields Sun_10_11, Sun_10_12, Sun_10_13, etc.):

ps.sel(field_name='Sun_10_1').summary()
Empty DataFrame
pford commented 1 month ago

Not a selection issue:

pford commented 1 month ago
pford commented 1 month ago

Re: previous comment about Neptune_1 field name:

I uploaded the 2012 ALMA MS to uid_A002_X49990aX1f.ms to http://www.aoc.nrao.edu/~pford/uidA002_X49990a_X1f.ms.tar.gz (1.7G). I cannot find any duplicate field names, and the similar names (SPTxxxx) already have numbers appended (SPTxxxx-yy), probably to ensure uniqueness. There is only one "Neptune". The converter added an additional number, so the ps field names are:

ps.summary()['field_name'] 0 [J2253+161; 3c454.3_0] 1 [Neptune_1] 2 [J2357-5311_2] 3 [J0133-5200_16] 4 [J2253+161; 3c454.3_0] ...
337 [SPT0109-47_19] 338 [SPT0113-46_20] 339 [SPT0125-47_21] 340 [SPT0125-50_22] 341 [SPT0128-51_23]

Jan-Willem commented 1 month ago

In the latest commit:

The review_ps.ipynb has been updated.

Some of the stakeholder tests will fail because the schema has changed. Once we have agreed on the changes I will upload a new version of "s3://viper-test-data/Antennae_North.cal.lsrk.split.v5.vis.zarr" and "Antennae_North.cal.lsrk.split.vis.zarr".

@pford can you please check that the sel/queries you tested work and select a set of sel/query operations that we should add to the stakeholder tests?

pford commented 1 month ago

I will look at ms_sel and ms_isel a bit more. These are not in the review notebook.

pford commented 1 month ago

Summary fails if you ms select a single frequency. Then xds.frequency.values is a scalar not an array. (Selecting an spw with 1 frequency is an array and summary works.)

ps.sel(spw_name=ps.summary().spw_name[0]).ms_isel(frequency=0).summary() Traceback (most recent call last): File "", line 1, in File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 41, in summary self.meta["summary"][data_group] = self._summary(data_group).sort_values( ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 129, in _summary summary_data["start_frequency"].append(value["frequency"].values[0])


IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed

Use case: to do another selection later on the returned ps.

ps.sel(spw_name=ps.summary().spw_name[0]).ms_isel(frequency=0).sel(field_name='Sun_10_29') (same IndexError)

Jan-Willem commented 4 weeks ago

@pford pushed a fix so that summary() works after ms_sel for a single frequency.