Jan-Willem commented 1 month ago

[x] Fix Antennae_North.cal.lsrk.vis.zarr spwname to be "spw" + str(spw_id). The spw_name in the ms is empty.
[x] Fix ps sel, isel, ms_sel, ms_isel to work with latest version of schema.
[x] Add ps documentationn.
[ ] Review

Review Instructions

Please review the MSv4 processing_set class https://github.com/casangi/xradio/blob/main/src/xradio/vis/_processing_set.py and go over the review tutorial reviews/review_ps.ipynb .

The processing set is a loose collection of MSv4 which might come from multiple MSv2 (or ASDMS). Consequently, arbitrary ids are avoided in favor of descriptive strings.

Run the notebook using:

partition_scheme=['FIELD_ID'] Default behavior. and
partition_scheme=[] Used for VLASS rapid OTF (combines fields).

Key Questions to Answer

1) Is there additional information to display in the summary table? 2) Are the docstrings sufficient? 3) Are there missing data selection use cases? 4) ...

Environment instructions

It is recommended to use the conda environment manager to create a clean, self-contained runtime where xradio and all its dependencies can be installed:

conda create --name xradio python=3.11 --no-default-packages
conda activate xradio

Clone the repository, checkout the review branch and do a local install:

git clone https://github.com/casangi/xradio.git
git checkout 213-fix-ps-selection
cd xradio
pip install -e .

On macOS it is required to pre-install python-casacore using bash conda install -c conda-forge python-casacore.

pford commented 1 month ago

The docstring indicates you can select by column name and value, but you cannot select by float value:

min_freq = min(ps.summary()['start_frequency'])
ps.sel(start_frequency=min_freq).summary()
Traceback:
...
  File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 208, in <lambda>
    return df[df[col].apply(lambda x: any(i in x for i in input_strings))]
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 208, in <genexpr>
    return df[df[col].apply(lambda x: any(i in x for i in input_strings))]
                                          ^^^^^^
TypeError: argument of type 'float' is not iterable

But it works with a slice: ps.sel(start_frequency=slice(min_freq, min_freq)).summary()

Selecting by string is not exact match:

ps.sel(name='ALMA_uid___A002_X1003af4_X75a3.split.avg_1').summary()['name']
0     ALMA_uid___A002_X1003af4_X75a3.split.avg_13
1     ALMA_uid___A002_X1003af4_X75a3.split.avg_19
2      ALMA_uid___A002_X1003af4_X75a3.split.avg_1
3     ALMA_uid___A002_X1003af4_X75a3.split.avg_14
4     ALMA_uid___A002_X1003af4_X75a3.split.avg_10
5     ALMA_uid___A002_X1003af4_X75a3.split.avg_17
6     ALMA_uid___A002_X1003af4_X75a3.split.avg_15
7     ALMA_uid___A002_X1003af4_X75a3.split.avg_11
8     ALMA_uid___A002_X1003af4_X75a3.split.avg_18
9     ALMA_uid___A002_X1003af4_X75a3.split.avg_16
10    ALMA_uid___A002_X1003af4_X75a3.split.avg_12

but could be useful in other ways:

ps.sel(obs_mode='OBSERVE_TARGET').summary()['obs_mode']
0      OBSERVE_TARGET#ON_SOURCE
1     OBSERVE_TARGET#OFF_SOURCE
etc. (75 matches)

But a similar field_name selection fails (there is no Sun_10_1, but based on the previous examples would have expected fields Sun_10_11, Sun_10_12, Sun_10_13, etc.):

ps.sel(field_name='Sun_10_1').summary()
Empty DataFrame

pford commented 1 month ago

Not a selection issue:

read_processing_set with invalid path (python 3.11)

ps = read_processing_set('ALMA_uid___A002_X1003af4_X75a3.split.avg.zarr')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/groot/casa/github/xradio/src/xradio/vis/read_processing_set.py", line 28, in read_processing_set
file_system, ms_store_list = _get_ms_stores_and_file_system(ps_store)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/groot/casa/github/xradio/src/xradio/_utils/zarr/common.py", line 39, in _get_ms_stores_and_file_system
raise (
TypeError: exceptions must derive from BaseException

Whereas convert_msv2 shows RuntimeError: Table ALMA_uid___A002_X1003af4_X75a3.split.avg.ms does not exist

pford commented 1 month ago

Would it be easier/more efficient to read polarization for the summary from partition_info['polarization_setup'] instead of polarization coordinate values (should be the same, right?) -- https://github.com/casangi/xradio/blob/213-fix-ps-selection/src/xradio/vis/_processing_set.py#L95
It looks like ids are appended to spw_name and field_name even if these names are set and unique in the msv2. So you can't select e.g. field_name="Neptune" because it is now "Neptune_1" in msv4. Is this expected?

pford commented 1 month ago

Re: previous comment about Neptune_1 field name:

I uploaded the 2012 ALMA MS to uid_A002_X49990aX1f.ms to http://www.aoc.nrao.edu/~pford/uidA002_X49990a_X1f.ms.tar.gz (1.7G). I cannot find any duplicate field names, and the similar names (SPTxxxx) already have numbers appended (SPTxxxx-yy), probably to ensure uniqueness. There is only one "Neptune". The converter added an additional number, so the ps field names are:

ps.summary()['field_name'] 0 [J2253+161; 3c454.3_0] 1 [Neptune_1] 2 [J2357-5311_2] 3 [J0133-5200_16] 4 [J2253+161; 3c454.3_0] ...
337 [SPT0109-47_19] 338 [SPT0113-46_20] 339 [SPT0125-47_21] 340 [SPT0125-50_22] 341 [SPT0128-51_23]

Jan-Willem commented 1 month ago

In the latest commit:

Sel by a number should now work.
Added string_exact_match parameter. When true it will only select a row if there is an exact match. If false then Sun_10 will select Sun_10_1, Sun_2, etc (uses sel in row_val).
read_processing_set should give the correct error if PS is not found.
Added scan number and line name to summary.

The review_ps.ipynb has been updated.

Some of the stakeholder tests will fail because the schema has changed. Once we have agreed on the changes I will upload a new version of "s3://viper-test-data/Antennae_North.cal.lsrk.split.v5.vis.zarr" and "Antennae_North.cal.lsrk.split.vis.zarr".

@pford can you please check that the sel/queries you tested work and select a set of sel/query operations that we should add to the stakeholder tests?

pford commented 1 month ago

ps.sel for string and string list, with and without exact match - ok
ps.sel for int, int list - ok
ps.sel for float: summary shows number as 2.299609e+11 but frequency is actually 229960937500.0. So if you select start_frequency=2.299609e+11 it fails (not surprisingly, really). I am not sure there is a use case for this; since the frequencies describe the spectral window, just select the spectral window or use a query. Edit: min() and max() return a float which can be selected.

I will look at ms_sel and ms_isel a bit more. These are not in the review notebook.

pford commented 1 month ago

Summary fails if you ms select a single frequency. Then xds.frequency.values is a scalar not an array. (Selecting an spw with 1 frequency is an array and summary works.)

ps.sel(spw_name=ps.summary().spw_name[0]).ms_isel(frequency=0).summary() Traceback (most recent call last): File "", line 1, in File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 41, in summary self.meta["summary"][data_group] = self._summary(data_group).sort_values( ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/groot/casa/github/xradio/src/xradio/vis/_processing_set.py", line 129, in _summary summary_data["start_frequency"].append(value["frequency"].values[0])
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed

Use case: to do another selection later on the returned ps.

ps.sel(spw_name=ps.summary().spw_name[0]).ms_isel(frequency=0).sel(field_name='Sun_10_29') (same IndexError)

Jan-Willem commented 4 weeks ago

@pford pushed a fix so that summary() works after ms_sel for a single frequency.

casangi / xradio

Review PS.summary and Fix PS selection #213

Review Instructions

Key Questions to Answer

Environment instructions