casangi / xradio

Xarray Radio Astronomy Data IO
https://xradio.readthedocs.io/en/latest/

Use of string arrays in the schema (especially unicode) #294

Open sstansill opened 1 month ago

sstansill commented 1 month ago

For next-generation telescopes (SKA, ngVLA), the zarr-python library likely won't provide a fast enough interface to MSv4 datasets. Instead, libraries written in lower-level languages will be used, and the MSv4 schema should be compatible with them. In particular, the SKAO has used Google's TensorStore to prototype MSv4 support in WSClean.

The problem is that arrays with unicode data types aren't supported by any of the C/C++ Zarr implementations listed at https://zarr.dev/implementations/. I therefore propose that null-terminated byte sequences ("<S") be used in place of unicode ("<U") data types for arrays. There are 59 instances of unicode dtypes in v4.0.0 of the schema.
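To illustrate the difference, here is a minimal NumPy sketch. The array contents are made up for illustration and are not taken from the schema. It casts a fixed-width unicode array to fixed-width bytes and compares the storage per element:

```python
import numpy as np

# Illustrative labels only; NumPy infers a fixed-width unicode dtype ("U2").
labels_unicode = np.array(["XX", "XY", "YX", "YY"])

# Cast to fixed-width byte strings. This assumes the labels are ASCII-only.
labels_bytes = labels_unicode.astype("S2")

print(labels_unicode.dtype.kind, labels_unicode.itemsize)  # U 8 (4 bytes per character)
print(labels_bytes.dtype.kind, labels_bytes.itemsize)      # S 2 (1 byte per character)
print(labels_unicode.nbytes, labels_bytes.nbytes)          # 32 8
```

Note that NumPy reports the byte-string dtype as "|S2" rather than "<S2", since byte order does not apply to single-byte elements.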

Additionally, variable- or unknown-length strings ("<U0" and "<S0") should be avoided wherever possible, both to reduce the amount of data stored on disk and to speed up opening a dataset: all coordinates are read eagerly, and variable-length strings are slower to parse. For example, the polarization coordinate should have dtype "<S2". For the coordinates baseline_antenna1_name and baseline_antenna2_name, it may be best to revert to integer arrays. The name corresponding to an antenna index can be any length, which leads to larger metadata and more verbose code. The long-format antenna names should be reserved for AntennaXds.
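A rough sketch of the integer-index idea, again in plain NumPy. The antenna names and the `antenna_names` lookup array are hypothetical stand-ins for whatever AntennaXds would hold, not the actual schema layout:

```python
import numpy as np

# Hypothetical antenna name table (standing in for AntennaXds).
antenna_names = np.array(["ea01", "ea02", "ea03"])

# Per-baseline name coordinates as they might be stored today.
baseline_antenna1_name = np.array(["ea01", "ea01", "ea02"])
baseline_antenna2_name = np.array(["ea02", "ea03", "ea03"])

# Replace each name by its index into the antenna table.
name_to_index = {name: i for i, name in enumerate(antenna_names.tolist())}
baseline_antenna1_id = np.array(
    [name_to_index[n] for n in baseline_antenna1_name.tolist()], dtype=np.int32
)
baseline_antenna2_id = np.array(
    [name_to_index[n] for n in baseline_antenna2_name.tolist()], dtype=np.int32
)

# Names can still be recovered on demand with a single fancy-indexing lookup.
recovered_antenna1_name = antenna_names[baseline_antenna1_id]

print(baseline_antenna1_name.nbytes, baseline_antenna1_id.nbytes)  # 48 12
```

The integer coordinates have a fixed element size regardless of how long the antenna names are, and the full names are stored only once.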