casangi / xradio

Xarray Radio Astronomy Data IO
https://xradio.readthedocs.io/en/latest/

Inconsistent chunking of coordinates #215

Open maneesh29s opened 1 month ago

maneesh29s commented 1 month ago

We have simulated MSv2 data that we use for testing on our workstations. The data has the following dimensions:

Time: 120, Baseline: 130,816, Channels: 150, Polarizations: 1 (XX)

The overall size of the data is around 16 GB.

We convert this MSv2 data to a processing set with the default partition scheme, specifying main_chunksize={"frequency": 1}. We read the converted data using read_processing_set and store it in the ps variable. (We also run ps = ps.get(0) to read the partition.)

In ps, the VISIBILITY DataArray looks as follows:

In [79]: ps.VISIBILITY
Out[79]:
<xarray.DataArray 'VISIBILITY' (time: 120, baseline_id: 130816, frequency: 150,
                                polarization: 1)> Size: 19GB
dask.array<open_dataset-VISIBILITY, shape=(120, 130816, 150, 1), dtype=complex64, chunksize=(120, 130816, 1, 1), chunktype=numpy.ndarray>
Coordinates:
    baseline_antenna1_id  (baseline_id) int32 523kB dask.array<chunksize=(65408,), meta=np.ndarray>
    baseline_antenna2_id  (baseline_id) int32 523kB dask.array<chunksize=(65408,), meta=np.ndarray>
  * baseline_id           (baseline_id) int64 1MB 0 1 2 ... 130813 130814 130815
  * frequency             (frequency) float64 1kB 1.425e+08 ... 1.574e+08
  * polarization          (polarization) <U2 8B 'XX'
  * time                  (time) float64 960B 9.467e+08 9.467e+08 ... 9.467e+08
Attributes:
    type:                  quanta
    units:                 ['unkown']
    field_and_source_xds:  <xarray.Dataset> Size: 52B\nDimensions: 

In the above output, the chunks on the VISIBILITY data are as expected, but the coordinates baseline_antenna1_id and baseline_antenna2_id are also chunked along the baseline_id dimension (chunksize 65408), which was not specified during the conversion.

Because of this inconsistency, reading xarray's chunksizes attribute on any of the DataArrays inside ps fails:
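To see which variables carry their own chunking, it can help to list the dask chunks of the data and of every coordinate side by side. The following is a minimal sketch on a small synthetic array, not on our actual processing set: the sizes and the helper name chunk_report are made up for illustration, and only plain xarray and dask calls are used.

```python
import dask.array as da
import xarray as xr

# Small synthetic stand-in for VISIBILITY (sizes are hypothetical).
vis = xr.DataArray(
    da.zeros((4, 8, 3), chunks=(4, 8, 1), dtype="complex64"),
    dims=("time", "baseline_id", "frequency"),
    coords={
        # Non-index coordinate, deliberately chunked differently from the data.
        "baseline_antenna1_id": ("baseline_id", da.arange(8, chunks=4)),
    },
    name="VISIBILITY",
)


def chunk_report(arr):
    """Return {name: chunks} for the data and each coordinate (None = not chunked)."""
    report = {arr.name: arr.variable.chunks}
    for name, coord in arr.coords.items():
        report[name] = coord.variable.chunks
    return report


report = chunk_report(vis)
for name, chunks in report.items():
    print(name, chunks)
```

Here the data is a single chunk along baseline_id while the coordinate is split in two, which mirrors the mismatch described above.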

In [85]: ps.VISIBILITY.chunksizes
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[85], line 1
----> 1 ps.VISIBILITY.chunksizes

File /opt/miniconda3/envs/xradio/lib/python3.11/site-packages/xarray/core/dataarray.py:1335, in DataArray.chunksizes(self)
   1320 """
   1321 Mapping from dimension names to block lengths for this dataarray's data, or None if
   1322 the underlying data is not a dask array.
   (...)
   1332 xarray.unify_chunks
   1333 """
   1334 all_variables = [self.variable] + [c.variable for c in self.coords.values()]
-> 1335 return get_chunksizes(all_variables)

File /opt/miniconda3/envs/xradio/lib/python3.11/site-packages/xarray/core/common.py:2055, in get_chunksizes(variables)
   2053         for dim, c in v.chunksizes.items():
   2054             if dim in chunks and c != chunks[dim]:
-> 2055                 raise ValueError(
   2056                     f"Object has inconsistent chunks along dimension {dim}. "
   2057                     "This can be fixed by calling unify_chunks()."
   2058                 )
   2059             chunks[dim] = c
   2060 return Frozen(chunks)

ValueError: Object has inconsistent chunks along dimension baseline_id. This can be fixed by calling unify_chunks().
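The same error can be reproduced without xradio, which suggests it comes purely from the coordinate being chunked differently from the data. A minimal sketch, assuming a small synthetic array (the sizes are hypothetical and far smaller than our dataset):

```python
import dask.array as da
import xarray as xr

# Data: one chunk along baseline_id, one channel per chunk along frequency.
# Coordinate: two chunks along baseline_id, as in the output above.
vis = xr.DataArray(
    da.zeros((4, 8, 3), chunks=(4, 8, 1), dtype="complex64"),
    dims=("time", "baseline_id", "frequency"),
    coords={
        "baseline_antenna1_id": ("baseline_id", da.arange(8, chunks=4)),
    },
    name="VISIBILITY",
)

try:
    vis.chunksizes  # checks the data AND all coordinates for consistency
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```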

If unify_chunks() is called on the VISIBILITY data, the resulting chunks are not as expected (see the baseline_id dimension below):

In [86]: ps.VISIBILITY.unify_chunks().chunksizes
Out[86]: Frozen({'time': (120,), 'baseline_id': (65408, 65408), 'frequency': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'polarization': (1,)})
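Two possible workarounds on the user side, sketched on a small synthetic array (sizes are hypothetical): read the chunks from the data variable only, or load the small coordinates into memory so that they are no longer dask arrays. This is only an illustration of the idea and not a fix inside xradio.

```python
import dask.array as da
import xarray as xr

vis = xr.DataArray(
    da.zeros((4, 8, 3), chunks=(4, 8, 1), dtype="complex64"),
    dims=("time", "baseline_id", "frequency"),
    coords={
        "baseline_antenna1_id": ("baseline_id", da.arange(8, chunks=4)),
    },
    name="VISIBILITY",
)

# Workaround 1: look only at the underlying dask array, ignoring coordinates.
data_chunks = vis.data.chunks

# Workaround 2: replace the chunked coordinate with an in-memory numpy copy.
# The coordinate is small, so loading it is cheap; the data keeps its chunks.
fixed = vis.assign_coords(
    baseline_antenna1_id=("baseline_id", vis.baseline_antenna1_id.values)
)
fixed_chunksizes = dict(fixed.chunksizes)

print(data_chunks)
print(fixed_chunksizes)
```

Unlike unify_chunks(), neither approach rechunks the VISIBILITY data along baseline_id.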

maneesh29s commented 1 month ago

We have faced this issue since we started experimenting with xradio (v0.0.28), and it still persists in v0.0.31. We can't use v0.0.33 or later because of the conversion issue that I raised in #214.