casangi / xradio

Xarray Radio Astronomy Data IO

High LOFAR Memory Consumption for convert_msv2_to_processing_set #151

Closed · Jan-Willem closed this 2 months ago

sstansill commented 3 months ago

When using xradio to convert the LOFAR measurement set "L795830_SB001_uv.MS" (17.3GiB, available upon request), the peak memory usage by xradio is 144.1 GiB (see attached image).

[image: xradio_mem_profiling — memory profile of the conversion]

The issue can be traced back to the casacore tables method .getcol() (line 622 of xradio/src/xradio/vis/_vis_utils/_ms/_tables/read.py as of release v0.0.25). On disk, the DATA column of this measurement set is stored as complex16, but it is cast to complex64 when read because complex64 is the smallest complex type NumPy supports (see https://github.com/numpy/numpy/issues/14753). Other columns are similarly upcast (WEIGHT goes from float16 to float32). It looks like this can't be avoided until NumPy implements lower-precision complex types.
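A minimal sketch of the effect (assuming python-casacore is installed; the MS name stands in for any LOFAR measurement set with a low-precision DATA column):

```python
# Hedged sketch: inspect the dtype that getcol() actually hands back.
# Only a small row slice is read to keep memory bounded.
from casacore.tables import table

with table("L795830_SB001_uv.MS", readonly=True) as t:
    data = t.getcol("DATA", startrow=0, nrow=1000)
    # NumPy has no complex32/complex16, so this prints complex64 even
    # though the storage manager may hold lower-precision values on disk.
    print(data.dtype)
    print(data.nbytes / 2**20, "MiB for 1000 rows")
```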

This problem isn't present in VLA measurement sets because their DATA is already stored as complex64 on disk.

sstansill commented 3 months ago

A related issue is https://github.com/casacore/python-casacore/issues/130: the casacore table method getcol() returns wrong values when querying large numbers of rows. The following comment from that issue provides some useful information:

...the failure mode depends on chosen chunksize:

- <=45000 rows: array is completely filled (with the correct data, I will optimistically assume)
- 47500-50000 rows: array is unfilled from row 135420 onwards (Numerology alert! 135420 is exactly the number of rows per scan here!)
- 100000 rows: array is unfilled from row 167232 onwards (not a meaningful number to me...)

@tammojan do you have any insight into this one? It'd be good to know whether or not xradio is exposed to this bug; one way to check is sketched below.
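A hedged sanity test, not from the linked thread: compare one full-column getcol() against chunked reads of the same column and assert they agree. This assumes the full DATA column fits in memory for the test MS.

```python
# Sanity check against python-casacore#130: one big read vs. chunked reads.
import numpy as np
from casacore.tables import table

with table("L795830_SB001_uv.MS", readonly=True) as t:
    whole = t.getcol("DATA")                   # single large read
    for start in range(0, t.nrows(), 45_000):  # chunks in the reported "safe" range
        n = min(45_000, t.nrows() - start)
        chunk = t.getcol("DATA", startrow=start, nrow=n)
        assert np.array_equal(chunk, whole[start:start + n]), f"mismatch at row {start}"
```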

sstansill commented 2 months ago

From a private conversation between Tammo Jan and myself: the getcol() and getcolnp() methods should only be used to load small chunks of a column into memory, not an entire column at once (which is what happens for beamformer array data with a single scan, field, and intent). The expected usage pattern is instead to use iterators (http://casacore.github.io/python-casacore/casacore_tables.html#casacore.tables.tableiter), as sketched below.
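A minimal sketch of that iterator pattern; grouping by TIME is my assumption, any column (or set of columns) can be used:

```python
# Hedged sketch of the recommended pattern: iterate over row groups
# instead of calling getcol() on the whole column at once.
from casacore.tables import table

with table("L795830_SB001_uv.MS", readonly=True) as t:
    for sub in t.iter(["TIME"]):    # yields one sub-table per TIME value
        chunk = sub.getcol("DATA")  # one timestep's rows at a time
        # ... process `chunk`, then let it be garbage collected
```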

In the case of the xradio method read_col_conversion(), something similar to the following approach would be the easiest workaround to implement (https://github.com/ratt-ru/CubiCal/blob/dfc504cd8c653dc935f7c31df55353ed398687f9/cubical/data_handler/ms_data_handler.py#L830): incrementally populate preallocated numpy arrays.
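A sketch of what that could look like; the function name read_col_chunked and the chunk size are hypothetical, the point is the in-place fill via getcolnp():

```python
# Hedged sketch of the CubiCal-style workaround: preallocate the output
# array once, then fill it in place chunk by chunk with getcolnp().
import numpy as np
from casacore.tables import table

def read_col_chunked(tbl, colname, chunk_rows=10_000):
    nrows = tbl.nrows()
    probe = tbl.getcol(colname, startrow=0, nrow=1)  # shape/dtype probe
    out = np.empty((nrows,) + probe.shape[1:], dtype=probe.dtype)
    for start in range(0, nrows, chunk_rows):
        n = min(chunk_rows, nrows - start)
        # getcolnp() writes directly into the supplied array slice,
        # avoiding the temporaries a whole-column getcol() would create.
        tbl.getcolnp(colname, out[start:start + n], startrow=start, nrow=n)
    return out

with table("L795830_SB001_uv.MS", readonly=True) as t:
    data = read_col_chunked(t, "DATA")
```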

sstansill commented 2 months ago

After further investigation, it looks like one of the primary reasons for the large memory usage is that this measurement set was Dysco compressed (https://github.com/aroffringa/dysco). Until the converter has an implementation for larger-than-memory DDIs (i.e. most measurement sets from beamformer arrays like LOFAR), there isn't a way around this.
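A back-of-envelope check (my own sketch, not part of the converter) that relates the Dysco-compressed on-disk size to the decompressed in-memory footprint:

```python
# Hedged estimate of the in-memory size of a fully decompressed DATA
# column: Dysco shrinks the on-disk footprint, but getcol() must return
# fully decompressed complex64 values.
import numpy as np
from casacore.tables import table

with table("L795830_SB001_uv.MS", readonly=True) as t:
    per_row = t.getcol("DATA", startrow=0, nrow=1).shape[1:]  # (nchan, npol)
    nbytes = t.nrows() * int(np.prod(per_row)) * np.dtype(np.complex64).itemsize
    print(f"DATA decompressed: {nbytes / 2**30:.1f} GiB")
```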