casangi / astrohack

Antenna panel and position corrections.
https://astrohack.readthedocs.io/
GNU General Public License v3.0

Extract_holog may fail due to excessive memory usage #262

Open vsmagalhaes opened 4 weeks ago

vsmagalhaes commented 4 weeks ago

With the dataset /lustre/aoc/sciops/pbeaklin/holography/STEP/DEC2022/THOL0001.sb43101323.eb43107363.59929.12607288195.ms, extract_holog may fail due to excessive memory consumption when all antennas and DDIs are processed. This might be a client issue or an astrohack issue. The problem appeared when reading the MS from Lustre and writing the .holog.zarr back to Lustre.

vsmagalhaes commented 3 weeks ago

After some testing, the problem appears to lie at the intersection of dask and Lustre. The table below summarizes the tests I ran on my desktop.

|-----------+----------+---------+------|
| Mode      | Client   | Storage | Time |
|-----------+----------+---------+------|
| Serial    | no       | local   | 0:40 |
| Serial    | not used | local   | 0:40 |
| Parallel  | yes      | local   | 0:50 |
| Serial    | not used | Lustre  | 2:40 |
| Parallel  | yes      | Lustre  |  INF |
|-----------+----------+---------+------|

The main conclusion is that, for large enough datasets, the best course of action seems to be to avoid reducing the dataset on Lustre and to use the local disk instead. There also appears to be a problem with dask itself, since even on local disk the parallel reduction of this file takes longer than the serial one (0:50 versus 0:40).
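
As a rough illustration of the local-disk workaround, here is a minimal sketch that stages the input on local scratch, runs the reduction there, and copies the product back to shared storage in one pass. It uses only the Python standard library. The helper name `run_on_local_scratch` and the `process` callable are hypothetical; `process` stands in for whatever wrapper you write around extract_holog, and nothing here is part of astrohack itself.

```python
import shutil
import tempfile
from pathlib import Path


def run_on_local_scratch(remote_input, remote_output, process, scratch_root=None):
    """Stage a directory-based dataset on local disk, process it there,
    then copy the result back to shared storage.

    `process(local_input, local_output)` is a hypothetical callable that
    stands in for the actual reduction step; it must create `local_output`.
    `scratch_root` is the local directory to stage under (None = system default).
    """
    remote_input = Path(remote_input)
    remote_output = Path(remote_output)

    # The temporary directory is removed automatically when the block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_input = Path(scratch) / remote_input.name
        local_output = Path(scratch) / remote_output.name

        # An .ms is a directory tree, so copy it recursively to local disk.
        shutil.copytree(remote_input, local_input)

        # All reads and writes during processing now hit the local disk only.
        process(local_input, local_output)

        # A .zarr product is also a directory tree; copy it back in one pass.
        shutil.copytree(local_output, remote_output, dirs_exist_ok=True)

    return remote_output
```

This needs Python 3.8 or later for `dirs_exist_ok`, and the scratch area must have room for both the input and the output. It only sidesteps the Lustre slowdown; it does not address the slower parallel run seen on local disk.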