vsmagalhaes opened 2 months ago
After some testing, the problem appears to lie at the intersection of Dask and Lustre. The table below summarizes the tests run on my desktop.
| Execution | Dask client | Storage | Wall time            |
|-----------+-------------+---------+----------------------|
| Serial    | no          | local   | 0:40                 |
| Serial    | not used    | local   | 0:40                 |
| Parallel  | yes         | local   | 0:50                 |
| Serial    | not used    | Lustre  | 2:40                 |
| Parallel  | yes         | Lustre  | INF (did not finish) |
The main conclusion is that, for large enough datasets, the best course of action is to avoid reducing the dataset on Lustre and to use local disk instead. But there also appears to be a Dask problem, since for this particular file the parallel reduction takes longer than the serial one even on local disk.
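For reference, the serial and parallel runs in the table were set up roughly as follows. This is a minimal sketch: the import path, the `ms_name`/`holog_name`/`parallel` keywords, and the worker counts are assumptions from memory and may not match the installed astrohack version; the paths are placeholders.

```python
from dask.distributed import Client

from astrohack import extract_holog  # assumed import path

# Serial run: no Dask client, parallel processing disabled.
extract_holog(
    ms_name="/path/to/input.ms",              # placeholder path
    holog_name="/path/to/output.holog.zarr",  # placeholder path
    parallel=False,                            # assumed keyword
)

# Parallel run: start a local Dask client first, then enable parallel processing.
client = Client(n_workers=4, threads_per_worker=1)  # assumed sizing
extract_holog(
    ms_name="/path/to/input.ms",
    holog_name="/path/to/output.holog.zarr",
    parallel=True,
)
client.close()
```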
With this dataset:
/lustre/aoc/sciops/pbeaklin/holography/STEP/DEC2022/THOL0001.sb43101323.eb43107363.59929.12607288195.ms
extract_holog may fail due to excessive memory consumption when using all antennas and DDIs. This might be a Dask client issue or an astrohack issue. The failure appeared when reading the MS from Lustre and writing the .holog.zarr back to Lustre.
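One possible mitigation, assuming the memory blow-up comes from Dask workers collectively exhausting the node: start the client with an explicit worker count and per-worker memory limit before calling extract_holog, and write the .holog.zarr to local disk rather than Lustre. The sketch below uses the standard dask.distributed Client; the extract_holog keywords are assumptions as above, and the output path is a placeholder.

```python
from dask.distributed import Client

from astrohack import extract_holog  # assumed import path

# Bound total memory use: few workers, explicit per-worker memory limit,
# so a worker that grows too large is restarted instead of taking down the node.
client = Client(n_workers=2, threads_per_worker=1, memory_limit="16GB")

extract_holog(
    ms_name="/lustre/aoc/sciops/pbeaklin/holography/STEP/DEC2022/"
            "THOL0001.sb43101323.eb43107363.59929.12607288195.ms",
    holog_name="/local/scratch/THOL0001.holog.zarr",  # placeholder: write locally, copy to Lustre afterwards
    parallel=True,                                     # assumed keyword
)

client.close()
```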