caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline

MemoryError When Performing Selfcal with Cubical #1560

Closed: bazhaoyu closed this issue 8 months ago

bazhaoyu commented 8 months ago

When performing selfcal on 1.7 TB of MeerKAT data, I ran into a memory problem: the estimated memory usage exceeds the allowed percentage of system memory.

[screenshot: memory error message]

The log file is attached here: log-caracal.txt

I also encountered this kind of memory problem when using ragavi-vis (see https://github.com/caracal-pipeline/caracal/issues/1558). The dataset is around 1.7 TB, and the server/node used for processing has 32 CPU cores and ~123 GB of memory.

I'm also seeking advice on how to handle large MeerKAT data analysis. Would it be more effective to split the dataset into smaller pieces, or to use a fat-memory node?

paoloserra commented 8 months ago

In addition to my advice to average the data in frequency for the purpose of continuum science, you could also reduce the number of CPUs used.

bazhaoyu commented 8 months ago

> In addition to my advice to average the data in frequency for the purpose of continuum science, you could also reduce the number of CPUs used.

My first try used 1 CPU core and the default dist_max_chunks: 4; it took almost two days. Then I tried 32 CPU cores with dist_max_chunks: 1, which took one day. I'm sorry, but I have no idea how to balance the number of CPU cores against dist_max_chunks.

I would like to average the data in frequency to reduce the data size and then try self-calibration.
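
For what it's worth, a rough back-of-envelope model of CubiCal's footprint (my own simplification, not something stated in this thread or in the docs) is that peak memory scales with the number of chunks resident at once times the size of a single chunk:

$$M_{\mathrm{peak}} \approx \min(\texttt{dist\_max\_chunks},\ N_{\mathrm{workers}}) \times S_{\mathrm{chunk}}$$

Under that model, neither 1 CPU with dist_max_chunks: 4 nor 32 CPUs with dist_max_chunks: 1 helps if a single chunk is itself too large to fit in memory; shrinking the chunks themselves (in time and/or frequency) is what moves the needle, which is where the frequency averaging comes in.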

paoloserra commented 8 months ago

Nothing to be sorry about, this part of the pipeline is not easy to get right.

Did the run with 1 CPU and dist_max_chunks: 4 run out of memory?

bazhaoyu commented 8 months ago

Yes, it also ran out of memory.

[screenshot: memory error]

paoloserra commented 8 months ago

OK, let's see what freq averaging does!

JSKenyon commented 8 months ago

Hi! Just pitching in here: the time chunking looks quite large, i.e. you are loading 300 unique times per chunk when a single solution interval only requires 30. I would suggest setting the time chunking to 30 and then trying again with, say, dist_max_chunks: 16. I think your current setup ends up loading many very large chunks unnecessarily.
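
To make the arithmetic explicit (assuming, as a simplification, that per-chunk memory grows linearly with the number of timeslots per chunk, and taking the earlier dist_max_chunks: 4 run as the baseline):

$$4 \times 300 = 1200 \quad \text{vs.} \quad 16 \times 30 = 480$$

i.e. the suggested settings keep roughly 2.5 times fewer timeslots' worth of data resident, while spreading the work over four times as many concurrent chunks.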

paoloserra commented 8 months ago

You're right @JSKenyon .

In fact, @bazhaoyu, please set cal_timeslots_chunk: -1, as per the default. See https://caracal.readthedocs.io/en/latest/manual/workers/selfcal/index.html#cal-timeslots-chunk.

bazhaoyu commented 8 months ago

I am now running the analysis with chan_ave: 5 in the transform worker, and cal_timeslots_chunk: -1 and dist_max_chunks: 16 in the selfcal worker. I will let you know the results ASAP. Thanks very much!
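
For anyone following along, the settings described above would look roughly like this in the config file (a sketch only: the parameter names come from this thread, but the exact nesting is my assumption; check the transform and selfcal worker docs for your CARACal version):

```yaml
# Illustrative sketch, not a complete CARACal config.
transform:
  split_field:
    chan_ave: 5              # average 5 channels together (nesting assumed)

selfcal:
  cal_timeslots_chunk: -1    # time chunk = one solution interval (default)
  dist_max_chunks: 16        # max chunks processed concurrently
```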

bazhaoyu commented 8 months ago

I got the same memory error. This time the data size is only ~200 GB. I have attached the screenshot and log file:

[screenshot: memory error]

log-caracal.txt

Maybe I should try a smaller dist_max_chunks? Perhaps set it back to the default?

paoloserra commented 8 months ago

Yes, I think it's worth a try.

bazhaoyu commented 8 months ago

It works with the default values of cal_timeslots_chunk: -1 and dist_max_chunks: 4. Thank you for all your help!
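
In config terms, the combination that ultimately worked on the ~123 GB node (same caveat as above about exact nesting) was the chan_ave: 5 averaging in the transform worker plus the selfcal defaults:

```yaml
selfcal:
  cal_timeslots_chunk: -1    # default
  dist_max_chunks: 4         # default; 16 was too aggressive for this node
```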

paoloserra commented 8 months ago

Good, thanks for your patience!