caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0

ddcal worker fails due to a MemoryError #1582

Open a-benati opened 2 months ago

a-benati commented 2 months ago

Hello,

the ddcal worker fails with: MemoryError: Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.

I don't understand which parameters need to be modified to solve this problem. The number of worker processes (dist_nworker) is set to 0. I tried setting the data_chunkhours parameter to 0.01 instead of the default 0.05, but nothing seems to change.

Here is the log file where the error is encountered:

# INFO      01:08:53 - main               [0.8 11.8 0.0Gb] multi-process mode: 1+1 workers, --dist-nthread 1
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Detected a total of 503.77GiB of system memory.
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Per-solver (worker) memory use estimated at 789.68GiB: 156.75% of total system memory.
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Peak I/O memory use estimated at 571.53GiB: 113.45% of total system memory.
# INFO      01:08:53 - wisdom             [0.8 11.8 0.0Gb] Total peak memory usage estimated at 1361.20GiB: 270.20% of total system memory.
# INFO      01:08:53 - main               [0.8 11.8 0.0Gb] Exiting with exception: MemoryError(Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.)
#  Traceback (most recent call last):
#   File "/opt/venv/lib/python3.8/site-packages/cubical/main.py", line 548, in main
#     estimate_mem(ms, tile_list, GD["data"], GD["dist"])
#   File "/opt/venv/lib/python3.8/site-packages/cubical/data_handler/wisdom.py", line 89, in estimate_mem
#     raise MemoryError(
# MemoryError: Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.

I found the same problem in #1466, but trying to adjust the parameters dd_g_timeslots_int and dd_dd_timeslots_int does not seem to improve the situation (I tried with dd_g_timeslots_int: 16 and dd_dd_timeslots_int: 16 and with dd_g_timeslots_int: 4 and dd_dd_timeslots_int: 4).

Do you know how I can solve this problem?

Athanaseus commented 1 month ago

Hi @a-benati , thanks for reporting this.

Can you please share the full log? And does setting dist_nworker to 1 or 2 help?

Best regards

a-benati commented 1 month ago

Hi @Athanaseus,

thanks for your answer. Here is the full log: log-caracal.txt

Setting dist_nworker to a higher value actually makes the situation worse since the required memory increases.
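
For reference, the totals in the first log are consistent with total ≈ per-solver + peak I/O memory (789.68 + 571.53 ≈ 1361.20 GiB). Below is a rough back-of-the-envelope sketch of why more workers makes things worse, assuming the solver term scales roughly linearly with the number of workers (an illustrative assumption, not something taken from CubiCal itself):

# Back-of-the-envelope arithmetic using the estimates from the log above.
# The linear scaling with worker count is an assumption for illustration only.
per_solver_gib = 789.68  # "Per-solver (worker) memory use" from the log
peak_io_gib = 571.53     # "Peak I/O memory use" from the log
system_gib = 503.77      # detected system memory

for n_workers in (1, 2, 4):
    total = n_workers * per_solver_gib + peak_io_gib
    print(f"{n_workers} worker(s): ~{total:.1f} GiB "
          f"({100 * total / system_gib:.0f}% of system memory)")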

Athanaseus commented 1 month ago

Thanks @a-benati, by default the parameter is 0, meaning the entire data is loaded. I'm curious to see the log results of dist_nworker: 1 so that I can compare the requested memory. Regards

JSKenyon commented 1 month ago

I believe that the issue is the absence of time and frequency chunks in the input parameters. You will be working with extremely large chunks. I would suggest setting the input time and frequency chunks to match the solution interval of your DDE term in this case.

a-benati commented 1 month ago

Thanks @Athanaseus. Here is the log result of dist_nworker: 4, since I already have it and it would take ~9 hours to try with dist_nworker: 1. log-caracal_dist_nworker_4.txt

a-benati commented 1 month ago

@JSKenyon thanks for your answer. I agree with the fact that I need smaller time and frequency chunks, but I am not sure about the parameters to change: are they dd_dd_timeslots_int and dd_dd_chan_int? And what would be a fair value? Or, better, what do I have to inspect to understand which would be a fair value?

JSKenyon commented 1 month ago

This is where the options are set in the ddcal worker: https://github.com/caracal-pipeline/caracal/blob/2d338e2afc5ed8fc723be95e74b071890d4a7bed/caracal/workers/ddcal_worker.py#L330-L331

I am not much of a CARACal user, so I am not sure of the easiest way to adjust those parameters.

JSKenyon commented 1 month ago

In principle, for the parameters in the log you shared, data-time-chunk=4 would probably be ideal. Currently, it gets set to zero, which means each scan is treated as a single chunk. Let me know if you manage to give that a go and feel free to share further logs - I might be able to offer further insight.

a-benati commented 1 month ago

Thanks @JSKenyon. When dist_nworker: 0, both data-time-chunk and data-freq-chunk are set to 0. When I set dist_nworker: 4, I get data-time-chunk=100 and data-freq-chunk=0, but the memory error persists (you can check the log file above). Do you think I should set data-time-chunk=4 directly in the code of ddcal_worker.py?

JSKenyon commented 1 month ago

You could definitely give it a try and see if it resolves the issue. I see that you have 24 directions in your model - that is pretty extreme (your model will be 24 times larger than the associated visibilities). I would also suggest making your frequency solution interval something which divides 512 (the number of channels if I am not mistaken) e.g. 128. That should prevent some complications. Unfortunately, CubiCal (the underlying software package for the ddcal step) was never particularly light on memory.
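
A minimal sketch of what such a hand-edit in ddcal_worker.py might look like; the dictionary name and the omitted entries are placeholders, and only the "data-time-chunk"/"data-freq-chunk" keys and the values 4 and 128 come from this discussion:

# Hypothetical edit near the lines of caracal/workers/ddcal_worker.py linked above.
# "cubical_opts" and the omitted entries stand in for the worker's real option
# dictionary; only the two keys below are the ones discussed in this thread.
cubical_opts = {
    # ... other CubiCal options set by the worker ...
    "data-time-chunk": 4,    # timeslots per chunk, matching the DD solution interval
    "data-freq-chunk": 128,  # channels per chunk; 128 divides the 512 channels evenly
}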

a-benati commented 1 month ago

@JSKenyon thanks, I will try setting data-time-chunk=4 and data-freq-chunk=128 directly in the code and see what happens. I will keep you posted.

a-benati commented 1 month ago

@Athanaseus, @JSKenyon thanks. I think I solved that error, since the code now gets past the part where it was stuck before. However, I now get another error, which I believe is related to flagging in DDFacet (I think all the data are flagged). Here is the log file: log-caracal_new.txt

JSKenyon commented 1 month ago

It looks like the data has been almost completely flagged, possibly by CubiCal. You should probably check your flagging before and after that step. CubiCal is also very unhappy about the SNR in many of the directions. I would suggest looking at your image prior to DD calibration to make sure that all 24 of those directions really require DD solutions.
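
One quick way to compare the flag statistics before and after the CubiCal step, assuming CASA's casatasks package is available (the MS name is the one from this thread):

# Sketch: summarise the flag percentage with CASA's flagdata task.
# Run this before and after the ddcal step and compare the two numbers.
from casatasks import flagdata

summary = flagdata(vis="1685906777_sdp_l0.ms", mode="summary")
flagged, total = summary["flagged"], summary["total"]
print(f"Flagged: {flagged:.0f} of {total:.0f} visibilities ({100 * flagged / total:.1f}%)")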

a-benati commented 1 month ago

@JSKenyon thanks. I reduced the number of facets to 12, but I don't think I really need this many directions: I only have 3 or 4 very bright sources in my field which corrupt everything else. Do you think that reducing the number of facets to 4 or 6 could solve the flagging issue? Or is it a completely independent problem?

JSKenyon commented 1 month ago

Unfortunately I did not implement the DDFacet component of the visibility prediction, so I am not an expert. I think that you would likely need to edit the region file passed to CubiCal such that it only includes the 4 or so problematic sources.

a-benati commented 1 month ago

Thanks @JSKenyon. Do you know which region file is passed to CubiCal with the 4 sources, or where it is created? I can edit it and tell CubiCal to use that file instead of automatically creating a new one, right?

JSKenyon commented 1 month ago

> Thanks @JSKenyon. Do you know which region file is passed to CubiCal with the 4 sources, or where it is created? I can edit it and tell CubiCal to use that file instead of automatically creating a new one, right?

Based on your log, it is /stimela_mount/output/de-Abell3667.reg. It is created by CatDagger in the previous step. I would suggest manually creating your own region file using Carta or DS9. You could then modify the model option in the CubiCal step to use your region file (note that it appears twice in the specification of the model).

Pinging @bennahugo as he is more knowledgeable about this functionality than I am.

bennahugo commented 1 month ago

Yup, you may need to increase the local sigma threshold of the autotagger if you want to use it -- alternatively, manually create a pixel-coordinate region file for your target with astropy/DS9 to pass into CubiCal, per @JSKenyon's suggestion.

The number of facets has no bearing on the memory footprint though -- only the number of directions you mark in the region file does. I do agree that 12 tags is on the excessive end.
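
A minimal sketch of writing such a region file by hand; the output name, source positions and radii are placeholders, and the coordinate system should follow the pixel-coordinate suggestion above ("fk5" if your setup expects sky coordinates instead):

# Sketch: hand-written DS9 region file; every position/radius below is a
# placeholder to be replaced with the actual bright sources in the field.
lines = [
    "# Region file format: DS9 version 4.1",
    "image",                   # pixel coordinates, per the suggestion above
    "circle(1024, 2048, 30)",  # placeholder source 1: x, y, radius in pixels
    "circle(3100, 1500, 30)",  # placeholder source 2
]
with open("manual-dd.reg", "w") as f:  # placeholder filename
    f.write("\n".join(lines) + "\n")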


a-benati commented 1 month ago

@JSKenyon @bennahugo thank you. I will try manually creating a region file with DS9 and telling CubiCal to use that file instead of the one created by CatDagger. I will let you know if it works.

a-benati commented 1 month ago

@JSKenyon @bennahugo I created the region file manually with Carta and gave it as input to CubiCal, but I still get the same error related to the flagged data. I actually think the code stops at an earlier step, since the point in the log where the region file is read is never even reached. For example, previously the file caracaldE_sub.log was created, but now it is not. Here is my log file. log-caracal_latest.txt

JSKenyon commented 1 month ago

Can you please check the status of the flagging on the original data, prior to the pipeline being run? I don't think the pipeline is resetting the flags to their original state, i.e. now that your data is 100% flagged, it will remain that way.

a-benati commented 1 month ago

@JSKenyon yes, my data is now 100% flagged even prior to running the pipeline. Do you know how I could reset the flagging? I am running CARACal starting directly from the ddcal worker; maybe I need to start over from the beginning to get it right? And in that case, giving the right list of tagged sources to CubiCal should solve the flagging error, right?

paoloserra commented 1 month ago

Sorry to jump in, but CARACal does support flagging resetting and rewinding in a number of ways. See https://caracal.readthedocs.io/en/latest/manual/reduction/flag/index.html .

The ddcal worker might be the only one with no flagging rewinding option, but you could add a flag worker block to your config to just do the rewinding to whatever flag version you need.

JSKenyon commented 1 month ago

> Sorry to jump in, but CARACal does support flagging resetting and rewinding in a number of ways. See https://caracal.readthedocs.io/en/latest/manual/reduction/flag/index.html .
>
> The ddcal worker might be the only one with no flagging rewinding option, but you could add a flag worker block to your config to just do the rewinding to whatever flag version you need.

Thanks for jumping in! I am not really a CARACal expert so I appreciate it!

a-benati commented 1 month ago

@paoloserra thanks! I will definitely look into that, hoping that giving the manual region file to CubiCal solves the issue.

a-benati commented 1 month ago

@paoloserra I get an error saying that there aren't any flag versions for my ms file:

2024-05-17 18:06:41 CARACal INFO: flag__3: initializing
2024-05-17 18:06:41 CARACal ERROR: You have asked to rewind the flags of 1685906777_sdp_l0.ms to the version "caracal_flag__3_before" but this version
2024-05-17 18:06:41 CARACal ERROR: does not exist. The available flag versions for this .MS file are:
2024-05-17 18:06:41 CARACal ERROR: Note that if you are running Caracal on multiple targets and/or .MS files you should rewind to a flag
2024-05-17 18:06:41 CARACal ERROR: version that exists for all of them.
2024-05-17 18:06:41 CARACal ERROR: Flag version conflicts. [RuntimeError]

I attach here my log file. log-caracal_flag.txt

Athanaseus commented 1 month ago

Hi @a-benati,

You can also provide the name of the flag version like:

rewind_flags:
  enable:                       True
  mode:                         rewind_to_version
  version:                      caracal_selfcal_after

You can look up the flag versions in the flag table (<dataid>.ms.flagversions/FLAG_VERSION_LIST) to get the one you need. A description of the parameter is here: https://caracal.readthedocs.io/en/latest/manual/workers/flag/index.html#rewind-flags
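
For example, a minimal sketch that lists the available versions for the MS from this thread (each line of FLAG_VERSION_LIST holds a version name and a comment):

# Sketch: print the flag versions recorded for the MS used in this thread.
with open("1685906777_sdp_l0.ms.flagversions/FLAG_VERSION_LIST") as f:
    for line in f:
        print(line.strip())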

Note that label_in is set to an empty string, meaning the MS 1685906777_sdp_l0.ms itself is used.

Best regards