a-benati opened this issue 6 months ago
Hi @a-benati, thanks for reporting this.
Can you please share the full log?
And does it help setting `dist_nworker` to 1 or 2?
Best regards
Hi @Athanaseus,
thanks for your answer. Here is the full log: log-caracal.txt
Setting `dist_nworker` to a higher value actually makes the situation worse, since the required memory increases.
Thanks @a-benati,
by default the parameter is 0, meaning the entire dataset is loaded.
I'm curious to see the log results of `dist_nworker: 1` and compare the requested memory.
Regards
I believe the issue is the absence of time and frequency chunks in the input parameters. You will be working with extremely large chunks. I would suggest setting the input time and frequency chunks to match the solution interval of your DDE solutions in this case.
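To make that rule of thumb concrete, here is a minimal sketch (not CARACal or CubiCal code; all numbers are placeholders). The point is that CubiCal solves each chunk independently, so a solution interval cannot straddle a chunk boundary, and the smallest safe chunk is one solution interval per axis:

```python
# Minimal sketch of the chunk-vs-solution-interval rule of thumb.
# Placeholder values only -- substitute your own solution intervals.

def pick_chunk(solint: int, multiple: int = 1) -> int:
    """Chunk size as an integer multiple of the solution interval."""
    return solint * multiple

dd_time_solint = 4     # timeslots per DD solution (placeholder)
dd_freq_solint = 128   # channels per DD solution (placeholder)
n_chan = 512           # total channels in the band (placeholder)

data_time_chunk = pick_chunk(dd_time_solint)   # -> 4
data_freq_chunk = pick_chunk(dd_freq_solint)   # -> 128

# The frequency chunk should also divide the band evenly.
assert n_chan % data_freq_chunk == 0
print(data_time_chunk, data_freq_chunk)
```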
Thanks @Athanaseus. Here is the log for `dist_nworker: 4`, since I already have it and it would take ~9 hours to try with `dist_nworker: 1`.
log-caracal_dist_nworker_4.txt
@JSKenyon thanks for your answer. I agree that I need smaller time and frequency chunks, but I am not sure which parameters to change: are they `dd_dd_timeslots_int` and `dd_dd_chan_int`? And what would be a fair value? Or, better, what do I have to inspect to understand which value would be fair?
This is where the options are set in the ddcal worker: https://github.com/caracal-pipeline/caracal/blob/2d338e2afc5ed8fc723be95e74b071890d4a7bed/caracal/workers/ddcal_worker.py#L330-L331
I am not much of a CARACal user, so I am not sure of the easiest way to adjust those parameters.
In principle, for the parameters in the log you shared, `data-time-chunk=4` would probably be ideal. Currently it gets set to zero, which means each scan is treated as a single chunk. Let me know if you manage to give that a go and feel free to share further logs - I might be able to offer further insight.
Thanks @JSKenyon. When `dist_nworker: 0`, both `data-time-chunk` and `data-freq-chunk` are set to 0. When I set `dist_nworker: 4`, I have `data-time-chunk=100` and `data-freq-chunk=0`, but the memory error persists (you can check the log file above). Do you think I should set `data-time-chunk=4` directly in the code of `ddcal_worker.py`?
You could definitely give it a try and see if it resolves the issue. I see that you have 24 directions in your model - that is pretty extreme (your model will be 24 times larger than the associated visibilities). I would also suggest making your frequency solution interval something which divides 512 (the number of channels if I am not mistaken) e.g. 128. That should prevent some complications. Unfortunately, CubiCal (the underlying software package for the ddcal step) was never particularly light on memory.
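To get a rough feel for why the chunk sizes and the 24 directions matter, here is a back-of-envelope sketch. The array and chunk numbers are assumptions for illustration, not values read from the log, and CubiCal's real memory accounting is more involved:

```python
# Back-of-envelope estimate of the visibility buffers for one chunk.
# Key point from the discussion: the DD model holds one copy of the
# visibilities per direction, so 24 directions means the model alone
# is ~24x the size of the data for that chunk.

n_ant = 64                        # assumption: MeerKAT-like array
n_bl = n_ant * (n_ant - 1) // 2   # 2016 baselines
n_corr = 4                        # full polarisation
bytes_per_vis = 8                 # complex64
n_dir = 24                        # directions tagged in the region file

def chunk_gib(n_times, n_chan):
    """Approximate data + per-direction model size for one chunk, in GiB."""
    data = n_times * n_bl * n_chan * n_corr * bytes_per_vis
    model = data * n_dir
    return (data + model) / 1024**3

# Scan-sized chunk (data-time-chunk=0), assuming ~360 timeslots per scan,
# over the full 512-channel band:
print(f"scan-sized chunk: ~{chunk_gib(360, 512):.0f} GiB")
# Chunk matched to the solution intervals (4 timeslots x 128 channels):
print(f"4 x 128 chunk:    ~{chunk_gib(4, 128):.2f} GiB")
# With several worker processes loading chunks concurrently, multiply
# the footprint roughly by the number of workers.
```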
@JSKenyon thanks, I will try setting `data-time-chunk=4` and `data-freq-chunk=128` directly in the code and see what happens. I will keep you posted.
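For reference, a purely hypothetical sketch of the kind of edit meant here. This is not the actual code at the linked lines of `ddcal_worker.py`; it only illustrates the idea of pinning the two CubiCal chunking options to fixed values instead of letting them end up as 0:

```python
# Hypothetical illustration only -- not the real ddcal_worker.py source.
# Wherever the worker assembles the options passed to the CubiCal step,
# the chunking keys would be hard-coded rather than left at 0:
cubical_opts = {
    # ... options the worker already sets ...
    "data-time-chunk": 4,    # match the DD time solution interval
    "data-freq-chunk": 128,  # divides the 512 channels evenly
}
```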
@Athanaseus, @JSKenyon thanks. I think I solved that error since the code gets through the part where it was stuck before. However, now I get another error, which I believe is related to flagging in DDFacet (I think all data are flagged). Here is the log file: log-caracal_new.txt
It looks like the data has been almost completely flagged, possibly by CubiCal. You should probably check your flagging before and after that step. CubiCal is also very unhappy about the SNR in many of the directions. I would suggest looking at your image prior to DD calibration to make sure that all 24 of those directions really require DD solutions.
@JSKenyon thanks. I reduced the number of facets to 12, but I don't think I really need that many directions: I only have 3 or 4 very bright sources in my field which corrupt everything else. Do you think that reducing the number of facets to 4 or 6 could solve the flagging issue? Or is it a completely independent problem?
Unfortunately I did not implement the DDFacet component of the visibility prediction, so I am not an expert. I think that you would likely need to edit the region file passed to CubiCal such that it only includes the 4 or so problematic sources.
Thanks @JSKenyon. Do you know which region file is passed to CubiCal with the 4 sources, or where it is created? I can edit it and tell CubiCal to use that file instead of automatically creating a new one, right?
Based on your log, it is `/stimela_mount/output/de-Abell3667.reg`. It is created by CatDagger in the previous step. I would suggest manually creating your own region file using CARTA or DS9. You could then modify the model option in the CubiCal step to use your region file (note that it appears twice in the specification of the model).
Pinging @bennahugo as he is more knowledgeable about this functionality than I am.
Yup, you may need to increase the local sigma threshold of the autotagger if you want to use it -- alternatively, manually create a pixel-coordinate region file for your target with astropy / DS9 to pass into CubiCal, per @JSKenyon's suggestion.
The number of facets has no bearing on the memory footprint though -- only the number of directions you tagged in the region file does. I do agree that 12 tags is on the excessive end.
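In case it helps with the manual route, here is a minimal sketch of writing a pixel-coordinate DS9 region file by hand. The positions, radii and output file name below are placeholders; take the real pixel coordinates of your 3-4 problem sources from your image in CARTA or DS9:

```python
# Minimal sketch: write a DS9 region file in image (pixel) coordinates,
# one circle per problem source. All values are placeholders.
problem_sources = [
    (1024.0, 2048.0, 15.0),   # (x_pix, y_pix, radius_pix) -- placeholder
    (3100.5, 1800.2, 20.0),   # placeholder
    (2500.0,  950.0, 12.0),   # placeholder
]

lines = ["# Region file format: DS9 version 4.1", "image"]
lines += [f"circle({x},{y},{r})" for x, y, r in problem_sources]

with open("my_dd_sources.reg", "w") as f:   # example file name
    f.write("\n".join(lines) + "\n")
```

You would then point the model option in the CubiCal step at this file in both places it appears, as mentioned above.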
@JSKenyon @bennahugo thank you. I will try creating a region file manually with DS9 and telling CubiCal to use that file instead of the one created by CatDagger. I will let you know if it works.
@JSKenyon @bennahugo I created the region file manually with CARTA and gave it as input to CubiCal, but I still get the same error related to the flagged data. I actually think the code stops at an earlier step, since the point where the region file is read is never reached in the log. For example, the file caracaldE_sub.log used to be created, but now it is not. Here is my log file: log-caracal_latest.txt
Can you please check the status of the flagging on the original data, prior to the pipeline being run? I don't think that the pipeline is resetting the flags to their original state i.e. now that your data is 100% flagged, it will remain that way.
@JSKenyon yes, my data is now 100% flagged even prior to running the pipeline. Do you know how I could reset the flagging? I am running CARACal starting directly from the ddcal worker; maybe I need to start over from the beginning to get it right? And in that case, giving the right list of tagged sources to CubiCal should solve the flagging error, right?
Sorry to jump in, but CARACal does support flag resetting and rewinding in a number of ways. See https://caracal.readthedocs.io/en/latest/manual/reduction/flag/index.html .
The ddcal worker might be the only one with no flag-rewinding option, but you could add a flag worker block to your config to rewind to whatever flag version you need.
Thanks for jumping in! I am not really a CARACal expert so I appreciate it!
@paoloserra thanks! I will definitely look into that, hoping that giving the manual region file to CubiCal solves the issue.
@paoloserra I get an error saying that there aren't any flag versions for my ms file:
```
2024-05-17 18:06:41 CARACal INFO: flag__3: initializing
2024-05-17 18:06:41 CARACal ERROR: You have asked to rewind the flags of 1685906777_sdp_l0.ms to the version "caracal_flag__3_before" but this version
2024-05-17 18:06:41 CARACal ERROR: does not exist. The available flag versions for this .MS file are:
2024-05-17 18:06:41 CARACal ERROR: Note that if you are running Caracal on multiple targets and/or .MS files you should rewind to a flag
2024-05-17 18:06:41 CARACal ERROR: version that exists for all of them.
2024-05-17 18:06:41 CARACal ERROR: Flag version conflicts. [RuntimeError]
```
I attach here my log file. log-caracal_flag.txt
Hi @a-benati,
You can also provide the name of the flag version like:
```yaml
rewind_flags:
  enable: True
  mode: rewind_to_version
  version: caracal_selfcal_after
```
You can look up the flag versions in the flag table (`<dataid>.ms.flagversions/FLAG_VERSION_LIST`) to get the one you need.
A description of the parameter is here: https://caracal.readthedocs.io/en/latest/manual/workers/flag/index.html#rewind-flags
Note that `label_in` is set to an empty string, meaning the MS `1685906777_sdp_l0.ms` is used.
Best regards
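If it is useful, here is a small snippet to list the saved flag versions so you can pick the right `version` value. It simply prints the plain-text `FLAG_VERSION_LIST` index that the casacore flag versioning keeps next to the MS (the MS name is the one from this thread; adjust as needed):

```python
from pathlib import Path

ms = "1685906777_sdp_l0.ms"   # adjust to your own MS
version_list = Path(ms + ".flagversions") / "FLAG_VERSION_LIST"

if version_list.exists():
    # Each line typically holds: <version name> : <comment> : <timestamp>
    print(version_list.read_text())
else:
    print(f"No saved flag versions found for {ms}")
```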
Hello,
the ddcal worker fails with:
```
MemoryError: Estimated memory usage exceeds allowed percentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can be suppressed by setting --dist-safe to zero.
```
I don't understand which parameters should be modified in order to solve this problem. The number of worker processes (`dist_nworker`) is set to 0. I tried setting the `data_chunkhours` parameter to 0.01 instead of the default 0.05 and nothing seems to change. Here is the log file where the error is encountered:
I found the same problem in #1466, but trying to adjust the parameters `dd_g_timeslots_int` and `dd_dd_timeslots_int` does not seem to improve the situation (I tried with `dd_g_timeslots_int: 16` and `dd_dd_timeslots_int: 16`, and with `dd_g_timeslots_int: 4` and `dd_dd_timeslots_int: 4`). Do you know how I can solve this problem?