caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0

ddcal error in finding regions on the calibration step #1466

Closed rramij closed 1 year ago

rramij commented 1 year ago

The ddcal worker stops with the following error: the required file "de-Abell85.reg" is not present in the 'output/3GC/Abell85_ddcal' directory.

# INFO      09:20:48 - data_handler       [0.3 11.0 0.0Gb] Input model feed rotation disabled, PA rotation disabled
# INFO      09:20:48 - data_handler       [0.3 11.0 0.0Gb] Output visibilities derotation disabled
#  - 09:20:49 - GiveModelMachine             | Initialising HOGBOM model machine
#  - 09:20:49 - AsyncProcessPool             | Fixing parent process to vthread 0
#  - 09:20:49 - AsyncProcessPool             | Worker affinities not specified, leaving unset
# INFO      09:20:49 - DDFacetSim         [0.3 11.1 0.0Gb] Initializing new source provider for DDFacet model '/stimela_mount/output/mypipelinerun_Abell85-DD-precal.DicoModel' into regions specified by '/stimela_mount/output/de-Abell85.reg'.
# INFO      09:20:49 - main               [0.3 11.1 0.0Gb] Exiting with exception: FileNotFoundError([Errno 2] No such file or directory: '/stimela_mount/output/de-Abell85.reg')
#  Traceback (most recent call last):
#   File "/opt/venv/lib/python3.8/site-packages/cubical/main.py", line 448, in main
#     ms.init_models(str(GD["model"]["list"]).split(","),
#   File "/opt/venv/lib/python3.8/site-packages/cubical/data_handler/ms_data_handler.py", line 675, in init_models
#     component = DicoSourceProvider(component,
#   File "/opt/venv/lib/python3.8/site-packages/cubical/degridder/DicoSourceProvider.py", line 58, in __init__
#     self.__clustercat = self.__read_regions_file(clustercat,
#   File "/opt/venv/lib/python3.8/site-packages/cubical/degridder/DicoSourceProvider.py", line 108, in __read_regions_file
#     with open(fn) as f:
# FileNotFoundError: [Errno 2] No such file or directory: '/stimela_mount/output/de-Abell85.reg'
# 
# Traceback (most recent call last):
#   File "/stimela_mount/code/run.py", line 57, in <module>
#     subprocess.check_call(shlex.split(_runc))
#   File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
#     raise CalledProcessError(retcode, cmd)

I have attached the log files here: mypipelinerundE_sub.log, mypipelinerun_Abell85-DD-precal.log, mypipelinerun_Abell85-DD-masking.log, log-caracal.txt

Athanaseus commented 1 year ago

Thanks for reporting this.

This is related to #1438 and #1429.

Please try installing the branch update_regions_module.

rramij commented 1 year ago

After installing the suggested branch, the region error is gone, but now a new MemoryError exception is raised.

# INFO      01:20:39 - data_handler       [0.3 11.1 0.0Gb]     direction 15: DDE_REG8(/stimela_mount/output/mypipelinerun_Abell85-DD-precal.DicoModel)
# INFO      01:20:39 - data_handler       [0.3 11.1 0.0Gb]     direction 16: DDE_REG9(/stimela_mount/output/mypipelinerun_Abell85-DD-precal.DicoModel)
# INFO      01:20:39 - main               [0.3 11.1 0.0Gb] subtraction directions set to slice(1, None, None)
# INFO      01:20:39 - param_db           [0.3 11.1 0.0Gb] creating /stimela_mount/output/g_final-cal_A85_MeerKAT_Lband_Abell85_corr_1_Abell85.parmdb in fragmented mode
# INFO      01:20:39 - gain_machine       [0.3 11.1 0.0Gb] G solutions will be saved to /stimela_mount/output/g_final-cal_A85_MeerKAT_Lband_Abell85_corr_1_Abell85.parmdb
# INFO      01:20:39 - param_db           [0.3 11.1 0.0Gb] creating /stimela_mount/output/dd_cal_final_A85_MeerKAT_Lband_Abell85_corr_1_Abell85.parmdb in fragmented mode
# INFO      01:20:39 - gain_machine       [0.3 11.1 0.0Gb] DD solutions will be saved to /stimela_mount/output/dd_cal_final_A85_MeerKAT_Lband_Abell85_corr_1_Abell85.parmdb
# INFO      01:20:39 - param_db           [0.3 11.1 0.0Gb] creating /stimela_mount/output/mypipelinerundE_sub-BBC-field_0-ddid_None.parmdb in fragmented mode
# INFO      01:20:39 - gain_machine       [0.3 11.1 0.0Gb] will compute & save suggested baseline-based corrections (BBCs) to /stimela_mount/output/mypipelinerundE_sub-BBC-field_0-ddid_None.parmdb
# INFO      01:20:39 - gain_machine       [0.3 11.1 0.0Gb]   (these can optionally be applied in a subsequent CubiCal run)
# INFO      01:20:39 - main               [0.3 11.1 0.0Gb] defining chunks (time 0, freq 0, also when SCAN_NUMBER jumps > 1.0)
# INFO      01:20:40 - data_handler       [0.7 11.7 0.0Gb]   read indexing columns (5071662 total rows)
# INFO      01:20:41 - data_handler       [0.7 11.7 0.0Gb]   built timeslot index (2682 unique timestamps)
# INFO      01:20:41 - data_handler       [0.7 11.7 0.0Gb]   max chunk size is -- timeslots and/or -- seconds
# INFO      01:20:50 - data_handler       [0.8 11.8 0.0Gb]   found 36 time chunks: 0:0:0 1:141825:75 2:283650:150 3:423584:224 4:563518:298 5:705343:373 6:847168:448 7:988993:523 8:1130818:598 9:1270752:672 10:1410686:746 11:1550620:820 12:1690554:894 13:1830488:968 14:1972313:1043 15:2112247:1117 16:2252181:1191 17:2392115:1265 18:2532049:1339 19:2671983:1413 20:2813808:1488 21:2953742:1562 22:3095567:1637 23:3235501:1711 24:3375435:1785 25:3517260:1860 26:3657194:1934 27:3799019:2009 28:3940844:2084 29:4082669:2159 30:4222603:2233 31:4362537:2307 32:4504362:2382 33:4646187:2457 34:4788012:2532 35:4929837:2607 2682
# INFO      01:20:50 - data_handler       [0.8 11.8 0.0Gb]   generated 36 row chunks based on time and DDID
# INFO      01:20:50 - data_handler       [0.8 11.8 0.0Gb]   row chunks yield 36 potential tiles
# INFO      01:20:50 - data_handler       [0.8 11.8 0.0Gb]   coarsening this to 18 tiles (max 2 chunks per tile, based on 1/0 requested)
# INFO      01:20:50 - main               [1.0 12.3 0.0Gb] Cannot use TBB threading (check your installation). Reverting to workqueue parallelization and dropping to 1 thread per worker in solvers
# INFO      01:20:50 - main               [1.0 12.3 0.0Gb] multi-process mode: 1+1 workers, --dist-nthread 1
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Detected a total of 503.49GiB of system memory.
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Per-solver (worker) memory use estimated at 897.68GiB: 178.29% of total system memory.
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Peak I/O memory use estimated at 649.69GiB: 129.04% of total system memory.
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Total peak memory usage estimated at 1547.37GiB: 307.33% of total system memory.
# INFO      01:20:50 - main               [1.0 12.3 0.0Gb] Exiting with exception: MemoryError(Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.)
#  Traceback (most recent call last):
#   File "/opt/venv/lib/python3.8/site-packages/cubical/main.py", line 548, in main
#     estimate_mem(ms, tile_list, GD["data"], GD["dist"])
#   File "/opt/venv/lib/python3.8/site-packages/cubical/data_handler/wisdom.py", line 89, in estimate_mem
#     raise MemoryError(
# MemoryError: Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.
# 

Please guide me on how to reduce the chunks. The parameters dd_g_timeslots_int, dd_g_chan_int, dd_dd_timeslots_int and dd_dd_chan_int were left at their default values during the run. Should I increase the timeslots to make fewer chunks, or decrease the timeslots to reduce the chunk size? I am a bit confused.
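For intuition on the question above: per-chunk memory grows roughly linearly with the number of timeslots in a chunk, so it is *decreasing* the timeslots per chunk that lowers peak usage. A back-of-the-envelope sketch follows; this is not CubiCal's actual estimator. The 17 directions and the ~75 timeslots per chunk are read off the log above, while the baseline, channel and correlation counts are illustrative assumptions:

```python
# Back-of-the-envelope per-chunk visibility memory (illustrative only).
def chunk_gib(timeslots, baselines, channels, corrs, ndirs, bytes_per_vis=8):
    """complex64 visibilities: one data copy plus one model copy per direction."""
    rows = timeslots * baselines
    return rows * channels * corrs * bytes_per_vis * (1 + ndirs) / 2**30

# 17 directions appear in the log; 2016 baselines and 4096 channels are
# assumed for illustration. Shrinking the time chunk from ~75 to 16
# timeslots shrinks per-chunk memory by the same factor:
ratio = chunk_gib(75, 2016, 4096, 4, 17) / chunk_gib(16, 2016, 4096, 4, 17)
print(round(ratio, 2))  # → 4.69, i.e. exactly 75/16
```

The absolute numbers are rough, but the linear scaling in timeslots is the point: smaller time chunks mean proportionally smaller per-solver memory.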

Athanaseus commented 1 year ago

This is related to #1178 and #1085; please see https://github.com/caracal-pipeline/caracal/issues/1017#issuecomment-621118563 for more info.

rramij commented 1 year ago

The documentation does not explain how to set --data-time-chunk. The selfcal parameter cal_timeslots_chunk does not work in ddcal. I have tried dd_data_time_chunk, data_time_chunk and dd_data_time, but none of them work. Please guide me to the correct input parameter name.

KshitijT commented 1 year ago

@rramij, please try a lower value for the dist-nworker parameter (maybe 4); that should help.

rramij commented 1 year ago

I tried setting dist_nworker: 4, but it made the situation worse. Later I realised that in the first run it was left at the default of 0, and during the pipeline run it was automatically set to 1 worker. Please see the comparison below.

# INFO      01:20:50 - data_handler       [0.8 11.8 0.0Gb]   generated 36 row chunks based on time and DDID
# INFO      01:20:50 - data_handler       [0.8 11.8 0.0Gb]   row chunks yield 36 potential tiles
# INFO      01:20:50 - data_handler       [0.8 11.8 0.0Gb]   coarsening this to 18 tiles (max 2 chunks per tile, based on 1/0 requested)
# INFO      01:20:50 - main               [1.0 12.3 0.0Gb] Cannot use TBB threading (check your installation). Reverting to workqueue parallelization and dropping to 1 thread per worker in solvers
# INFO      01:20:50 - main               [1.0 12.3 0.0Gb] multi-process mode: 1+1 workers, --dist-nthread 1
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Detected a total of 503.49GiB of system memory.
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Per-solver (worker) memory use estimated at 897.68GiB: 178.29% of total system memory.
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Peak I/O memory use estimated at 649.69GiB: 129.04% of total system memory.
# INFO      01:20:50 - wisdom             [1.0 12.3 0.0Gb] Total peak memory usage estimated at 1547.37GiB: 307.33% of total system memory.
# INFO      01:20:50 - main               [1.0 12.3 0.0Gb] Exiting with exception: MemoryError(Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.)

# INFO      15:20:27 - data_handler       [0.8 11.8 0.0Gb]   generated 36 row chunks based on time and DDID
# INFO      15:20:27 - data_handler       [0.8 11.8 0.0Gb]   row chunks yield 36 potential tiles
# INFO      15:20:27 - data_handler       [0.8 11.8 0.0Gb]   coarsening this to 9 tiles (max 4 chunks per tile, based on 4/4 requested)
# INFO      15:20:27 - main               [1.1 12.3 0.0Gb] Cannot use TBB threading (check your installation). Reverting to workqueue parallelization and dropping to 1 thread per worker in solvers
# INFO      15:20:27 - main               [1.1 12.3 0.0Gb] multi-process mode: --dist-nworker 4 (+1), --dist-nthread 1
# INFO      15:20:27 - wisdom             [1.1 12.3 0.0Gb] Detected a total of 503.49GiB of system memory.
# INFO      15:20:27 - wisdom             [1.1 12.3 0.0Gb] Per-solver (worker) memory use estimated at 897.68GiB: 178.29% of total system memory.
# INFO      15:20:27 - wisdom             [1.1 12.3 0.0Gb] Peak I/O memory use estimated at 1299.00GiB: 258.00% of total system memory.
# INFO      15:20:27 - wisdom             [1.1 12.3 0.0Gb] Total peak memory usage estimated at 4889.71GiB: 971.16% of total system memory.
# INFO      15:20:27 - main               [1.1 12.3 0.0Gb] Exiting with exception: MemoryError(Estimated memory usage exceeds allowed pecentage of system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by setting --dist-safe to zero.)

It seems I need to make the chunk size small enough to fit in memory. Please help me with how to set the chunk size; that should solve the problem.

KshitijT commented 1 year ago

Sorry @rramij, I was confused by the earlier caracal log. Could you please share the log for the last run?

rramij commented 1 year ago

log-caracal.txt

JSKenyon commented 1 year ago

The issue here is that the model is going to be incredibly large (more so due to the internal representation in CubiCal). It may be very slow, but I would set ncpu, nworker and max_chunks to 1. Decreasing the solution interval on the DE so that you can decrease the chunk size in time may also help. Bear in mind that it is crashing because the memory estimates (emphasis on estimate) suggest that it will end up hitting swap. You can force it to ignore this by setting dist_safe to False. Hopefully we will have the functionality and time to replace CubiCal with QuartiCal at some point soon - it is much more streamlined from a memory perspective.

KshitijT commented 1 year ago

The issue here is that the model is going to be incredibly large (more so due to the internal representation in CubiCal). It may be very slow, but I would set ncpu, nworker and max_chunks to 1. Decreasing the solution interval on the DE so that you can decrease the chunk size in time may also help. Bear in mind that it is crashing because the memory estimates (emphasis on estimate) suggest that it will end up hitting swap. You can force it to ignore this by setting dist_safe to False. Hopefully we will have the functionality and time to replace CubiCal with QuartiCal at some point soon - it is much more streamlined from a memory perspective.

@rramij , just a bit of helpful annotation on @JSKenyon's reply:

For the ddcal worker, the CubiCal parameters translate as follows:

ncpu --> dist_ncpu
nworker --> dist_nworker
max_chunks --> dist_max_chunks

The solution interval can be set by dd-time-int. dist_safe is not added as an explicit parameter in the worker, so it can't be set/unset, but I can add that if you need it.
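Put together, a minimal ddcal section using these settings might look like the sketch below. This is an illustrative assumption, not a verified config: only the parameter names in the mapping above come from this thread, and the values follow the suggestion to minimise parallelism to reduce memory use.

```yaml
ddcal:
  enable: true
  dist_ncpu: 1        # CubiCal ncpu: a single CPU for the solver
  dist_nworker: 1     # CubiCal nworker: a single worker process
  dist_max_chunks: 1  # CubiCal max_chunks: one chunk in memory at a time
```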

KshitijT commented 1 year ago

@rramij , could you please confirm this works for you?

rramij commented 1 year ago

I was running other code afterwards, so I couldn't check it before. Now I am trying it again, but I am getting a strange error when trying to skip the DDFacet imaging run by setting image_dd enable: False.

ERROR: output/3GC/Abell85_ddcal/de-Abell85.reg already exists [OSError]. Is there any way to skip running DDFacet every time I try to do the calibration part? Am I doing it wrong? log.txt

KshitijT commented 1 year ago

I was running other code afterwards, so I couldn't check it before. Now I am trying it again, but I am getting a strange error when trying to skip the DDFacet imaging run by setting image_dd enable: False.

ERROR: output/3GC/Abell85_ddcal/de-Abell85.reg already exists [OSError]. Is there any way to skip running DDFacet every time I try to do the calibration part? Am I doing it wrong? log.txt

Unfortunately there is no way to skip the initial DDFacet imaging run yet; this is a precaution to avoid carrying out DD calibration without the correct model. The image_dd switch is only for imaging after DD calibration (which you may or may not wish to do).

The error you are getting is because the region file is already in the output folder from the previous ddcal worker run. This is a bug and I'll fix it; in the meantime, you can rerun after deleting the region file from the output folder.
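Concretely, using the path from the error message above, the workaround is a one-line removal (a sketch; adjust the path to your own run):

```shell
# Delete the stale region file left by the previous ddcal run;
# -f makes this a no-op if the file is already gone.
rm -f output/3GC/Abell85_ddcal/de-Abell85.reg
```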

rramij commented 1 year ago

After setting dd_g_timeslots_int: 16 and dd_dd_timeslots_int: 16, I found that --data-time-chunk is automatically set to 16 as well. The memory issue is now resolved and the CubiCal calibration starts normally.
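For future readers, the settings that resolved the memory error amount to the fragment below. Only these two keys (and their value of 16) are confirmed by this thread; the surrounding structure is assumed:

```yaml
ddcal:
  dd_g_timeslots_int: 16   # G solution interval; --data-time-chunk follows it
  dd_dd_timeslots_int: 16  # DE solution interval
```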

KshitijT commented 1 year ago

After setting dd_g_timeslots_int: 16 and dd_dd_timeslots_int: 16, I found that --data-time-chunk is automatically set to 16 as well. The memory issue is now resolved and the CubiCal calibration starts normally.

Thanks for confirming! I am closing this issue; please reopen it if you encounter the same problem again.