healytwin1 closed this issue 4 years ago
Presumably the same as stimela issue 577.
Looks like it
I think this happens when cubical does not have enough shared memory. Podman and singularity cannot allocate the resources, so your system admin has to increase the shared memory allocation.
I haven't had this issue with previous versions of caracal on singularity.
It looks like singularity is starting to address this:
I haven't had this issue with previous versions of caracal on singularity.
On the same system, and with the same data?
Yup, the only thing that has changed is the config file and the version of caracal.
Yes it's the same problem as @viralp reported in #577 (yes @viralp, I have not forgotten).
Sadly, I don't think this is a shared memory issue. If you look at the line where the exception is thrown, it's a bog-standard array allocation in all cases, not a shm allocation.
A bit puzzling, because the memory usage reported prior to that ("[71.2/77.2 75.5/80.2 25.6Gb]") is not high at all.
Could you please try to extract this into a standalone failure for us (here's an MS, here's a config.yml, things fail when I run caracal -sw selfcal).
In that case, can you try installing the master branch of stimela:
pip install --force-reinstall -U git+https://github.com/ratt-ru/Stimela
In that case, can you try installing the master branch of stimela:
pip install --force-reinstall -U git+https://github.com/ratt-ru/Stimela
Doing this with both the singularity and podman installations now results in the same error: memory allocation not available.
Could you please try to extract this into a standalone failure for us (here's an MS, here's a config.yml, things fail when I run caracal -sw selfcal).
The MS I have been using spans 130MHz (~87GB before I do anything). I am attaching the config file used and the corresponding log files, which show the errors: epoch1.txt log-caracal-podman.txt log-caracal-singularity.txt
# INFO 18:24:17 - wisdom [0.8/0.8 3.7/3.8 0.0Gb] Peak I/O memory use estimated at 30.15GiB: 12.77% of total system memory.
# INFO 18:24:17 - wisdom [0.8/0.8 3.7/3.8 0.0Gb] Total peak memory usage estimated at 279.66GiB: 118.42% of total system memory.
# INFO 18:24:17 - main [0.8/0.8 3.7/3.8 0.0Gb] Exiting with exception: MemoryError(Estimated memory usage exceeds total system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by unsetting --dist-safe.)
# Traceback (most recent call last):
This is good! The wisdom files are working. Given the estimated memory usage overflow, I would re-run with 5 CPUs instead of 7.
@SpheMakh, when and how exactly can that happen, and how can we warn/inform the user?
Also, how do we remove the error? (Why would the memory usage be reduced by reducing the number of CPUs?)
Glancing at the log (bear in mind I don't know precisely what you are doing), there are a handful of things which could be improved:
- --data-time-chunk could be set equal to the solution interval on G. CubiCal operates in parallel over chunks, so more, smaller chunks are usually preferable.
- --dist-max-chunks is set to 1. This will basically disable parallelism, as only a single chunk (with some caveats) will be loaded into memory. It should not be lower than --dist-ncpu, and I will consider putting a check in for this.
- --dist-nthread can be set higher if you have spare compute capacity but are running into memory woes. This should (imperfectly) enable nested parallelism, which enhances compute without compromising memory.
In reply to your question @gigjozsa, the answer is implicit in the way CubiCal operates in parallel over chunks using multiprocessing. Each process is assigned a chunk, and processing that chunk has a certain memory footprint. By using fewer processes, fewer chunks are processed simultaneously, which in turn reduces the memory footprint.
For the example above I strongly suggest setting --data-time-chunk 120, --dist-max-chunks 5 and --dist-ncpu 6. Setting --dist-nthread to your spare compute capacity might also help.
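Put together, a minimal command-line sketch of those settings (the MS path is a placeholder, and in a caracal run these values would be set through the config rather than passed to gocubical directly):
# Sketch only: the suggested distribution/chunking settings passed straight to gocubical.
# --dist-ncpu 6 roughly corresponds to 1 I/O process plus 5 solver processes, and
# --dist-max-chunks 5 keeps at most 5 chunks resident in memory at once.
gocubical --sol-jones G --data-ms /path/to/my.ms \
          --data-time-chunk 120 --dist-ncpu 6 --dist-max-chunks 5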
Thanks, @JSKenyon. Is there any way to predict this?
Another point to consider: does G truly need to be solved over all the channels? Having an infinite solution interval along an axis makes it impossible to chunk the data along that axis, which in turn limits the number of chunks that can be processed simultaneously. If anyone is interested/motivated, it might be interesting to solve for G using some generous but finite solution interval (say 50% of the channels) and similar chunking. It might then be possible to quantify how the quality of the solution was impacted by the reduction in solution interval (my suspicion is that it may in fact be improved). The only lingering concern is one of source suppression, but someone with more know-how than I could likely quantify that too.
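For concreteness, a hedged sketch of that experiment (the 311-channel interval is purely illustrative, roughly half of the ~622 channels in this MS; whether it is acceptable scientifically is exactly the open question above):
# Sketch only: give G a generous but finite frequency solution interval and chunk to match,
# so that the data can also be split along the frequency axis.
gocubical --sol-jones G --data-ms /path/to/my.ms \
          --g-time-int 240 --g-freq-int 311 \
          --data-time-chunk 240 --data-freq-chunk 311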
What I mean is: if the user informs us about a maximum memory allocation, can we adjust the options accordingly?
@gigjozsa The wisdom module I added is a step in that direction but it is a non-trivial problem. Changing options for the user can also be a bit of a trap. Bear in mind that things like solution interval place some fundamental limits on problem size, particularly when they are infinite, and adjusting solution interval is a non-starter for many reasons.
Should we close this issue, then (once the discussion about the best parameter choice, which is very helpful, is finished), and create a new one to propagate wisdom warnings/insights to the caracal user?
Let's see if @healytwin1's run goes through with the suggestions.
Changed those parameters, but still get the memory error on both podman and singularity.
Hey @healytwin1. Could you please post the log from your most recent failed run?
Thanks! Out of interest, are you the only person using the system? Those are genuine memory allocation errors.
It is entirely possible that my wisdom estimates are failing, but I would need the MS to be sure. The podman example is unlikely to work as it is configured with only 32GB of RAM.
Thanks! Out of interest, are you the only person using the system? Those are genuine memory allocation errors.
Yes. The podman one is my desktop, so I am the only one using it. And with the singularity run, I made sure that I was allocated the maximum memory and cpus for a single node on ilifu.
It is entirely possible that my wisdom estimates are failing, but I would need the MS to be sure. The podman example is unlikely to work as it is configured with only 32GB of RAM.
My desktop also has ~500GB of swap, so I thought this should still be able to run, albeit much slower than on a higher-memory system.
Ah ok. In the podman case, you can try setting --dist-safe 0, but it will likely take days to process if it processes at all. Currently CubiCal is killing itself because it knows it will hit swap, which is usually game over.
Could you please post the output of ulimit -a for the ilifu node?
@o-smirnov Was Viral also working on an ilifu/IDIA node? I am growing suspicious.
Ah ok. In the podman case, you can try setting --dist-safe 0, but it will likely take days to process if it processes at all. Currently CubiCal is killing itself because it knows it will hit swap, which is usually game over.
Hmm, okay. Will try that.
Could you please post the output of ulimit -a for the ilifu node?
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 967206
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) 247463936
open files (-n) 16384
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 967206
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Maybe contact the sysadmin. The maximum locked memory seems to be 16 MB, which is far too little. If I remember correctly you can actually specify that, but I believe ilifu changed some rules. I hope I will find the time to test this today.
Yeah, I think @gigjozsa is right, the ulimit -l setting is way too low. Can you change it yourself, @healytwin1? What does it say if you do ulimit -l unlimited?
ulimit -l unlimited
gives:
bash: ulimit: max locked memory: cannot modify limit: Operation not permitted
I had used the following command that was supposed to assign the maximum resources for a particular node:
srun -N 1 --mem 236G --ntasks-per-node 1 --cpus-per-task 32 --time 13-24 --pty bash
I think you must shake the sysadmins about this. There's no reason to allocate you 236G and then make only 16M lockable. It's downright perverse.
pinging @viralp
Sorry for jumping late into the discussion. I installed Caracal and Stimela (master branch) on the Rhodes server ike and on the com04 machine, which ensures the same packages and singularity installation on both. I was getting the same memory error on ike that @healytwin1 reported, but the same process with the same config file ran smoothly on com04. So I think the issue is in the system-wide memory allocation.
I have managed to step over the memory error (on podman) for now by frequency-averaging my dataset, using a larger solution interval and decreasing the number of CPUs, but I now hit another error which I do not understand:
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Detected a total of 31.32GiB of system memory.
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Per-solver (worker) memory use estimated at 6.74GiB: 21.51% of total system memory.
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Peak I/O memory use estimated at 7.20GiB: 22.98% of total system memory.
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Total peak memory usage estimated at 20.67GiB: 66.00% of total system memory.
# INFO 11:24:52 - data_handler [0.8/0.9 1.9/2.0 0.2Gb] inserting new column BITFLAG
# INFO 11:24:52 - data_handler [0.8/0.9 1.9/2.0 0.2Gb] inserting new column BITFLAG_ROW
# INFO 11:24:53 - data_handler [0.9/0.9 1.9/2.0 0.2Gb] auto-filling bitflag 'legacy' from FLAG/FLAG_ROW column. Please do not interrupt this process!
# INFO 11:24:53 - data_handler [0.9/0.9 1.9/2.0 0.2Gb] note that all other bitflags will be cleared by this
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] auto-fill complete
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] BITFLAG column defines the following flagsets: legacy:1
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] will exclude flagset cubical
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] flagset 'cubical' not found -- ignoring
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] applying BITFLAG mask 1 to input data
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] will save output flags into BITFLAG 'cubical' (2), and into FLAG/FLAG_ROW
# INFO 11:25:05 - main [0.9/1.5 2.0/2.7 0.2Gb] waiting for I/O on tile 0/5
# INFO 11:25:05 - main [io] [0.8/0.8 1.9/1.9 0.2Gb] loading tile 0/5
# INFO 11:25:05 - ms_tile [io] [0.8/0.8 1.9/1.9 0.2Gb] tile 0/5: reading MS rows 0~874943
# INFO 11:25:05 - data_handler [io] [0.8/0.8 1.9/1.9 0.2Gb] reading DATA
# INFO 11:25:07 - data_handler [io] [3.3/3.3 4.7/4.7 0.2Gb] reading BITFLAG
# INFO 11:25:15 - ms_tile [io] [4.7/6.1 5.8/7.2 0.2Gb] 6.29% input visibilities flagged and/or deselected
# INFO 11:25:27 - main [0.9/1.5 2.0/2.7 1.1Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
# Traceback (most recent call last):
# File "/usr/local/lib/python3.6/dist-packages/cubical/main.py", line 566, in main
# stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 214, in run_process_loop
# return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 274, in _run_multi_process_loop
# if not io_futures[itile].result():
# File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
# return self.__get_result()
# File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
# raise self._exception
# concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
#
# Traceback (most recent call last):
# File "/stimela_mount/code/run.py", line 57, in <module>
# subprocess.check_call(shlex.split(_runc))
# File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
# raise CalledProcessError(retcode, cmd)
# subprocess.CalledProcessError: Command '['gocubical', '--sol-jones', 'G', '--data-ms', '/stimela_mount/msdir/HighZ_1563407160_XXYY-Abell2626_corravg.ms', '--data-column', 'DATA', '--data-time-chunk', '240', '--data-freq-chunk', '0', '--model-list', 'MODEL_DATA', '--model-ddes', 'never', '--montblanc-dtype', 'float', '--weight-column', 'WEIGHT', '--flags-apply', '-cubical', '--madmax-enable', '1', '--madmax-estimate', 'corr', '--madmax-plot', '1', '--madmax-threshold', '0,10', '--sel-diag', '1', '--sol-term-iters', '5,0', '--bbc-save-to', '/stimela_mount/output/continuum/selfcal_products/bbc-gains-1-HighZ_1563407160_XXYY-Abell2626_corravg.parmdb', '--dist-ncpu', '3', '--dist-max-chunks', '3', '--out-name', '/stimela_mount/output/continuum/selfcal_products/HighZ_E1_v1_HighZ_1563407160_XXYY-Abell2626_corravg_1_cubical', '--out-overwrite', '1', '--out-mode', 'sc', '--out-casa-gaintables', '1', '--out-plots', '1', '--dist-nworker', '0', '--log-memory', '1', '--log-boring', '1', '--g-time-int', '240', '--g-freq-int', '0', '--g-clip-low', '0.5', '--g-clip-high', '2.0', '--g-solvable', '1', '--g-type', 'phase-diag', '--g-save-to', '/stimela_mount/output/continuum/selfcal_products/g-phase-gains-1-HighZ_1563407160_XXYY-Abell2626_corravg.parmdb', '--g-update-type', 'phase-diag', '--g-max-prior-error', '0.1', '--g-max-post-error', '0.1', '--degridding-OverS', '11', '--degridding-Support', '7', '--degridding-Nw', '100', '--degridding-wmax', '0', '--degridding-Padding', '1.7', '--degridding-NDegridBand', '16', '--degridding-MaxFacetSize', '0.25', '--degridding-MinNFacetPerAxis', '1']' returned non-zero exit status 1.
2020-05-01 13:25:28 CARACal.Stimela.calibrate_cubical_1_0 ERROR: podman returns error code 1
Hey @healytwin1 - sorry for not getting back to you sooner. That is a genuinely strange bug as it means that one of CubiCal's processes has died mysteriously. My instinct is that it is something to do with the environment in which CubiCal is being run. I have zero experience with podman so I can't pretend to know if that is part of the problem. If you are interested in pursuing this further I would suggest trying to run CubiCal outside the container to see if the error persists. If not, then we need to look more closely at how it runs inside podman/singularity.
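For what it's worth, a rough sketch of how one might try that (the PyPI package name and venv path are assumptions; the argument list to reuse is the gocubical command recorded in the traceback above, with the /stimela_mount/... paths replaced by the corresponding host paths):
# Sketch only: install CubiCal in a host virtualenv and re-run the same solve outside the container.
python3 -m venv ~/cubical-venv
source ~/cubical-venv/bin/activate
pip install cubical
# then re-run the logged 'gocubical ...' command against the host copies of the MS and output directories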
@healytwin1 can this be closed, is it "solved"?
Still running into issues. I have mailed ilifu support, but they are not that familiar with lockable memory. I have asked if the system has changed since January and am still waiting on a response. I have had no problems with cubical in the past, so assuming the singularity setup hasn't changed, I don't think it is that. I haven't tried @JSKenyon's suggestion yet, but I'm not sure I will make much headway as I am a complete noob at all this software and I am a bit tied up at the moment with thesis work.
I was the one who said I wasn't too familiar with locked memory before this issue, but I have a reasonable idea now, and I don't think it's a locked memory issue. My understanding is that locked memory ensures that memory is always in RAM and never moved to swap space. However, we don't use swap space on ilifu nodes. Does CubiCal or some tool wrapped around it assume that swap space exists? I'm wondering if an invalid assumption is being made, and causing a crash because it can't allocate memory under those assumptions.
As you can see here, when you log in to an ilifu compute node from the Main partition, there's 236 GB of memory available, and 0 bytes of swap. So I don't think locked memory is relevant here. I've confirmed with the systems team that the memory parameters haven't been changed.
jcollier@slwrk-134:~$ free -h
total used free shared buff/cache available
Mem: 236G 25G 2.2G 1.0M 208G 208G
Swap: 0B 0B 0B
Dane also ran a test on a compute node from the Main partition for me, using 200 GB of memory, which you can see works fine here:
>>> from sys import getsizeof
>>> my_array= [None]*1024*1024*1024*25
>>> print(f'{getsizeof(my_array)/1024/1024/1024} GiB')
200.00000005960464 GiB
top - 14:59:20 up 29 days, 6:29, 1 user, load average: 0.47, 0.49, 0.22
Tasks: 366 total, 1 running, 191 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 24763097+total, 18191920 free, 21095513+used, 18483916 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 34510536 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27278 dane 20 0 0.195t 0.195t 5972 S 0.0 84.7 3:12.87 python3
$ free -h
total used free shared buff/cache available
Mem: 236G 201G 17G 996K 17G 32G
Swap: 0B 0B 0B
As I've suggested to @healytwin1, she could run this on one of the nodes from the HighMem partition, to see how the behaviour changes, and how the output memory limits change, and if it still crashes.
Thanks for looking into this, @Jordatious.
My understanding is that locked memory ensures that memory is always in RAM and never moved to swap space. However, we don't use swap space on ilifu nodes. Does CubiCal or some tool wrapped around it assume that swap space exists?
If you run ulimit -a, you'll see there's a limit of 16M set on locked memory. That seems entirely unnecessary: if you don't have swap anyway, then there is no reason to restrict locked memory. I think this is set system-wide in /etc/security/limits.conf.
CubiCal does not assume swap exists or explicitly use locked memory, but it does use shared memory segments, and I have a nagging suspicion that use of shared memory somehow comes out of the locked memory budget.
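For reference, a sketch of what lifting that limit system-wide might look like (syntax assumed from pam_limits; the ilifu sysadmins would need to confirm it against their setup):
# Hypothetical lines for /etc/security/limits.conf to remove the locked-memory cap for all users:
#   *   soft   memlock   unlimited
#   *   hard   memlock   unlimited
# After logging in again, the new limit can be checked with:
ulimit -l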
So we step past the initial memory allocation error on the HighMem node, but it fails a little later on because it cannot allocate 1.85 GB for an array... (see the attached log: caracal_E1-1274714-stdout.log)
I am currently running a test on one of the normal nodes that has had ulimit -l set to be unlimited. Will report back on that when it has finished running.
Does the HighMem node have a ulimit -l in place?
No, the nodes from the HighMem partition have the same locked memory configuration as the nodes from the Main partition. Btw, that is the default locked memory configuration, so it was not changed either way. We've made a node available to @healytwin1 within a reservation where there is no limit on locked memory, to see if that changes anything. If the test is successful, we should be able to change it across the board, particularly since we don't use swap space. Also @healytwin1, if CubiCal is using it, it's useful to check that nothing exists in /dev/shm.
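A quick way to do that check (plain coreutils, nothing CubiCal-specific):
# list any leftover shared-memory segments and see how full the tmpfs is
ls -lh /dev/shm
df -h /dev/shm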
So I have been running caracal on the unlimited memory lock node (@Jordatious, there is nothing in /dev/shm).
I ran a first test with ncpu = 5, but it failed because the max memory was going to exceed system memory (see attached log: caracal-test1.txt )
The second test with ncpu = 4 was within the max memory limits, but hit an error slightly later on as it could not allocate 5 GB. It managed to step past that error without failing though, but has been stuck at the same point for ~5 hours (see the log, which I downloaded at 17h41 for time reference: caracal-test2.txt)
Looking at this:
Per-solver (worker) memory use estimated at 41.59GiB: 17.61% of total system memory.
This must be an error in cubical, or its memory estimates are way off.
Right, so no ulimit -l on that node, and we get the same problem, sigh...
@healytwin1, could you please try something else for us, run with an experimental branch. This is not likely to solve the problem, but it will give more informative logs:
In your virtualenv, pip install git+https://github.com/ratt-ru/Stimela@cubical_ddf.
Run caracal with -debug. You can use -sw selfcal to start at the selfcal worker right away.
Wait for it to fall over and post the logs. :)
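Consolidated, those steps might look something like this (the -c flag for passing the config file is an assumption; adjust to however you normally invoke caracal):
# Sketch only: experimental Stimela branch, then a caracal debug run starting at the selfcal worker.
pip install git+https://github.com/ratt-ru/Stimela@cubical_ddf
caracal -c config.yml -debug -sw selfcal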
@JSKenyon, this is really pernicious, look below. The [35.8/76.9 41.2/79.7 30.8Gb] shows current and peak memory use. This is a 256GB node, we're barely using any RAM, and yet it tells us "Unable to allocate 5.43 GiB"?
# INFO 13:01:51 - main [io] [35.8/76.9 41.2/79.7 30.8Gb] I/O handler for load 1 save None failed with exception: Unable to allocate 5.43 GiB for an array with shape (1171800, 622, 2) and data type int32
# INFO 13:01:51 - main [io] [35.8/76.9 41.2/79.7 30.8Gb] Traceback (most recent call last):
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 457, in _io_handler
# tile.load(load_model=load_model)
# File "/usr/local/lib/python3.6/dist-packages/cubical/data_handler/ms_tile.py", line 827, in load
# flag_arr0[(self.bflagcol & self.dh._apply_bitflags) != 0] = FL.PRIOR
# MemoryError: Unable to allocate 5.43 GiB for an array with shape (1171800, 622, 2) and data type int32
# Process Process-1:
# Traceback (most recent call last):
# File "/usr/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
# r = call_item.fn(*call_item.args, **call_item.kwargs)
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 457, in _io_handler
# tile.load(load_model=load_model)
# File "/usr/local/lib/python3.6/dist-packages/cubical/data_handler/ms_tile.py", line 827, in load
# flag_arr0[(self.bflagcol & self.dh._apply_bitflags) != 0] = FL.PRIOR
# MemoryError: Unable to allocate 5.43 GiB for an array with shape (1171800, 622, 2) and data type int32
This must be an error in cubical, or its memory estimates are way off.
Why? Do you find it too low? There's only 600-odd channels in this MS, so it doesn't look unreasonable to me. This is for a single worker, remember.
It's not too low, but failing to allocate 17% of memory when you can use all of it sounds like a bug.
The estimation is just that, an estimation, to give a warning up front. If you look at the actual memory use (inside the square brackets), the estimation is doing a reasonable job. It's being run with fairly conservative parameters, so the memory usage is low.
And it's not even failing to allocate 17%. It's failing to allocate a measly 5.43 GB!!!
But let's not be too hasty in calling it "a cubical bug". Smells more like some kind of unholy interaction between Python memory allocation (you can see by the trace it's deep within numpy that it fails), singularity, and something on the node.
I have been trying to run the self-cal worker, but it is failing at the cubical step on the first selfcal loop. I have tried this on both my desktop (running podman) and on ilifu (with singularity); the errors are not the same, but they fail at the same point. I have used the same config file and the same dataset.
The podman error:
and the singularity error: