healytwin1 closed this issue 4 years ago
Presumably the same as stimela issue 577.
Looks like it
I think this happens when cubical does not have enough shared memory. Podman and singularity cannot allocate the resources, so your system admin has to increase the shared memory allocation.
I haven't had this issue with previous versions of caracal on singularity.
It looks like singularity is starting to address this:
I haven't had this issue with previous versions of caracal on singularity.
On the same system, and with the same data?
Yup, the only thing that has changed is the config file and the version of caracal.
Yes it's the same problem as @viralp reported in #577 (yes @viralp, I have not forgotten).
Sadly, I don't think this is a shared memory issue. If you look at the line where the exception is thrown, it's a bog-standard array allocation in all cases, not a shm allocation.
A bit puzzling, because the memory usage reported prior to that ("[71.2/77.2 75.5/80.2 25.6Gb]") is not high at all.
Could you please try to extract this into a standalone failure for us (here's an MS, here's a config.yml, things fail when I run caracal -sw selfcal).
In that case, can you try installing the master branch of stimela:
pip install --force-reinstall -U git+https://github.com/ratt-ru/Stimela
In that case, can you try installing the master branch of stimela:
pip install --force-reinstall -U git+https://github.com/ratt-ru/Stimela
Doing this with both the singularity and podman installations now results in the same error: memory allocation not available.
Could you please try to extract this into a standalone failure for us (here's an MS, here's a config.yml, things fail when I run caracal -sw selfcal).
The MS I have been using spans 130MHz (~87GB before I do anything). I am attaching the config file used and the corresponding log files, which show the errors: epoch1.txt log-caracal-podman.txt log-caracal-singularity.txt
# INFO 18:24:17 - wisdom [0.8/0.8 3.7/3.8 0.0Gb] Peak I/O memory use estimated at 30.15GiB: 12.77% of total system memory.
# INFO 18:24:17 - wisdom [0.8/0.8 3.7/3.8 0.0Gb] Total peak memory usage estimated at 279.66GiB: 118.42% of total system memory.
# INFO 18:24:17 - main [0.8/0.8 3.7/3.8 0.0Gb] Exiting with exception: MemoryError(Estimated memory usage exceeds total system memory. Memory usage can be reduced by lowering the number of chunks, the dimensions of each chunk or the number of worker processes. This error can suppressed by unsetting --dist-safe.)
# Traceback (most recent call last):
This is good! The wisdom files are working. Given the estimated memory usage overflow, I would re-run with 5 CPUs instead of 7.
@SpheMakh, when and how exactly can that happen, and how can we warn/inform the user?
Also, how do we remove the error? (Why would the memory usage be reduced by reducing the number of CPUs?)
Glancing at the log (bear in mind I don't know precisely what you are doing), there are a handful of things which could be improved:
- --data-time-chunk could be set equal to the solution interval on G. CubiCal operates in parallel over chunks, so more, smaller chunks are usually preferable.
- --dist-max-chunks is set to 1. This will basically disable parallelism, as only a single chunk (with some caveats) will be loaded into memory. It should not be lower than --dist-ncpu, and I will consider putting a check in for this.
- --dist-nthread can be set higher if you have spare compute capacity but are running into memory woes. This should (imperfectly) enable nested parallelism, which enhances compute without compromising memory.
In reply to your question @gigjozsa, the answer is implicit in the way CubiCal operates in parallel over chunks using multiprocessing. Each process is assigned a chunk, and processing that chunk has a certain memory footprint. By using fewer processes, fewer chunks are processed simultaneously, which in turn reduces the memory footprint.
For the example above I strongly suggest setting --data-time-chunk 120, --dist-max-chunks 5 and --dist-ncpu 6. Setting --dist-nthread to your spare compute capacity might also help.
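Put together, a minimal command-line sketch of those settings (the MS path is a placeholder, and in a caracal run these values would be set through the config rather than passed to gocubical directly):
# Sketch only: the suggested distribution/chunking settings passed straight to gocubical.
# --dist-ncpu 6 roughly corresponds to 1 I/O process plus 5 solver processes, and
# --dist-max-chunks 5 keeps at most 5 chunks resident in memory at once.
gocubical --sol-jones G --data-ms /path/to/my.ms \
          --data-time-chunk 120 --dist-ncpu 6 --dist-max-chunks 5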
Thanks, @JSKenyon. Is there any way to predict this?
Another point to consider: does G truly need to be solved over all the channels? Having an infinite solution interval along an axis makes it impossible to chunk the data along that axis, which in turn limits the number of chunks that can be processed simultaneously. If anyone is interested/motivated, it might be interesting to solve for G using some generous but finite solution interval (say 50% of the channels) and similar chunking. It might then be possible to quantify how the quality of the solution was impacted by the reduction in solution interval (my suspicion is that it may in fact be improved). The only lingering concern is one of source suppression, but someone with more know-how than I could likely quantify that too.
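For concreteness, a hedged sketch of that experiment (the 311-channel interval is purely illustrative, roughly half of the ~622 channels in this MS; whether it is acceptable scientifically is exactly the open question above):
# Sketch only: give G a generous but finite frequency solution interval and chunk to match,
# so that the data can also be split along the frequency axis.
gocubical --sol-jones G --data-ms /path/to/my.ms \
          --g-time-int 240 --g-freq-int 311 \
          --data-time-chunk 240 --data-freq-chunk 311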
What I mean is: if the user informs us about a maximum memory allocation, can we adjust the options accordingly?
@gigjozsa The wisdom module I added is a step in that direction but it is a non-trivial problem. Changing options for the user can also be a bit of a trap. Bear in mind that things like solution interval place some fundamental limits on problem size, particularly when they are infinite, and adjusting solution interval is a non-starter for many reasons.
Should we close this issue, then (once the discussion about the best parameter choice, which is very helpful, is finished), and create a new one to propagate wisdom warnings/insights to the caracal user?
Let's see if @healytwin1's run goes through with the suggestions.
Changed those parameters, but still get the memory error on both podman and singularity.
Hey @healytwin1. Could you please post the log from your most recent failed run?
Thanks! Out of interest, are you the only person using the system? Those are genuine memory allocation errors.
It is entirely possible that my wisdom estimates are failing, but I would need the MS to be sure. The podman example is unlikely to work as it is configured with only 32GB of RAM.
Thanks! Out of interest, are you the only person using the system? Those are genuine memory allocation errors.
Yes. The podman one is my desktop, so I am the only one using it. And with the singularity run, I made sure that I was allocated the maximum memory and cpus for a single node on ilifu.
It is entirely possible that my wisdom estimates are failing, but I would need the MS to be sure. The podman example is unlikely to work as it is configured with only 32GB of RAM.
My desktop also has ~500GB of swap, so I thought this should still be able to run, albeit much slower than on a higher-memory system.
Ah ok. In the podman case, you can try setting --dist-safe 0, but it will likely take days to process if it processes at all. Currently CubiCal is killing itself because it knows it will hit swap, which is usually game over.
Could you please post the output of ulimit -a for the ilifu node?
@o-smirnov Was Viral also working on an ilifu/IDIA node? I am growing suspicious.
Ah ok. In the podman case, you can try setting --dist-safe 0, but it will likely take days to process if it processes at all. Currently CubiCal is killing itself because it knows it will hit swap, which is usually game over.
Hmm, okay. Will try that.
Could you please post the output of ulimit -a for the ilifu node?
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 967206
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) 247463936
open files (-n) 16384
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 967206
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Maybe contact the sysadmin. The maximum locked memory seems to be 16 MB, which is far too little. If I remember correctly you can actually specify that, but I believe ilifu changed some rules. I hope I will find the time to test this today.
Yeah, I think @gigjozsa is right, the ulimit -l setting is way too low. Can you change it yourself, @healytwin1? What does it say if you do ulimit -l unlimited?
ulimit -l unlimited
gives:
bash: ulimit: max locked memory: cannot modify limit: Operation not permitted
I had used the following command that was supposed to assign the maximum resources for a particular node:
srun -N 1 --mem 236G --ntasks-per-node 1 --cpus-per-task 32 --time 13-24 --pty bash
I think you must shake the sysadmins about this. There's no reason to allocate you 236G and then make only 16M lockable. It's downright perverse.
pinging @viralp
Sorry for jumping late into the discussion. I installed Caracal and Stimela (master branch) on the Rhodes server ike and on the com04 machine, which ensures the same packages and singularity installation on both. I was getting the same memory error on ike that @healytwin1 reported, but the same process with the same config file ran smoothly on com04. So I think the issue is in the system-wide memory allocation.
I have managed to step over the memory error (on podman) for now by frequency-averaging my dataset, using a larger solution interval and decreasing the number of CPUs, but I now hit another error which I do not understand:
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Detected a total of 31.32GiB of system memory.
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Per-solver (worker) memory use estimated at 6.74GiB: 21.51% of total system memory.
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Peak I/O memory use estimated at 7.20GiB: 22.98% of total system memory.
# INFO 11:24:52 - wisdom [0.8/0.9 1.9/2.0 0.2Gb] Total peak memory usage estimated at 20.67GiB: 66.00% of total system memory.
# INFO 11:24:52 - data_handler [0.8/0.9 1.9/2.0 0.2Gb] inserting new column BITFLAG
# INFO 11:24:52 - data_handler [0.8/0.9 1.9/2.0 0.2Gb] inserting new column BITFLAG_ROW
# INFO 11:24:53 - data_handler [0.9/0.9 1.9/2.0 0.2Gb] auto-filling bitflag 'legacy' from FLAG/FLAG_ROW column. Please do not interrupt this process!
# INFO 11:24:53 - data_handler [0.9/0.9 1.9/2.0 0.2Gb] note that all other bitflags will be cleared by this
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] auto-fill complete
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] BITFLAG column defines the following flagsets: legacy:1
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] will exclude flagset cubical
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] flagset 'cubical' not found -- ignoring
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] applying BITFLAG mask 1 to input data
# INFO 11:25:05 - data_handler [0.9/1.5 1.9/2.7 0.2Gb] will save output flags into BITFLAG 'cubical' (2), and into FLAG/FLAG_ROW
# INFO 11:25:05 - main [0.9/1.5 2.0/2.7 0.2Gb] waiting for I/O on tile 0/5
# INFO 11:25:05 - main [io] [0.8/0.8 1.9/1.9 0.2Gb] loading tile 0/5
# INFO 11:25:05 - ms_tile [io] [0.8/0.8 1.9/1.9 0.2Gb] tile 0/5: reading MS rows 0~874943
# INFO 11:25:05 - data_handler [io] [0.8/0.8 1.9/1.9 0.2Gb] reading DATA
# INFO 11:25:07 - data_handler [io] [3.3/3.3 4.7/4.7 0.2Gb] reading BITFLAG
# INFO 11:25:15 - ms_tile [io] [4.7/6.1 5.8/7.2 0.2Gb] 6.29% input visibilities flagged and/or deselected
# INFO 11:25:27 - main [0.9/1.5 2.0/2.7 1.1Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
# Traceback (most recent call last):
# File "/usr/local/lib/python3.6/dist-packages/cubical/main.py", line 566, in main
# stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 214, in run_process_loop
# return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 274, in _run_multi_process_loop
# if not io_futures[itile].result():
# File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
# return self.__get_result()
# File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
# raise self._exception
# concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
#
# Traceback (most recent call last):
# File "/stimela_mount/code/run.py", line 57, in <module>
# subprocess.check_call(shlex.split(_runc))
# File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
# raise CalledProcessError(retcode, cmd)
# subprocess.CalledProcessError: Command '['gocubical', '--sol-jones', 'G', '--data-ms', '/stimela_mount/msdir/HighZ_1563407160_XXYY-Abell2626_corravg.ms', '--data-column', 'DATA', '--data-time-chunk', '240', '--data-freq-chunk', '0', '--model-list', 'MODEL_DATA', '--model-ddes', 'never', '--montblanc-dtype', 'float', '--weight-column', 'WEIGHT', '--flags-apply', '-cubical', '--madmax-enable', '1', '--madmax-estimate', 'corr', '--madmax-plot', '1', '--madmax-threshold', '0,10', '--sel-diag', '1', '--sol-term-iters', '5,0', '--bbc-save-to', '/stimela_mount/output/continuum/selfcal_products/bbc-gains-1-HighZ_1563407160_XXYY-Abell2626_corravg.parmdb', '--dist-ncpu', '3', '--dist-max-chunks', '3', '--out-name', '/stimela_mount/output/continuum/selfcal_products/HighZ_E1_v1_HighZ_1563407160_XXYY-Abell2626_corravg_1_cubical', '--out-overwrite', '1', '--out-mode', 'sc', '--out-casa-gaintables', '1', '--out-plots', '1', '--dist-nworker', '0', '--log-memory', '1', '--log-boring', '1', '--g-time-int', '240', '--g-freq-int', '0', '--g-clip-low', '0.5', '--g-clip-high', '2.0', '--g-solvable', '1', '--g-type', 'phase-diag', '--g-save-to', '/stimela_mount/output/continuum/selfcal_products/g-phase-gains-1-HighZ_1563407160_XXYY-Abell2626_corravg.parmdb', '--g-update-type', 'phase-diag', '--g-max-prior-error', '0.1', '--g-max-post-error', '0.1', '--degridding-OverS', '11', '--degridding-Support', '7', '--degridding-Nw', '100', '--degridding-wmax', '0', '--degridding-Padding', '1.7', '--degridding-NDegridBand', '16', '--degridding-MaxFacetSize', '0.25', '--degridding-MinNFacetPerAxis', '1']' returned non-zero exit status 1.
2020-05-01 13:25:28 CARACal.Stimela.calibrate_cubical_1_0 ERROR: podman returns error code 1
Hey @healytwin1 - sorry for not getting back to you sooner. That is a genuinely strange bug as it means that one of CubiCal's processes has died mysteriously. My instinct is that it is something to do with the environment in which CubiCal is being run. I have zero experience with podman so I can't pretend to know if that is part of the problem. If you are interested in pursuing this further I would suggest trying to run CubiCal outside the container to see if the error persists. If not, then we need to look more closely at how it runs inside podman/singularity.
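For what it's worth, a rough sketch of how one might try that (the PyPI package name and venv path are assumptions; the argument list to reuse is the gocubical command recorded in the traceback above, with the /stimela_mount/... paths replaced by the corresponding host paths):
# Sketch only: install CubiCal in a host virtualenv and re-run the same solve outside the container.
python3 -m venv ~/cubical-venv
source ~/cubical-venv/bin/activate
pip install cubical
# then re-run the logged 'gocubical ...' command against the host copies of the MS and output directories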
@healytwin1 can this be closed, is it "solved"?
Still running into issues. I have mailed ilifu support, but they are not that familiar with lockable memory. I have asked if the system has changed since January and am still waiting on a response. I have had no problems with cubical in the past, so assuming the singularity setup hasn't changed, I don't think it is that. I haven't tried @JSKenyon's suggestion yet, but I'm not sure I will make much headway as I am a complete noob at all this software and I am a bit tied up at the moment with thesis work.
I was the one who said I wasn't too familiar with locked memory before this issue, but I have a reasonable idea now, and I don't think it's a locked memory issue. My understanding is that locked memory ensures that memory is always in RAM and never moved to swap space. However, we don't use swap space on ilifu nodes. Does CubiCal or some tool wrapped around it assume that swap space exists? I'm wondering if an invalid assumption is being made, and causing a crash because it can't allocate memory under those assumptions.
As you can see here, when you log in to an ilifu compute node from the Main partition, there's 236 GB of memory available, and 0 bytes of swap. So I don't think locked memory is relevant here. I've confirmed with the systems team that the memory parameters haven't been changed.
jcollier@slwrk-134:~$ free -h
total used free shared buff/cache available
Mem: 236G 25G 2.2G 1.0M 208G 208G
Swap: 0B 0B 0B
Dane also ran a test on a compute node from the Main partition for me, using 200 GB of memory, which you can see works fine here:
>>> from sys import getsizeof
>>> my_array= [None]*1024*1024*1024*25
>>> print(f'{getsizeof(my_array)/1024/1024/1024} GiB')
200.00000005960464 GiB
top - 14:59:20 up 29 days, 6:29, 1 user, load average: 0.47, 0.49, 0.22
Tasks: 366 total, 1 running, 191 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 24763097+total, 18191920 free, 21095513+used, 18483916 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 34510536 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27278 dane 20 0 0.195t 0.195t 5972 S 0.0 84.7 3:12.87 python3
$ free -h
total used free shared buff/cache available
Mem: 236G 201G 17G 996K 17G 32G
Swap: 0B 0B 0B
As I've suggested to @healytwin1, she could run this on one of the nodes from the HighMem partition, to see how the behaviour changes, and how the output memory limits change, and if it still crashes.
Thanks for looking into this, @Jordatious.
My understanding is that locked memory ensures that memory is always in RAM and never moved to swap space. However, we don't use swap space on ilifu nodes. Does CubiCal or some tool wrapped around it assume that swap space exists?
If you run ulimit -a, you'll see there's a limit of 16M set on locked memory. That seems entirely unnecessary: if you don't have swap anyway, then there is no reason to restrict locked memory. I think this is set system-wide in /etc/security/limits.conf.
CubiCal does not assume swap exists or explicitly use locked memory, but it does use shared memory segments, and I have a nagging suspicion that use of shared memory somehow comes out of the locked memory budget.
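For reference, a sketch of what lifting that limit system-wide might look like (syntax assumed from pam_limits; the ilifu sysadmins would need to confirm it against their setup):
# Hypothetical lines for /etc/security/limits.conf to remove the locked-memory cap for all users:
#   *   soft   memlock   unlimited
#   *   hard   memlock   unlimited
# After logging in again, the new limit can be checked with:
ulimit -l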
So we step past the initial memory allocation error on the HighMem node, but it fails a little later on because it cannot allocate 1.85 GB for an array... (see the attached log: caracal_E1-1274714-stdout.log)
I am currently running a test on one of the normal nodes that has had ulimit -l set to be unlimited. Will report back on that when it has finished running.
Does the HighMem node have a ulimit -l in place?
No, the nodes from the HighMem partition have the same locked memory configuration as the nodes from the Main partition. Btw, that is the default locked memory configuration, so it was not changed either way. We've made a node available to @healytwin1 within a reservation where there is no limit on locked memory, to see if that changes anything. If the test is successful, we should be able to change it across the board, particularly since we don't use swap space. Also @healytwin1, if CubiCal is using it, it's useful to check that nothing exists in /dev/shm.
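A quick way to do that check (plain coreutils, nothing CubiCal-specific):
# list any leftover shared-memory segments and see how full the tmpfs is
ls -lh /dev/shm
df -h /dev/shm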
So I have been running caracal on the unlimited memory lock node (@Jordatious, there is nothing in /dev/shm).
I ran a first test with ncpu = 5, but it failed because the max memory was going to exceed system memory (see attached log: caracal-test1.txt )
The second test with ncpu = 4 was within the max memory limits, but hit an error slightly later on as it could not allocate 5 GB. It managed to step past that error without failing though, but has been stuck at the same point for ~5 hours (see the log, which I downloaded at 17h41 for time reference: caracal-test2.txt)
Looking at this:
Per-solver (worker) memory use estimated at 41.59GiB: 17.61% of total system memory.
This must be an error in cubical, or its memory estimates are way off.
Right, so no ulimit -l on that node, and we get the same problem, sigh...
@healytwin1, could you please try something else for us, run with an experimental branch. This is not likely to solve the problem, but it will give more informative logs:
In your virtualenv, pip install git+https://github.com/ratt-ru/Stimela@cubical_ddf.
Run caracal with -debug. You can use -sw selfcal to start at the selfcal worker right away.
Wait for it to fall over and post the logs. :)
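Consolidated, those steps might look something like this (the -c flag for passing the config file is an assumption; adjust to however you normally invoke caracal):
# Sketch only: experimental Stimela branch, then a caracal debug run starting at the selfcal worker.
pip install git+https://github.com/ratt-ru/Stimela@cubical_ddf
caracal -c config.yml -debug -sw selfcal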
@JSKenyon, this is really pernicious, look below. The [35.8/76.9 41.2/79.7 30.8Gb] shows current and peak memory use. This is a 256GB node, we're barely using any RAM, and yet it tells us "Unable to allocate 5.43 GiB"?
# INFO 13:01:51 - main [io] [35.8/76.9 41.2/79.7 30.8Gb] I/O handler for load 1 save None failed with exception: Unable to allocate 5.43 GiB for an array with shape (1171800, 622, 2) and data type int32
# INFO 13:01:51 - main [io] [35.8/76.9 41.2/79.7 30.8Gb] Traceback (most recent call last):
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 457, in _io_handler
# tile.load(load_model=load_model)
# File "/usr/local/lib/python3.6/dist-packages/cubical/data_handler/ms_tile.py", line 827, in load
# flag_arr0[(self.bflagcol & self.dh._apply_bitflags) != 0] = FL.PRIOR
# MemoryError: Unable to allocate 5.43 GiB for an array with shape (1171800, 622, 2) and data type int32
# Process Process-1:
# Traceback (most recent call last):
# File "/usr/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
# r = call_item.fn(*call_item.args, **call_item.kwargs)
# File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 457, in _io_handler
# tile.load(load_model=load_model)
# File "/usr/local/lib/python3.6/dist-packages/cubical/data_handler/ms_tile.py", line 827, in load
# flag_arr0[(self.bflagcol & self.dh._apply_bitflags) != 0] = FL.PRIOR
# MemoryError: Unable to allocate 5.43 GiB for an array with shape (1171800, 622, 2) and data type int32
This must be an error in cubical, or its memory estimates are way off.
Why? Do you find it too low? There's only 600-odd channels in this MS, so it doesn't look unreasonable to me. This is for a single worker, remember.
It's not too low, but failing to allocate 17% of memory when you can use all of it sounds like a bug.
The estimation is just that, an estimation, to give a warning up front. If you look at the actual memory use (inside the square brackets), the estimation is doing a reasonable job. It's being run with fairly conservative parameters, so the memory usage is low.
And it's not even failing to allocate 17%. It's failing to allocate a measly 5.43 GB!!!
But let's not be too hasty in calling it "a cubical bug". Smells more like some kind of unholy interaction between Python memory allocation (you can see by the trace it's deep within numpy that it fails), singularity, and something on the node.
I have been trying to run the self-cal worker, but it is failing at the cubical step on the first selfcal loop. I have tried this on both my desktop (running podman) and on ilifu (with singularity); the errors are not the same, but they fail at the same point. I have used the same config file and the same dataset.
The podman error:
and the singularity error: