Are the containers public, and can you give more information about exactly what is being run in TADbit and Trinity stacks when things go into swap?
Any additional log information, singularity -d .... output, process-specific VIRT/RES data, etc. from the affected jobs may be useful. @cclerget may be able to have a think about possible causes tomorrow also.
The only similar issue with weird early swapping I've run into in the past with Trinity was on a RHEL6 cluster without any cgroups involvement, and it could be replicated outside of a Singularity container. It was specific to Java applications making very large memory allocations in that case, and seemed to be made worse by large GPFS pagepool settings.
The containers and data are owned by our users, not us, and I don't know if they're public. We will have to check with the users in question about that. I'm a sysadmin and don't know the science of these users very well, so I don't know exactly what they're doing.
I think you're on to something WRT the Trinity Java scenario you describe. I've instrumented one of the TADbit Python cases I mentioned to log its memory and swap behavior over time, and what I see is that its memory usage goes up to ~85 GB in the first couple minutes after startup, quickly drops to <10 GB, slowly creeps back up to ~40 GB and levels off until ~45 minutes in, whereupon the memory usage quickly shoots up to 120+ GB and the swapping begins. I'm rerunning that instrumented case with singularity -d right now and should have its results tomorrow.
vm.swappiness = 60
@tabaer It's too high for HPC use if you want to minimize swapping; a value between 0 and 10 would be more appropriate. This matters even more when a cgroups memory limit is in place for the job: with cgroups enabled, the cgroup's memory.swappiness value for the job will also be 60 by default. Is that the case here?
If I understand cgroup memory correctly, it tells the kernel to start swapping out when the cgroup reaches 60% of free memory (60% of memory.limit_in_bytes).
So a first attempt to keep the job from swapping would be to set memory.swappiness to 0, if TORQUE allows configuring that, unless it already sets it to 0 or a low value. With memory.swappiness set to 0, the job should be killed once it hits memory.limit_in_bytes, without swapping and without interfering with other jobs.
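If it helps, checking and overriding the per-job value is roughly the following. This is only a sketch: it assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and the torque/<jobid> layout shown later in this thread, so the exact paths may differ on your nodes.

# Hypothetical example job ID; substitute the real Torque job ID
JOBID=9369100.owens-batch.ten.osc.edu
CG=/sys/fs/cgroup/memory/torque/$JOBID

# Current per-cgroup swappiness (inherits from vm.swappiness, usually 60)
cat "$CG/memory.swappiness"

# Try 0 so the job is OOM-killed at memory.limit_in_bytes instead of swapping
echo 0 | sudo tee "$CG/memory.swappiness"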
vm.swappiness = 60
@tabaer It's too high for HPC use if you want to minimize swapping; a value between 0 and 10 would be more appropriate. This matters even more when a cgroups memory limit is in place for the job: with cgroups enabled, the cgroup's memory.swappiness value for the job will also be 60 by default. Is that the case here?
I've tried setting memory.swappiness to 0 in some memory consumer test jobs, and it made no difference in their behavior that I could find. However, I have not tried it with these specific jobs, so I'll try that next. Thanks.
BTW, in my tests last night, I found that the TADbit Python case I mentioned earlier will use ~132 GB of memory if it's available, i.e. 10% more than what is usable on most of our nodes.
Setting swappiness=0 just in the job cgroup didn't have much of an effect on a Singularity job either. I'll try setting vm.swappiness=0 node-wide next and see if that changes anything.
What I've observed with vm.swappiness=0 is that there is indeed no swapping, but the processes still become unresponsive at the point where they would normally start swapping and quickly go into the unkillable D state. It looks like the OOM killer does fire to try to kill these processes, but not until after they're unkillable. At that point, the only way to get the node working again is to reboot it, which is actually worse than the default behavior at vm.swappiness=60 (where the node swaps its brains out but will recover if the job is killed).
What's really strange about this is that other memory intensive programs such as the memory consumer test I've alluded to don't behave like these Singularity jobs at vm.swappiness=60; they bump up against their memory limits and maybe swap a little bit before they are killed off by the OOM killer as expected.
@tabaer - when they are stuck in D state and are unkillable, do you know what's going on in the jobs? Are they in the midst of heavy I/O - and it's possibly GPFS related stuff being swapped out that is resulting in the system getting hung-up and OOM kills not taking effect?
If what's running inside these Singularity jobs is doing a lot of large file I/O through GPFS, it'd be interesting to know if there's the same behavior when the data is on some other form of storage, if available. Also, I wonder if you've looked at slabtop and seen anything weird with slab allocations?
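For reference, a one-shot look at the largest slab caches would be something like the following (slabtop flags per procps; the /proc/slabinfo fallback needs root):

# Snapshot of slab caches sorted by total cache size, to spot unusually
# large I/O-related allocations (dentry/inode caches, filesystem buffers, etc.)
slabtop -o -s c | head -n 25

# Or the raw counters (column 3 is the number of objects per cache)
sudo cat /proc/slabinfo | sort -k3 -nr | head -n 25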
I ask this as I previously worked in a place where we had some issues with certain GPFS configs and weird OOM behavior - though that was on nodes with smaller RAM and an excessive pagepool size I seem to recall. It was also observed in non-singularity jobs... but they were similar bioinformatics large data / high mem usage tools as far as I can remember.
Final question - have you run the singularity stuff direct from a shell in any of these tests vs under Torque with its cgroups in place? If so, same behavior?
Sorry I'm just firing a bunch more questions. This seems like really strange behavior, and I have no real clue what would cause it. It's giving me flashbacks to an unrelated maddening thing with RHEL6 huge pages and kswapd going crazy from my HPC admin past.
Another thing I might suggest is an appeal on the mailing list... there are potentially folks there with ideas who might not see this issue.
https://groups.google.com/a/lbl.gov/forum/#!forum/singularity
If what's running inside these Singularity jobs is doing a lot of large file I/O through GPFS it'd be interesting to know if there's the same behavior if the data is on some other form of storage, if available. Also, I wonder if you've looked at slabtop and seen anything weird with slab allocations?
I have not tried running these cases on local disk, because I've been trying to simulate what our users have been doing as much as possible.
I also tried this with vm.swappiness=10, which showed pretty much the same behavior as vm.swappiness=60, except that it waited a little longer to start swapping. My next test is going to be disabling swap on the node.
Going back through this a bit I noticed I missed in the first post...
settled out to using ~80GB of physical memory and 48GB of swap
... being suspiciously equal to the node total ...
128GB of memory (~120GB usable)
As if the process had a view that the total RAM on the machine was 128GB and entirely available to it, so it could expand to that. Not clear exactly what triggers it to swap way before the 120GB usable is hit... but it's something a bit odd to look at.
Are you able to post exact detail of the cgroup config that Torque is enforcing for the job, or if you are setting manually when trying to debug?
Then - if you submit a job which doesn't run the real workload, but does singularity exec cat /proc/self/cgroup, do you get the expected output? Any non-default singularity config? Is it running in setuid mode or using user namespace?
Disabling swap looks a lot like vm.swappiness=0; the processes get into the unkillable D state, and the OOM killer can't kill them, and the only way to get the node back in a sane state is to reboot it.
Disabling swap looks a lot like vm.swappiness=0; the processes get into the unkillable D state, and the OOM killer can't kill them, and the only way to get the node back in a sane state is to reboot it.
Which processes is it in this case? TADbit / trinity in Singularity / the memory hog? Can you provide the command line or the Torque job?
Looking at the trinity case it appears the butterfly portion of trinity is going to expect to allocate a 10GB Java heap for each CPU it's told to use
https://trinityrnaseq.github.io/performance/mem.html
Wondering what --CPU is set to for the run there, and whether early swapping, in at least the Trinity case, could be related to how massive amounts of Java heap are being allocated? Just trying to pin down some kind of specific area to think about first.
How is trinity being run compared to their Singularity example?
https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-in-Docker#trinity_singularity
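If it helps narrow that down, a quick check on a node while the Butterfly phase is running could be along these lines. The assumption here is that the heap size is passed as -Xmx on the java command lines, which is how Trinity's Butterfly commands are usually launched:

# How many Java workers are running, and with what -Xmx heap settings?
pgrep -af java | grep -oE -- '-Xmx[0-9]+[KkMmGg]' | sort | uniq -c

# Their resident/virtual sizes, for comparison against the requested heaps
ps -C java -o pid,rss,vsz,args --sort=-rss | head -n 20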
Disabling swap looks a lot like vm.swappiness=0; the processes get into the unkillable D state, and the OOM killer can't kill them, and the only way to get the node back in a sane state is to reboot it.
Which processes is it in this case?
Sorry, it's the TADbit Python case. The job executes it like this:
singularity -d exec tadbit.sif python WT_Compartments_and_TADs_f_WT.py
The Python script executed above is from the user.
if you submit a job which doesn't run the real workload, but does singularity exec cat /proc/self/cgroup - do you get the expected output?
Yes, it looks the same AFAICT:
troy@owens-login02:/fs/scratch/PZS0708/troy/WT_Tcell_matrix$ qsub -I -l nodes=1:ppn=28,mem=118gb
qsub: waiting for job 9369100.owens-batch.ten.osc.edu to start
qsub: job 9369100.owens-batch.ten.osc.edu ready
troy@o0109:~$ cd $PBS_O_WORKDIR
troy@o0109:/fs/scratch/PZS0708/troy/WT_Tcell_matrix$ cat /proc/self/cgroup
11:devices:/torque/9369100.owens-batch.ten.osc.edu
10:pids:/system.slice/pbs_mom.service
9:cpuset:/torque/9369100.owens-batch.ten.osc.edu
8:net_prio,net_cls:/
7:perf_event:/
6:blkio:/system.slice/pbs_mom.service
5:cpuacct,cpu:/torque/9369100.owens-batch.ten.osc.edu
4:memory:/torque/9369100.owens-batch.ten.osc.edu
3:freezer:/
2:hugetlb:/
1:name=systemd:/system.slice/pbs_mom.service
troy@o0109:/fs/scratch/PZS0708/troy/WT_Tcell_matrix$ singularity exec tadbit.sif cat /proc/self/cgroup
11:devices:/torque/9369100.owens-batch.ten.osc.edu
10:pids:/system.slice/pbs_mom.service
9:cpuset:/torque/9369100.owens-batch.ten.osc.edu
8:net_prio,net_cls:/
7:perf_event:/
6:blkio:/system.slice/pbs_mom.service
5:cpuacct,cpu:/torque/9369100.owens-batch.ten.osc.edu
4:memory:/torque/9369100.owens-batch.ten.osc.edu
3:freezer:/
2:hugetlb:/
1:name=systemd:/system.slice/pbs_mom.service
Any non-default singularity config? Is it running in setuid mode or using user namespace?
AFAIK, we're not doing anything clever in our Singularity configs other than bind-mounting our GPFS file systems and a couple config files. I believe it's using user namespaces, but I will have to verify that with @treydock.
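To pin that down, the checks I'm planning to run are roughly the following. The config path is the default for an RPM install, and the uid_map heuristic is my assumption (a full identity map means no user namespace, i.e. setuid mode):

# Is the setuid starter allowed in the installed config?
grep -i 'allow setuid' /etc/singularity/singularity.conf

# Inside the container: "0 0 4294967295" means no user namespace (setuid mode);
# a single narrow mapping means user namespaces are in use
singularity exec tadbit.sif cat /proc/self/uid_map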
BTW, I can't tell if this is relevant or not, but once these processes get stuck in the D state, there are messages in syslog about them where the kernel stack traces mention squashfs, which is presumably related to accessing the container:
Feb 13 16:26:41 o0448 kernel: INFO: task python:176153 blocked for more than 120 seconds.
Feb 13 16:26:41 o0448 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 16:26:41 o0448 kernel: python D ffff9da4a94d8000 0 176153 147944 0x00100004
Feb 13 16:26:41 o0448 kernel: Call Trace:
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d60368>] ? queued_spin_lock_slowpath+0xb/0xf
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d6b8b9>] schedule+0x29/0x70
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca5735>] squashfs_cache_get+0x105/0x3c0 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca5ed8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffff846c4280>] ? wake_up_atomic_t+0x30/0x30
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca6001>] squashfs_get_datablock+0x21/0x30 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca7262>] squashfs_readpage+0x882/0xbe0 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffff847c54b8>] __do_page_cache_readahead+0x248/0x260
Feb 13 16:26:41 o0448 kernel: [<ffffffff847c5aa1>] ra_submit+0x21/0x30
Feb 13 16:26:41 o0448 kernel: [<ffffffff847ba575>] filemap_fault+0x105/0x490
Feb 13 16:26:41 o0448 kernel: [<ffffffff848353be>] ? mem_cgroup_reclaim+0x4e/0x120
Feb 13 16:26:41 o0448 kernel: [<ffffffff847e618a>] __do_fault.isra.59+0x8a/0x100
Feb 13 16:26:41 o0448 kernel: [<ffffffff847e673c>] do_read_fault.isra.61+0x4c/0x1b0
Feb 13 16:26:41 o0448 kernel: [<ffffffff847eb0c4>] handle_pte_fault+0x2f4/0xd10
Feb 13 16:26:41 o0448 kernel: [<ffffffff847edbfd>] handle_mm_fault+0x39d/0x9b0
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d73623>] __do_page_fault+0x203/0x4f0
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d73945>] do_page_fault+0x35/0x90
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d6fb7f>] ? error_exit+0x1f/0x60
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d6f778>] page_fault+0x28/0x30
Thanks for this information.
If the container is on GPFS then I think the squashfs dmesg stuff makes sense. From past experience this certainly sounds like the situation where the processes are stuck uninterruptible as they are trying to do I/O and the GPFS daemons / kernel stuff can't satisfy the I/O due to there being no RAM available on the machine for them to work properly.
Given the cgroup limits are visible from a container process as they should be, at this point I'm not convinced there's any Singularity-related cause of this behavior, unless the same TADbit or Trinity jobs are shown to succeed with the same job limits when they are not run in a container. Has that been confirmed?
I don't think we can go much further into troubleshooting without getting a lot more specific information about the exact jobs that are failing, and their state at the time they fail, including things like:
- Does slabtop show large allocations related to I/O at problematic times?
- The debug output when singularity -d ..... is used to run the job and you hit these issues?
- What's the GPFS pagepool size on the nodes?
- If you set the cgroups memory limit << RAM (e.g. 96GB) do you observe proper execution or functional cgroup OOM kills?
- A snapshot process listing at a problematic time, showing RES/VIRT etc. allocations.
Cheers.
What's the GPFS pagepool size on the nodes?
On the cluster where the TADbit cases have been seen (Owens), it's 1.5 GB out of total physical memory of 128GB. On the cluster where the Trinity case was observed (Pitzer), it's 3 GB out of total physical memory of 192 GB.
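(For anyone wanting to check their own nodes, the effective value can be read with mmlsconfig, assuming the GPFS admin commands are in their usual location:)

# Effective GPFS pagepool on this node
/usr/lpp/mmfs/bin/mmlsconfig pagepool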
If you set the cgroups memory limit << RAM (e.g. 96GB) do you observe proper execution or functional cgroup OOM kills?
No. That's actually how we found this issue in the first place -- one of these TADbit Python jobs asked for nodes=1:ppn=24,mem=64gb and ended up ~12GB into swap with ~60GB of memory free on the node. It turned out that that particular case actually needed a memory limit more like 70GB to keep from swapping. With the second TADbit Python case that I've been using today (the one that actually needs 132GB total), if I run it asking for nodes=1:ppn=28,mem=64gb (i.e. about half of physical memory, much less than it needs), it will go 48GB into swap in a couple minutes while there's >100GB of memory free.
A snapshot process listing at a problematic time, showing RES/VIRT etc. allocations.
During the TADbit Python case I mentioned above that requested 64gb when it needs more like 132gb:
# ps auxwf
[...system processes removed for brevity...]
root 25428 0.0 0.0 389624 84428 ? SLsl Feb03 12:28 /opt/torque/sbin/pbs_mom -d /var/spool/torque -H o0116
6624 9970 0.0 0.0 125868 860 ? Ss 17:51 0:00 \_ -bash /var/spool/torque/mom_priv/jobs/9369472.owens-ba.
6624 10144 3.1 0.0 147952 2316 ? S 17:51 0:45 \_ /usr/bin/python ./memmon
6624 10145 0.0 0.0 115372 760 ? S 17:51 0:00 \_ /bin/bash ./swapmon
6624 10154 2.2 0.0 112288 808 ? S 17:51 0:32 | \_ sar -B 1 14400
6624 10156 3.2 0.0 117468 988 ? S 17:51 0:45 | \_ sadc 1 14401 -z
6624 10146 0.1 0.0 116896 1032 ? S 17:51 0:02 \_ /bin/bash ./slabmon
root 15682 0.0 0.0 19756 344 ? D 18:15 0:00 | \_ sudo /bin/cat /proc/slabinfo
6624 10147 0.5 0.0 420960 752 ? Sl 17:51 0:07 \_ Singularity runtime parent
6624 10224 4.3 0.2 3402396 279500 ? Sl 17:51 1:02 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11100 4.3 1.0 4957852 1365872 ? D 17:52 0:59 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11101 4.3 1.1 4858524 1476348 ? D 17:52 0:58 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11102 4.1 0.2 3815068 347864 ? S 17:52 0:57 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11103 4.3 1.0 4898972 1363504 ? D 17:52 0:58 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11104 4.0 0.4 4075420 590392 ? D 17:52 0:54 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11105 4.2 0.9 4922780 1285884 ? D 17:52 0:58 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11106 3.7 0.1 3570588 219844 ? S 17:52 0:50 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11107 4.2 0.1 4821916 232084 ? D 17:52 0:57 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11108 4.2 0.1 4811164 192888 ? D 17:52 0:57 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11109 3.7 0.1 3676828 212160 ? S 17:52 0:51 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11110 4.2 0.9 4855708 1221716 ? D 17:52 0:57 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11111 4.1 0.3 4415388 503028 ? D 17:52 0:56 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11112 4.1 0.9 4851612 1282192 ? D 17:52 0:56 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11113 3.9 1.0 4922268 1336456 ? D 17:52 0:53 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11114 4.0 0.1 4852892 258196 ? D 17:52 0:55 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11115 4.0 0.5 4071324 783204 ? D 17:52 0:55 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11116 4.0 0.7 4826012 924024 ? D 17:52 0:55 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11117 93.9 0.2 4014748 360652 ? R 17:52 21:16 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11118 4.0 0.9 4818844 1288868 ? D 17:52 0:54 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11119 3.5 0.2 3616412 300944 ? S 17:52 0:48 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11120 3.9 0.1 4928156 191964 ? D 17:52 0:54 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11121 3.7 0.1 3633820 212124 ? S 17:52 0:50 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11122 3.9 1.0 4957852 1336356 ? D 17:52 0:53 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11123 3.3 0.1 3753116 214172 ? S 17:52 0:45 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11124 3.8 1.0 4923292 1447716 ? D 17:52 0:52 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11125 3.8 0.2 3986076 351304 ? D 17:52 0:52 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11127 3.8 0.4 4893596 564372 ? D 17:52 0:51 \_ python WT_Compartments_and_TADs_f_WT.py
6624 11129 3.3 0.1 3675804 213036 ? S 17:52 0:45 \_ python WT_Compartments_and_TADs_f_WT.py
[...system processes removed for brevity...]
# free
total used free shared buff/cache available
Mem: 131912884 20336556 110194980 288324 1381348 110325260
Swap: 50331644 50331644 0
Again, keep in mind that it is ~48GB into swap while there is >100GB of memory free.
More data coming tomorrow. (I have slabinfo data, but I'm not really sure how to interpret it.)
What's the GPFS pagepool size on the nodes?
On the cluster where the TADbit cases have been seen (Owens), it's 1.5 GB out of total physical memory of 128GB. On the cluster where the Trinity case was observed (Pitzer), it's 3 GB out of total physical memory of 192 GB.
Okay - so that's not huge and wouldn't have a material impact here.
If you set the cgroups memory limit << RAM (e.g. 96GB) do you observe proper execution or functional cgroup OOM kills?
No. That's actually how we found this issue in the first place -- one of these TADbit Python jobs asked for nodes=1:ppn=24,mem=64gb and ended up ~12GB into swap with ~60GB of memory free on the node. It turned out that that particular case actually needed a memory limit more like 70GB to keep from swapping. With the second TADbit Python case that I've been using today (the one that actually needs 132GB total), if I run it asking for nodes=1:ppn=28,mem=64gb (i.e. about half of physical memory, much less than it needs), it will go 48GB into swap in a couple minutes while there's >100GB of memory free.
So this makes me think things are working properly. If you set mem=64gb and that's enforced by cgroups, then it should start going into swap when it's going to reach that limit, which would result in ~60GB free on a 128GB node?
https://jvns.ca/blog/2017/02/17/mystery-swap/
My model of memory limits on cgroups was always “if you use more than X memory, you will get killed right away”. It turns out that that assumption was wrong! If you use more than X memory, you can still use swap!
And apparently some kernels also support setting separate swap limits. So you could set your memory limit to X and your swap limit to 0, which would give you more predictable behavior. Swapping is weird and confusing.
The fact that with mem=64gb the 1st TADbit job goes 12GB into swap, and the second one goes 48GB into swap, just suggests to me that they ultimately require more than the cgroup memory limit, but the cgroup is allowing swap. TADbit is probably working on some very large structure and trying to do a huge allocation in one step, leading to them swapping before you see them 'using' the 64GB specified in the mem/cgroup limit.
Again, keep in mind that it is ~48GB into swap while there is >100GB of memory free.
The >100GB free is not likely relevant - everything is related to the --mem 64GB limit imposed with cgroups. The process will swap if it has hit that / is trying to do something that will exceed that, regardless of how much is free on the host.
Looking around a bit I came across this:
https://github.com/adaptivecomputing/torque/issues/372
And going on a bit from there and searching around, it appears using --mem only sets memory.limit_in_bytes - swap is separate under cgroups, so the jobs will swap. You'd need to use --vmem at job submission, which sets both memory.limit_in_bytes and memory.memsw.limit_in_bytes in the job cgroup, to stop the swapping and see OOM kills as expected.
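To spell out what that maps to at the cgroup level, roughly (cgroup v1 paths as they appear in your /proc/self/cgroup output; the job ID and the 64g value here are just placeholders):

JOBID=9369100.owens-batch.ten.osc.edu   # substitute the real Torque job ID
CG=/sys/fs/cgroup/memory/torque/$JOBID

# --mem alone: RAM usage is capped, but the job can still spill into swap
echo 64g | sudo tee "$CG/memory.limit_in_bytes"

# Capping RAM+swap as well turns "hit the limit" into a cgroup OOM kill
echo 64g | sudo tee "$CG/memory.memsw.limit_in_bytes"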
going on a bit from there and searching around, it appears using --mem only sets memory.limit_in_bytes - swap is separate under cgroups, so the jobs will swap. You'd need to use --vmem at job submissions which sets both memory.limit_in_bytes and memory.memsw.limit_in_bytes to be translated into a job cgroup to stop the swapping, and see OOM kills as expected.
That's not the case in our environment -- as I mentioned at the beginning of the issue, memory.limit_in_bytes and memory.memsw.limit_in_bytes are identical. (That's fixed up by our job prologue.) For instance, in the TADbit Python job whose process list I posted last night that requests mem=64gb but needs ~132gb, I have it print out the cgroup limit_in_bytes files before it runs Singularity:
memory.limit_in_bytes=68719476736
memory.memsw.limit_in_bytes=68719476736
One of the monitoring processes I've added to these jobs is a Python script that logs several of the values in the memory cgroup once a second, including memory.usage_in_bytes and memory.memsw.usage_in_bytes. What I see with these jobs is that the former will shoot up to the limit and then drop to limit-swap (~16GB in the TADbit Python case that requests mem=64gb but needs ~132gb), but neither limit is ever actually exceeded and so the OOM killer never fires. It's very strange.
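(I haven't posted the monitor script itself; a minimal equivalent, for anyone wanting to reproduce the logging, assuming the torque cgroup v1 layout shown earlier, would be something like:)

#!/bin/bash
# Minimal cgroup memory logger: one line per second with usage and usage+swap
CG=/sys/fs/cgroup/memory/torque/$PBS_JOBID   # Torque exports PBS_JOBID inside the job
while sleep 1; do
    echo "$(date +%s) usage=$(cat $CG/memory.usage_in_bytes) memsw=$(cat $CG/memory.memsw.usage_in_bytes)"
done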
And again, we've only seen this behavior with these two classes of Singularity jobs. Other memory-intensive jobs are cleaned up by the OOM killer as expected.
That's not the case in our environment -- as I mentioned at the beginning of the issue, memory.limit_in_bytes and memory.memsw.limit_in_bytes are identical. (That's fixed up by our job prologue.)
Apologies - I'd also come across the OSC docs here which led me to think they wouldn't be, as it mentions the mem vs vmem stuff - https://www.osc.edu/documentation/knowledge_base/out_of_memory_oom_or_excessive_memory_usage
And again, we've only seen this behavior with these two classes of Singularity jobs. Other memory-intensive jobs are cleaned up by the OOM killer as expected.
Can you specifically confirm if these TADbit and Trinity jobs have been run outside of Singularity, in a straight conda environment or something similar, and that the behaviour is not the same?
@cclerget and I are at a loss regarding how Singularity could impact this, given the cgroups are showing up correctly when you did the singularity exec cat /proc/self/cgroup check.
What I see with these jobs is that the former will shoot up to the limit and then drop to limit-swap (~16GB in the TADbit Python case that requests mem=64gb but needs ~132gb), but neither limit is ever actually exceeded and so the OOM killer never fires. It's very strange.
This is indeed very strange. Wildly speculating, it feels a bit like these jobs are doing something very odd memory-management-wise to ride right up to the limit of what is available... and there is a weird interaction with the cgroup limits.
Can you specifically confirm if these TADbit and Trinity jobs have been run outside of Singularity, in a straight conda environment or something similar, and that the behaviour is not the same?
I have no way to do that myself. I will ask the users if they've ever tried it, but my suspicion is no.
Can you specifically confirm if these TADbit and Trinity jobs have been run outside of Singularity, in a straight conda environment or something similar, and that the behaviour is not the same?
I have no way to do that myself. I will ask the users if they've ever tried it, but my suspicion is no.
Thanks. I'm out of ideas here really, other than a non-Singularity specific strange interaction between the way these applications are dealing with memory and the cgroup limits.
Given that singularity exec cat /proc/self/cgroup shows the expected output, I don't understand how Singularity could be influencing the behavior. I've chatted to @cclerget also, and as far as he is aware, once the pid of a process is set in cgroups there is no way for the process and all its descendants to escape it unless there is a kernel bug somewhere.
I've changed the title of the issue to specify "Tadbit" and "Trinity" in case it helps anyone driving by who might have come across something similar to notice this issue.
@tabaer Have you checked that memory.use_hierarchy is set to 1 for the job cgroup?
@tabaer Have you checked that memory.use_hierarchy is set to 1 for the job cgroup ?
Yes, it is:
memory.use_hierarchy=1
@tabaer And what if you adjust memory.memsw.limit_in_bytes to be, say, 8G above memory.limit_in_bytes? Maybe setting them to identical values triggered some strange path in the kernel.
Also, have you checked that running a memory stress app can trigger the OOM killer from inside a Singularity container? I see mention of consumer tests, but were they done inside a Singularity container?
Also, have you checked that running a memory stress app can trigger the OOM killer from inside a Singularity container? I see mention of consumer tests, but were they done inside a Singularity container?
I just tried running my memory consumer test job inside the TADbit container. It appeared to be killed by the OOM killer as expected with minimal swapping.
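(I haven't posted the consumer script itself; a hypothetical stand-in for that kind of test, assuming the image provides a Python interpreter, would look roughly like this:)

# Allocate 1 GiB chunks inside the container until the cgroup limit is hit;
# with memory.memsw.limit_in_bytes capped this should end in a cgroup OOM kill
singularity exec tadbit.sif python -c '
import sys, time
chunks = []
while True:
    chunks.append(bytearray(1024 * 1024 * 1024))
    sys.stdout.write("allocated %d GiB\n" % len(chunks))
    sys.stdout.flush()
    time.sleep(1)
'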
I just tried running my memory consumer test job inside the TADbit container. It appeared to be killed by the OOM killer as expected with minimal swapping.
That confirms the issue lies in the application itself (maybe in conjunction with a cgroup issue?). I would suggest testing other versions of those apps, older or newer, to see if it helps.
I'm going to close this issue now, as we got to the point of having decent confidence this is something specific to the application (likely its interaction with cgroups limits), rather than Singularity.
Please don't hesitate to re-open if any further troubleshooting points to a different conclusion. Thanks.
We’ve seen some situations with a couple of our users’ Singularity jobs where the jobs will push the nodes several gigabytes into swap even though there is plenty of physical memory available. This causes problems when other users’ jobs, or even system services (particularly the IBM GPFS mmfsd daemon) on the node, get swapped out. We have not been able to reproduce this behavior other than with these Singularity jobs. We are wondering if this behavior has been observed at other sites.
Copying @dpjohnson, @treydock, and @ZQyou, as they are also at @OSC.
Version of Singularity:
Expected behavior
Jobs run and use memory up to their requested limits without swapping.
Actual behavior
One example of this using a TADbit Python image ran on a node with 28 cores and 128GB of memory (~120GB usable). The job pushed the node into swap with >11GB of memory free and eventually settled out to using ~80GB of physical memory and 48GB of swap (all available) while having ~45GB of memory free. At no time did the job appear to exceed the memory or memory+swap limits set in its memory cgroup, which in any case were identical.
Steps to reproduce this behavior
We have not been able to find a simple, self-contained reproducer of this problem. The most common offenders are jobs running Singularity containers with either the TADbit Python stack (https://github.com/3DGenomes/TADbit) or the Trinity RNA-seq stack (https://github.com/trinityrnaseq/trinityrnaseq).
What OS/distro are you running
Other potentially relevant system info:
How did you install Singularity
RPMs built from the upstream spec file.