Strange swapping behavior with TADbit / Trinity container jobs and cgroups memory limits #5041

Closed · tabaer closed this issue 4 years ago

tabaer commented 4 years ago

We’ve seen some situations with a couple of our users’ Singularity jobs where the jobs will push the nodes several gigabytes into swap even though there is plenty of physical memory available. This causes problems with other users’ jobs, or even system services on the node (particularly the IBM GPFS mmfsd daemon), getting swapped out. We have not been able to reproduce this behavior with anything other than these Singularity jobs. We are wondering if this behavior has been observed at other sites.

Copying @dpjohnson, @treydock, and @ZQyou, as they are also at @OSC.

Version of Singularity:

troy@owens-login02:~$ singularity version
3.5.2-1.el7

Expected behavior

Jobs run and use memory up to their requested limits without swapping.

Actual behavior

One example of this, using a TADbit Python image, ran on a node with 28 cores and 128GB of memory (~120GB usable). The job pushed the node into swap with >11GB of memory free and eventually settled out to using ~80GB of physical memory and 48GB of swap (all available) while having ~45GB of memory free. At no time did the job appear to exceed the memory or memory+swap limits set in its memory cgroup, which in any case were identical.

Steps to reproduce this behavior

We have not been able to find a simple, self-contained reproducer of this problem. The most common offenders are jobs running Singularity containers with either the TADbit Python stack (https://github.com/3DGenomes/TADbit) or the Trinity RNA-seq stack (https://github.com/trinityrnaseq/trinityrnaseq).

What OS/distro are you running

troy@owens-login02:~$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Other potentially relevant system info:

How did you install Singularity

RPMs built from the upstream spec file.

dtrudg commented 4 years ago

Are the containers public, and can you give more information about exactly what is being run in the TADbit and Trinity stacks when things go into swap?

Any additional log information (singularity -d ....), process-specific VIRT/RES data, etc. from the affected jobs may be useful. @cclerget may be able to have a think about possible causes tomorrow also.

The only similar issue with weird early swapping I've run into in the past with Trinity was on a RHEL6 cluster without any cgroups involvement, and it could be replicated outside of a Singularity container. It was specific to Java applications making very large memory allocations in that case, and seemed to be made worse by large GPFS pagepool settings.

tabaer commented 4 years ago

The containers and data are owned by our users, not us, and I don't know if they're public. We will have to check with the users in question about that. I'm a sysadmin and don't know the science of these users very well, so I don't know exactly what they're doing.

I think you're on to something WRT the Trinity Java scenario you describe. I've instrumented one of the TADbit Python cases I mentioned to log its memory and swap behavior over time, and what I see is that its memory usage goes up to ~85 GB in the first couple minutes after startup, quickly drops to <10 GB, slowly creeps back up to ~40 GB and levels off until ~45 minutes in, whereupon the memory usage quickly shoots up to 120+ GB and the swapping begins. I'm rerunning that instrumented case with singularity -d right now and should have its results tomorrow.

cclerget commented 4 years ago

vm.swappiness = 60

@tabaer That's too high for HPC use if you want to minimize swapping; a value between 0 and 10 would be more appropriate. This matters even more with a cgroups memory limit in place for the job: with cgroups enabled, the job's cgroup memory.swappiness should also be 60. Is that the case? If I understand the memory cgroup correctly, it tells the kernel to start swapping out when it reaches 60% of free memory (60% of memory.limit_in_bytes).

So a first attempt at keeping the job from swapping would be to set memory.swappiness to 0, if TORQUE allows configuring that and doesn't already set it to 0 or a low value. With memory.swappiness set to 0, the job should be killed once it hits memory.limit_in_bytes, without swapping and without interfering with other jobs.
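
For reference, a minimal sketch of checking and changing that directly on a node (the v1 cgroup path is an assumption based on the Torque hierarchy shown later in this thread, and the job id is illustrative):

CG=/sys/fs/cgroup/memory/torque/9369100.owens-batch.ten.osc.edu
cat $CG/memory.swappiness        # inherited from the parent cgroup, typically 60
echo 0 > $CG/memory.swappiness   # as root, e.g. from the job prologue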

tabaer commented 4 years ago

vm.swappiness = 60

@tabaer That's too high for HPC use if you want to minimize swapping; a value between 0 and 10 would be more appropriate. This matters even more with a cgroups memory limit in place for the job: with cgroups enabled, the job's cgroup memory.swappiness should also be 60. Is that the case?

I've tried setting memory.swappiness to 0 in some memory consumer test jobs, and it made no difference in their behavior that I could find. However, I have not tried it with these specific jobs, so I'll try that next. Thanks.

BTW, in my tests last night, I found that the TADbit Python case I mentioned earlier will use ~132 GB of memory if it's available, i.e. 10% more than what is usable on most of our nodes.

tabaer commented 4 years ago

Setting swappiness=0 just in the job cgroup didn't have much of an effect on a Singularity job either. I'll try setting vm.swappiness=0 node-wide next and see if that changes anything.

tabaer commented 4 years ago

What I've observed with vm.swappiness=0 is that there is indeed no swapping, but the processes still become unresponsive at the point where they would normally start swapping and quickly go into the unkillable D state. It looks like the OOM killer does fire to try to kill these processes, but not until after they're unkillable. At that point, the only way to get the node working again is to reboot it, which is actually worse than the default behavior at vm.swappiness=60 (where the node swaps its brains out but will recover if the job is killed).
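
For anyone trying to spot the same state on a node, a couple of generic checks (nothing here is specific to our setup):

dmesg -T | grep -iE 'out of memory|oom|blocked for more than'   # OOM-killer and hung-task messages
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'                 # processes stuck in uninterruptible sleep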

What's really strange about this is that other memory intensive programs such as the memory consumer test I've alluded to don't behave like these Singularity jobs at vm.swappiness=60; they bump up against their memory limits and maybe swap a little bit before they are killed off by the OOM killer as expected.

dtrudg commented 4 years ago

@tabaer - when they are stuck in D state and are unkillable, do you know what's going on in the jobs? Are they in the midst of heavy I/O - and it's possibly GPFS related stuff being swapped out that is resulting in the system getting hung-up and OOM kills not taking effect?

If what's running inside these Singularity jobs is doing a lot of large-file I/O through GPFS, it'd be interesting to know whether the behavior is the same when the data is on some other form of storage, if available. Also, I wonder if you've looked at slabtop and seen anything weird with slab allocations?
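
For reference, one-shot versions of those checks might look like this (mmlsconfig is the GPFS admin command; the path assumes a standard GPFS install):

slabtop -o -s c | head -20              # largest slab caches, single snapshot
/usr/lpp/mmfs/bin/mmlsconfig pagepool   # current GPFS pagepool setting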

I ask this as I previously worked in a place where we had some issues with certain GPFS configs and weird OOM behavior - though that was on nodes with smaller RAM and an excessive pagepool size I seem to recall. It was also observed in non-singularity jobs... but they were similar bioinformatics large data / high mem usage tools as far as I can remember.

Final question - have you run the Singularity stuff directly from a shell in any of these tests, vs. under Torque with its cgroups in place? If so, same behavior?

Sorry I'm just firing off a bunch more questions. This seems like really strange behavior, and I have no real clue what would cause it. It's giving me flashbacks to an unrelated maddening thing from my HPC admin past, with RHEL6 huge pages and kswapd going crazy.

dtrudg commented 4 years ago

Another thing I might suggest is an appeal on the mailing list... there are potentially folks there with ideas who might not see this issue.

https://groups.google.com/a/lbl.gov/forum/#!forum/singularity

tabaer commented 4 years ago

If what's running inside these Singularity jobs is doing a lot of large file I/O through GPFS it'd be interesting to know if there's the same behavior if the data is on some other form of storage, if available. Also, I wonder if you've looked at slabtop and seen anything weird with slab allocations?

I have not tried running these cases on local disk, because I've been trying to simulate what our users have been doing as much as possible.

I also tried this with vm.swappiness=10, which showed pretty much the same behavior as vm.swappiness=60, except it waited a little longer to start swapping. My next test is going to be disabling swap on the node.

dtrudg commented 4 years ago

Going back through this a bit, I noticed something I missed in the first post...

settled out to using ~80GB of physical memory and 48GB of swap

... being suspiciously equal to the node total ...

128GB of memory (~120GB usable)

As if the process had a view that the total RAM on the machine was 128GB and entirely available to it, so it could expand to that. Not clear exactly what triggers it to swap way before the 120GB usable is hit... but it's something a bit odd to look at.

Are you able to post exact detail of the cgroup config that Torque is enforcing for the job, or if you are setting manually when trying to debug?

Then - if you submit a job which doesn't run the real workload, but does singularity exec cat /proc/self/cgroup - do you get the expected output? Any non-default singularity config? Is it running in setuid mode or using user namespace?

tabaer commented 4 years ago

Disabling swap looks a lot like vm.swappiness=0; the processes get into the unkillable D state, and the OOM killer can't kill them, and the only way to get the node back in a sane state is to reboot it.

dtrudg commented 4 years ago

Disabling swap looks a lot like vm.swappiness=0; the processes get into the unkillable D state, and the OOM killer can't kill them, and the only way to get the node back in a sane state is to reboot it.

Which processes is it in this case? TADbit / trinity in Singularity / the memory hog? Can you provide the command line or the Torque job?

Looking at the Trinity case, it appears the Butterfly portion of Trinity expects to allocate a 10GB Java heap for each CPU it's told to use:

https://trinityrnaseq.github.io/performance/mem.html

Wondering what --CPU was set to for the run there, and whether the early swapping, in the Trinity case at least, could be related to how massive amounts of Java heap are being allocated? Just trying to pin down some specific area to think about first.

How is trinity being run compared to their Singularity example?

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-in-Docker#trinity_singularity

tabaer commented 4 years ago

Disabling swap looks a lot like vm.swappiness=0; the processes get into the unkillable D state, and the OOM killer can't kill them, and the only way to get the node back in a sane state is to reboot it.

Which processes is it in this case?

Sorry, it's the TADbit Python case. The job executes it like this:

singularity -d exec tadbit.sif python WT_Compartments_and_TADs_f_WT.py

The Python script executed above is from the user.

if you submit a job which doesn't run the real workload, but does singularity exec cat /proc/self/cgroup - do you get the expected output?

Yes, it looks the same AFAICT:

troy@owens-login02:/fs/scratch/PZS0708/troy/WT_Tcell_matrix$ qsub -I -l nodes=1:ppn=28,mem=118gb

qsub: waiting for job 9369100.owens-batch.ten.osc.edu to start
qsub: job 9369100.owens-batch.ten.osc.edu ready

troy@o0109:~$ cd $PBS_O_WORKDIR

troy@o0109:/fs/scratch/PZS0708/troy/WT_Tcell_matrix$ cat /proc/self/cgroup
11:devices:/torque/9369100.owens-batch.ten.osc.edu
10:pids:/system.slice/pbs_mom.service
9:cpuset:/torque/9369100.owens-batch.ten.osc.edu
8:net_prio,net_cls:/
7:perf_event:/
6:blkio:/system.slice/pbs_mom.service
5:cpuacct,cpu:/torque/9369100.owens-batch.ten.osc.edu
4:memory:/torque/9369100.owens-batch.ten.osc.edu
3:freezer:/
2:hugetlb:/
1:name=systemd:/system.slice/pbs_mom.service

troy@o0109:/fs/scratch/PZS0708/troy/WT_Tcell_matrix$ singularity exec tadbit.sif cat /proc/self/cgroup
11:devices:/torque/9369100.owens-batch.ten.osc.edu
10:pids:/system.slice/pbs_mom.service
9:cpuset:/torque/9369100.owens-batch.ten.osc.edu
8:net_prio,net_cls:/
7:perf_event:/
6:blkio:/system.slice/pbs_mom.service
5:cpuacct,cpu:/torque/9369100.owens-batch.ten.osc.edu
4:memory:/torque/9369100.owens-batch.ten.osc.edu
3:freezer:/
2:hugetlb:/
1:name=systemd:/system.slice/pbs_mom.service

Any non-default singularity config? Is it running in setuid mode or using user namespace?

AFAIK, we're not doing anything clever in our Singularity configs other than bind-mounting our GPFS file systems and a couple config files. I believe it's using user namespaces, but I will have to verify that with @treydock.
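
A quick way to check would be something like the following, assuming the default locations for a 3.x install built from the RPM spec (paths are an assumption, not taken from our systems):

grep '^allow setuid' /etc/singularity/singularity.conf   # 'yes' means the setuid starter can be used
ls -l /usr/libexec/singularity/bin/starter-suid          # present and setuid-root in setuid mode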

BTW, I can't tell if this is relevant or not, but once these processes get stuck in the D state, there are messages in syslog about them where the kernel stack traces mention squashfs, which is presumably related to accessing the container:

Feb 13 16:26:41 o0448 kernel: INFO: task python:176153 blocked for more than 120 seconds.
Feb 13 16:26:41 o0448 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 16:26:41 o0448 kernel: python          D ffff9da4a94d8000     0 176153 147944 0x00100004
Feb 13 16:26:41 o0448 kernel: Call Trace:
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d60368>] ? queued_spin_lock_slowpath+0xb/0xf
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d6b8b9>] schedule+0x29/0x70
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca5735>] squashfs_cache_get+0x105/0x3c0 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca5ed8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffff846c4280>] ? wake_up_atomic_t+0x30/0x30
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca6001>] squashfs_get_datablock+0x21/0x30 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffffc0ca7262>] squashfs_readpage+0x882/0xbe0 [squashfs]
Feb 13 16:26:41 o0448 kernel: [<ffffffff847c54b8>] __do_page_cache_readahead+0x248/0x260
Feb 13 16:26:41 o0448 kernel: [<ffffffff847c5aa1>] ra_submit+0x21/0x30
Feb 13 16:26:41 o0448 kernel: [<ffffffff847ba575>] filemap_fault+0x105/0x490
Feb 13 16:26:41 o0448 kernel: [<ffffffff848353be>] ? mem_cgroup_reclaim+0x4e/0x120
Feb 13 16:26:41 o0448 kernel: [<ffffffff847e618a>] __do_fault.isra.59+0x8a/0x100
Feb 13 16:26:41 o0448 kernel: [<ffffffff847e673c>] do_read_fault.isra.61+0x4c/0x1b0
Feb 13 16:26:41 o0448 kernel: [<ffffffff847eb0c4>] handle_pte_fault+0x2f4/0xd10
Feb 13 16:26:41 o0448 kernel: [<ffffffff847edbfd>] handle_mm_fault+0x39d/0x9b0
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d73623>] __do_page_fault+0x203/0x4f0
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d73945>] do_page_fault+0x35/0x90
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d6fb7f>] ? error_exit+0x1f/0x60
Feb 13 16:26:41 o0448 kernel: [<ffffffff84d6f778>] page_fault+0x28/0x30

dtrudg commented 4 years ago

Thanks for this information.

If the container is on GPFS then I think the squashfs dmesg stuff makes sense. From past experience this certainly sounds like the situation where the processes are stuck uninterruptible as they are trying to do I/O and the GPFS daemons / kernel stuff can't satisfy the I/O due to there being no RAM available on the machine for them to work properly.

Given that the cgroup limits are visible from a container process as they should be, at this point I'm not convinced there's any Singularity-related cause of this behavior, unless it can be shown that the same TADbit or Trinity jobs, with the same job limits, succeed when they are not run in a container. Has that been confirmed?

I don't think we can go much further into troubleshooting without a lot more specific information about the exact jobs that are failing, and their state at the time they fail, including things like:

What's the GPFS pagepool size on the nodes?
If you set the cgroups memory limit << RAM (e.g. 96GB) do you observe proper execution or functional cgroup OOM kills?
A snapshot process listing at a problematic time, showing RES/VIRT etc. allocations.

Cheers.

tabaer commented 4 years ago

What's the GPFS pagepool size on the nodes?

On the cluster where the TADbit cases have been seen (Owens), it's 1.5 GB out of total physical memory of 128GB. On the cluster where the Trinity case was observed (Pitzer), it's 3 GB out of total physical memory of 192 GB.

If you set the cgroups memory limit << RAM (e.g. 96GB) do you observe proper execution or functional cgroup OOM kills?

No. That's actually how we found this issue in the first place -- one of these TADbit Python jobs asked for nodes=1:ppn=24,mem=64gb and ended up ~12GB into swap with ~60GB of memory free on the node. It turned out that that particular case actually needed a memory limit more like 70GB to keep from swapping. With the second TADbit Python case that I've been using today (the one that actually needs 132GB total), if I run it asking for nodes=1:ppn=28,mem=64gb (i.e. about half of physical memory, much less than it needs), it will go 48GB into swap in a couple minutes while there's >100GB of memory free.

A snapshot process listing at a problematic time, showing RES/VIRT etc. allocations.

During the TADbit Python case I mentioned above that requested 64gb when it needs more like 132gb:

# ps auxwf
[...system processes removed for brevity...]
root      25428  0.0  0.0 389624 84428 ?        SLsl Feb03  12:28 /opt/torque/sbin/pbs_mom -d /var/spool/torque -H o0116
6624       9970  0.0  0.0 125868   860 ?        Ss   17:51   0:00  \_ -bash /var/spool/torque/mom_priv/jobs/9369472.owens-ba.
6624      10144  3.1  0.0 147952  2316 ?        S    17:51   0:45      \_ /usr/bin/python ./memmon
6624      10145  0.0  0.0 115372   760 ?        S    17:51   0:00      \_ /bin/bash ./swapmon
6624      10154  2.2  0.0 112288   808 ?        S    17:51   0:32      |   \_ sar -B 1 14400
6624      10156  3.2  0.0 117468   988 ?        S    17:51   0:45      |       \_ sadc 1 14401 -z
6624      10146  0.1  0.0 116896  1032 ?        S    17:51   0:02      \_ /bin/bash ./slabmon
root      15682  0.0  0.0  19756   344 ?        D    18:15   0:00      |   \_ sudo /bin/cat /proc/slabinfo
6624      10147  0.5  0.0 420960   752 ?        Sl   17:51   0:07      \_ Singularity runtime parent
6624      10224  4.3  0.2 3402396 279500 ?      Sl   17:51   1:02          \_ python WT_Compartments_and_TADs_f_WT.py
6624      11100  4.3  1.0 4957852 1365872 ?     D    17:52   0:59              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11101  4.3  1.1 4858524 1476348 ?     D    17:52   0:58              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11102  4.1  0.2 3815068 347864 ?      S    17:52   0:57              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11103  4.3  1.0 4898972 1363504 ?     D    17:52   0:58              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11104  4.0  0.4 4075420 590392 ?      D    17:52   0:54              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11105  4.2  0.9 4922780 1285884 ?     D    17:52   0:58              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11106  3.7  0.1 3570588 219844 ?      S    17:52   0:50              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11107  4.2  0.1 4821916 232084 ?      D    17:52   0:57              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11108  4.2  0.1 4811164 192888 ?      D    17:52   0:57              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11109  3.7  0.1 3676828 212160 ?      S    17:52   0:51              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11110  4.2  0.9 4855708 1221716 ?     D    17:52   0:57              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11111  4.1  0.3 4415388 503028 ?      D    17:52   0:56              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11112  4.1  0.9 4851612 1282192 ?     D    17:52   0:56              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11113  3.9  1.0 4922268 1336456 ?     D    17:52   0:53              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11114  4.0  0.1 4852892 258196 ?      D    17:52   0:55              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11115  4.0  0.5 4071324 783204 ?      D    17:52   0:55              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11116  4.0  0.7 4826012 924024 ?      D    17:52   0:55              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11117 93.9  0.2 4014748 360652 ?      R    17:52  21:16              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11118  4.0  0.9 4818844 1288868 ?     D    17:52   0:54              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11119  3.5  0.2 3616412 300944 ?      S    17:52   0:48              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11120  3.9  0.1 4928156 191964 ?      D    17:52   0:54              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11121  3.7  0.1 3633820 212124 ?      S    17:52   0:50              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11122  3.9  1.0 4957852 1336356 ?     D    17:52   0:53              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11123  3.3  0.1 3753116 214172 ?      S    17:52   0:45              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11124  3.8  1.0 4923292 1447716 ?     D    17:52   0:52              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11125  3.8  0.2 3986076 351304 ?      D    17:52   0:52              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11127  3.8  0.4 4893596 564372 ?      D    17:52   0:51              \_ python WT_Compartments_and_TADs_f_WT.py
6624      11129  3.3  0.1 3675804 213036 ?      S    17:52   0:45              \_ python WT_Compartments_and_TADs_f_WT.py
[...system processes removed for brevity...]

# free
              total        used        free      shared  buff/cache   available
Mem:      131912884    20336556   110194980      288324     1381348   110325260
Swap:      50331644    50331644           0

Again, keep in mind that it is ~48GB into swap while there is >100GB of memory free.

More data coming tomorrow. (I have slabinfo data, but I'm not really sure how to interpret it.)

dtrudg commented 4 years ago

What's the GPFS pagepool size on the nodes?

On the cluster where the TADbit cases have been seen (Owens), it's 1.5 GB out of total physical memory of 128GB. On the cluster where the Trinity case was observed (Pitzer), it's 3 GB out of total physical memory of 192 GB.

Okay - so that's not huge and wouldn't have a material impact here.

If you set the cgroups memory limit << RAM (e.g. 96GB) do you observe proper execution or functional cgroup OOM kills?

No. That's actually how we found this issue in the first place -- one of these TADbit Python jobs asked for nodes=1:ppn=24,mem=64gb and ended up ~12GB into swap with ~60GB of memory free on the node. It turned out that that particular case actually needed a memory limit more like 70GB to keep from swapping. With the second TADbit Python case that I've been using today (the one that actually needs 132GB total), if I run it asking for nodes=1:ppn=28,mem=64gb (i.e. about half of physical memory, much less than it needs), it will go 48GB into swap in a couple minutes while there's >100GB of memory free.

So this makes me think things are working properly. If you set mem=64gb and that's enforced by cgroups then it should start going into swap when it's going to reach that limit, which would result in ~60GB free on a 128GB node?

https://jvns.ca/blog/2017/02/17/mystery-swap/

My model of memory limits on cgroups was always “if you use more than X memory, you will get killed right away”. It turns out that that assumption was wrong! If you use more than X memory, you can still use swap!

And apparently some kernels also support setting separate swap limits. So you could set your memory limit to X and your swap limit to 0, which would give you more predictable behavior. Swapping is weird and confusing.

The fact that with mem=64gb the 1st TADbit job goes 12GB into swap and the second one goes 48GB into swap just suggests to me that they ultimately require more than the cgroup memory limit, but the cgroup is allowing swap. TADbit is probably working on some very large structure and trying to do a huge allocation in one step, leading to it swapping before you see it 'using' the 64GB specified in the mem/cgroup limit.

Again, keep in mind that it is ~48GB into swap while there is >100GB of memory free.

The >100GB free is not likely relevant - everything is related to the --mem 64GB limit imposed with cgroups. The process will swap if it has hit that / is trying to do something that will exceed that, regardless of how much is free on the host.

Looking around a bit I came across this:

https://github.com/adaptivecomputing/torque/issues/372

And going on from there and searching around, it appears that using --mem only sets memory.limit_in_bytes; swap is separate under cgroups, so the jobs will swap. You'd need to use --vmem at job submission, which sets both memory.limit_in_bytes and memory.memsw.limit_in_bytes in the job cgroup, to stop the swapping and see OOM kills as expected.
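
For completeness, a quick way to confirm exactly which limits ended up in the job's memory cgroup (the path is taken from the /proc/self/cgroup output above; adjust the job id):

CG=/sys/fs/cgroup/memory/torque/9369100.owens-batch.ten.osc.edu
cat $CG/memory.limit_in_bytes $CG/memory.memsw.limit_in_bytes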

tabaer commented 4 years ago

going on from there and searching around, it appears that using --mem only sets memory.limit_in_bytes; swap is separate under cgroups, so the jobs will swap. You'd need to use --vmem at job submission, which sets both memory.limit_in_bytes and memory.memsw.limit_in_bytes in the job cgroup, to stop the swapping and see OOM kills as expected.

That's not the case in our environment -- as I mentioned at the beginning of the issue, memory.limit_in_bytes and memory.memsw.limit_in_bytes are identical. (That's fixed up by our job prologue.) For instance, in the TADbit Python job whose process list I posted last night that requests mem=64gb but needs ~132gb, I have it print out the cgroup limit_in_bytes files before it runs Singularity:

memory.limit_in_bytes=68719476736
memory.memsw.limit_in_bytes=68719476736

One of the monitoring processes I've added to these jobs is a Python script that logs several of the values in the memory cgroup once a second, including memory.usage_in_bytes and memory.memsw.usage_in_bytes. What I see with these jobs is that the former will shoot up to the limit and then drop to limit-swap (~16GB in the TADbit Python case that requests mem=64gb but needs ~132gb), but neither limit is ever actually exceeded and so the OOM killer never fires. It's very strange.
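
(The logger itself is nothing fancy; a rough shell equivalent of what it records, assuming the Torque cgroup path shown earlier in the thread, would be:)

CG=/sys/fs/cgroup/memory/torque/$PBS_JOBID
while sleep 1; do
    echo "$(date +%s) $(cat $CG/memory.usage_in_bytes) $(cat $CG/memory.memsw.usage_in_bytes)"
done >> memcg.log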

And again, we've only seen this behavior with these two classes of Singularity jobs. Other memory-intensive jobs are cleaned up by the OOM killer as expected.

dtrudg commented 4 years ago

That's not the case in our environment -- as I mentioned at the beginning of the issue, memory.limit_in_bytes and memory.memsw.limit_in_bytes are identical. (That's fixed up by our job prologue.)

Apologies - I'd also come across the OSC docs here which led me to think they wouldn't be, as it mentions the mem vs vmem stuff - https://www.osc.edu/documentation/knowledge_base/out_of_memory_oom_or_excessive_memory_usage

And again, we've only seen this behavior with these two classes of Singularity jobs. Other memory-intensive jobs are cleaned up by the OOM killer as expected.

Can you specifically confirm if these TADbit and Trinity jobs have been run outside of Singularity, in a straight conda environment or something similar, and that the behaviour is not the same?

@cclerget and I are at a loss regarding how Singularity could impact this, given the cgroups are showing up correctly when you ran `singularity exec cat /proc/self/cgroup`.

What I see with these jobs is that the former will shoot up to the limit and then drop to limit-swap (~16GB in the TADbit Python case that requests mem=64gb but needs ~132gb), but neither limit is ever actually exceeded and so the OOM killer never fires. It's very strange.

This is indeed very strange. Wildly speculating, it feels a bit like these jobs are doing something very odd memory-management-wise to ride right up to the limit of what is available... and there is a weird interaction with the cgroup limits.

tabaer commented 4 years ago

Can you specifically confirm if these TADbit and Trinity jobs have been run outside of Singularity, in a straight conda environment or something similar, and that the behaviour is not the same?

I have no way to do that myself. I will ask the users if they've ever tried it, but my suspicion is no.

dtrudg commented 4 years ago

Can you specifically confirm if these TADbit and Trinity jobs have been run outside of Singularity, in a straight conda environment or something similar, and that the behaviour is not the same?

I have no way to do that myself. I will ask the users if they've ever tried it, but my suspicion is no.

Thanks. I'm out of ideas here really, other than a non-Singularity specific strange interaction between the way these applications are dealing with memory and the cgroup limits.

Given that the singularity exec cat /proc/self/cgroup shows the expected output then I don't understand how Singularity could be influencing the behavior. I've chatted to @cclerget also and as far as he is aware once the pid of a process is set in cgroups there is no way for the process and all its descendants to escape it unless there is a kernel bug somewhere.

I've changed the title of the issue to specify "Tadbit" and "Trinity" in case it helps anyone driving-by who might have come across something to notice this issue.

cclerget commented 4 years ago

@tabaer Have you checked that memory.use_hierarchy is set to 1 for the job cgroup?

tabaer commented 4 years ago

@tabaer Have you checked that memory.use_hierarchy is set to 1 for the job cgroup?

Yes, it is:

memory.use_hierarchy=1

cclerget commented 4 years ago

@tabaer And what if you adjust memory.memsw.limit_in_bytes to be, say, 8G above memory.limit_in_bytes? Maybe setting them to identical values triggers some strange path in the kernel.

Also, have you checked that running a memory stress app can trigger the OOM killer from inside a Singularity container? I see mention of memory consumer tests, but were they run inside a Singularity container?
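
Something crude along these lines would do, assuming coreutils and a shell in the image (sizes are illustrative; tail buffers the newline-free stream until the cgroup limit is hit):

singularity exec tadbit.sif /bin/sh -c 'head -c 70G /dev/zero | tail'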

tabaer commented 4 years ago

Also, have you checked that running a memory stress app can trigger the OOM killer from inside a Singularity container? I see mention of memory consumer tests, but were they run inside a Singularity container?

I just tried running my memory consumer test job inside the TADbit container. It appeared to be killed by the OOM killer as expected with minimal swapping.

cclerget commented 4 years ago

I just tried running my memory consumer test job inside the TADbit container. It appeared to be killed by the OOM killer as expected with minimal swapping.

That confirms the issue lies in the application itself (maybe in conjunction with a cgroup issue?). I would suggest testing other versions of those apps, older or newer, to see if that helps.

dtrudg commented 4 years ago

I'm going to close this issue now, as we got to the point of having decent confidence this is something specific to the application (likely its interaction with cgroups limits), rather than Singularity.

Please don't hesitate to re-open if any further troubleshooting points to a different conclusion. Thanks.