ceph / ceph-csi

cephcsi-cephfs nodeplugin pod got OOM killing when using FUSE mounter #554

Closed · yydzhou closed this 4 years ago

yydzhou commented 5 years ago

Describe the bug

When using the CephFS FUSE mounter in cephfs-csi, the cephcsi-cephfs nodeplugin pod is very easily OOM-killed. I have tried memory limits of 256M and then 1G, but the issue still happens.

Steps to reproduce

Steps to reproduce the behavior: deploy Ceph, then the cephfs-csi driver plus a StorageClass with `mounter: fuse`. Then create multiple CephFS PVs/PVCs and consume them with pods.
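
For context, the StorageClass referenced above might look roughly like this. This is only a sketch: the monitor addresses and pool are placeholders, secret-related parameters are omitted, and the driver name can differ by release; `mounter: fuse` is the setting relevant to this issue.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cephfs
# Driver name as registered by the cephfs CSI plugin (may differ by release).
provisioner: cephfs.csi.ceph.com
parameters:
  # Placeholder cluster details -- replace with real values; secret
  # parameters required by the driver are omitted here.
  monitors: 10.0.0.1:6789,10.0.0.2:6789
  pool: cephfs_data
  # The setting at issue: mount volumes with ceph-fuse instead of the
  # in-kernel CephFS client.
  mounter: fuse
reclaimPolicy: Delete
```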

Actual results

[167475.834265] Hardware name: RDO OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
[167475.837017] Call Trace:
[167475.838397] [<ffffffffb390e78e>] dump_stack+0x19/0x1b
[167475.840546] [<ffffffffb390a110>] dump_header+0x90/0x229
[167475.842691] [<ffffffffb34d805b>] ? cred_has_capability+0x6b/0x120
[167475.845070] [<ffffffffb3397c44>] oom_kill_process+0x254/0x3d0
[167475.847349] [<ffffffffb34d813e>] ? selinux_capable+0x2e/0x40
[167475.849595] [<ffffffffb340f326>] mem_cgroup_oom_synchronize+0x546/0x570
[167475.852109] [<ffffffffb340e7a0>] ? mem_cgroup_charge_common+0xc0/0xc0
[167475.854595] [<ffffffffb33984d4>] pagefault_out_of_memory+0x14/0x90
[167475.856997] [<ffffffffb3908232>] mm_fault_error+0x6a/0x157
[167475.859178] [<ffffffffb391b8c6>] __do_page_fault+0x496/0x4f0
[167475.861395] [<ffffffffb391ba06>] trace_do_page_fault+0x56/0x150
[167475.863692] [<ffffffffb391af92>] do_async_page_fault+0x22/0xf0
[167475.865966] [<ffffffffb39177b8>] async_page_fault+0x28/0x30
[167475.868162] Task in /kubepods/burstable/pod6c6d1804-bd30-11e9-9fb1-fa163e63ddad/725f282aadbd6b6c9b540a8d64ce719560cca6dd9db685e52538f4012f08061f killed as a result of limit of /kubepods/burstable/pod6c6d1804-bd30-11e9-9fb1-fa163e63ddad/725f282aadbd6b6c9b540a8d64ce719560cca6dd9db685e52538f4012f08061f
[167475.877505] memory: usage 1048576kB, limit 1048576kB, failcnt 54
[167475.879928] memory+swap: usage 1048576kB, limit 1048576kB, failcnt 0
[167475.882395] kmem: usage 10440kB, limit 9007199254740988kB, failcnt 0
[167475.884797] Memory cgroup stats for /kubepods/burstable/pod6c6d1804-bd30-11e9-9fb1-fa163e63ddad/725f282aadbd6b6c9b540a8d64ce719560cca6dd9db685e52538f4012f08061f: cache:0KB rss:1038136KB rss_huge:51200KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:1038064KB inactive_file:0KB active_file:0KB unevictable:0KB
[167475.975145] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[167475.977958] [30026] 0 30026 32644 3917 24 0 994 cephcsi-cephfs
[167475.980763] [32479] 0 32479 407518 83034 252 0 994 ceph-fuse
[167475.983480] [19628] 0 19628 525737 176645 467 0 994 ceph-fuse
[167475.986162] Memory cgroup out of memory: Kill process 23228 (ceph-fuse) score 1648 or sacrifice child
[167475.989086] Killed process 19628 (ceph-fuse) total-vm:2102948kB, anon-rss:700684kB, file-rss:5896kB, shmem-rss:0kB

Expected behavior

The nodeplugin should be more stable and should provide an option to set how much memory it may use.
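
For illustration, the kind of knob being asked about is the memory request/limit on the plugin container in the nodeplugin DaemonSet. A minimal sketch follows; the container name, image, and values are only examples and vary by chart release.

```yaml
# Excerpt from the cephfs nodeplugin DaemonSet spec (names and values are illustrative).
containers:
  - name: csi-cephfsplugin
    image: quay.io/cephcsi/cephfsplugin:v1.0.0
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 1Gi   # the 1G limit that the ceph-fuse processes exceeded in the log above
```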

ShyamsundarR commented 5 years ago

@yydzhou Is the test to only create pods that consume the PVCs or do the pods also perform any IO or filesystem operation?

cc @ajarr can you help understand the behavior here?

yydzhou commented 5 years ago

Yes, the pods will also perform I/O and filesystem operations.

Madhu-1 commented 5 years ago

@ajarr @poornimag PTAL

ajarr commented 5 years ago

@yydzhou, what are the versions of the Ceph cluster and the FUSE client (ceph-fuse)? The exact versions would be helpful to know, e.g. 14.2.x.

After how many PVCs have been consumed by pods doing I/O do you hit the OOM?

ajarr commented 5 years ago

@ShyamsundarR, are 256M and 1G typical memory limit settings for the CephFS and RBD node plugins?

ShyamsundarR commented 5 years ago

> @ShyamsundarR, are 256M and 1G typical memory limit settings for the CephFS and RBD node plugins?

I am not aware of the memory constraints or usage of RBD. I think we need to understand this better.

For example, switching to kernel cephfs and/or krbd instead of rbd-nbd may not charge the memory overhead to the container namespace (in this case the nodeplugin), and would further let the kernel manage space reclamation based on usage. I think this may be a better direction in the longer run (unless I have understood rbd-nbd incorrectly, i.e. where the blocks are cached).

The issue may come down to how to let CephFS FUSE know how much space it has to operate with for cached data (beyond the usual must-have memory footprint), whether we can share this across the various FUSE mounts, or whether we need to think about a single FUSE mount to control this consumption (i.e. #476).

cc @dillaman
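
For reference, the switch to the kernel client discussed above would be a one-line change in the StorageClass parameters from the sketch earlier, assuming the node kernels support it (see the CentOS 3.10 constraint mentioned in a later comment):

```yaml
parameters:
  # Use the in-kernel CephFS client; page-cache memory is then managed by the
  # kernel rather than charged to the nodeplugin pod's cgroup.
  mounter: kernel
```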

yydzhou commented 5 years ago

@ajarr The Ceph version is 13.2.4. The OOM happens when mounting the 3rd PV on the node. I am using the cephfs-csi chart release 1.0, so I assume the ceph-fuse version is 14.2.x? Once a PV is mounted the related pods will issue some I/O against it, but not that much, because the failure happens during the deployment of our cluster.

yydzhou commented 5 years ago

JFYI, we use ceph-fuse mounting to support CentOS with an old kernel (3.10). The external cephfs provisioner has an option to disable RADOS pool namespace isolation to allow mounts from old kernels (ref https://github.com/kubernetes-incubator/external-storage/commit/4fefaf622ecfa58508629bfd2b3eb11de101599f#diff-3ccb4687fb599e0453570308087f8252), but it seems cephfs-csi does not support that, so we have to use FUSE mounting until we have upgraded all our kernel versions.

ajarr commented 5 years ago

> @ShyamsundarR, are 256M and 1G typical memory limit settings for the CephFS and RBD node plugins?
>
> I am not aware of the memory constraints or usage of RBD. I think we need to understand this better.
>
> For example, switching to kernel cephfs and/or krbd instead of rbd-nbd may not charge the memory overhead to the container namespace (in this case the nodeplugin), and would further let the kernel manage space reclamation based on usage. I think this may be a better direction in the longer run (unless I have understood rbd-nbd incorrectly, i.e. where the blocks are cached).
>
> The issue may come down to how to let CephFS FUSE know how much space it has to operate with for cached data (beyond the usual must-have memory footprint), whether we can share this across the various FUSE mounts, or whether we need to think about a single FUSE mount to control this consumption (i.e. #476).
>
> cc @dillaman

@batrick FYI

ajarr commented 5 years ago

> @ajarr The Ceph version is 13.2.4. The OOM happens when mounting the 3rd PV on the node.

This is surprising. I'm not sure how other CephFS CSI v1.0.0 users haven't hit this.

> I am using the cephfs-csi chart release 1.0, so I assume the ceph-fuse version is 14.2.x? Once a PV is mounted the related pods will issue some I/O against it, but not that much, because the failure happens during the deployment of our cluster.

This is helpful information. I'll try tracking down the issue.

Madhu-1 commented 5 years ago

I have also seen this issue: with more I/O, the memory consumption of ceph-fuse increases. @poornimag can you confirm?

ajarr commented 5 years ago

@joscollin can you also take a look?

rochaporto commented 4 years ago

Any news on this one? We're seeing the same issue: even nodes with no PVs mounted go OOM after a couple of days. Even when dropping the resource requests, the nodes eventually die.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.