coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/
147 stars 30 forks source link

Failing transient systemd mount units #2379

Open redbaron opened 6 years ago

redbaron commented 6 years ago

Issue Report

Bug

When stress testing our iSCSI setup I created tens of pods with dynamically provisioned PVs and removed them in a loop. Systemd transient mount errors start to accumulate:

Failed Units: 12
  var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d26.mount
  var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d36.mount
  var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d42.mount
  var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d43.mount
  var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d49.mount
  var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d50.mount
  var-lib-kubelet-pods-208918f0\x2d2223\x2d11e8\x2d9e43\x2d005056bced2e-volumes-kubernetes.io\x7eiscsi-dev\x2dapps\x2dsigpage\x2ddev\x2dtest\x2dabc\x2d10\x2d1afbal\x2dpostgresql\x2d2085e.mount
  var-lib-kubelet-pods-2704c935\x2d2223\x2d11e8\x2d9e43\x2d005056bced2e-volumes-kubernetes.io\x7eiscsi-dev\x2dapps\x2dsigpage\x2ddev\x2dtest\x2dabc\x2d13\x2dm07prw\x2dminio\x2d27015.mount
  var-lib-kubelet-pods-45107468\x2d2223\x2d11e8\x2d9e43\x2d005056bced2e-volumes-kubernetes.io\x7eiscsi-dev\x2dapps\x2dsigpage\x2ddev\x2dtest\x2dabc\x2d16\x2dv2w5ms\x2dpostgresql\x2d450c6.mount
  var-lib-kubelet-pods-452a7678\x2d2223\x2d11e8\x2d9e43\x2d005056bced2e-volumes-kubernetes.io\x7eiscsi-dev\x2dapps\x2dsigpage\x2ddev\x2dtest\x2dabc\x2d16\x2dv2w5ms\x2dminio\x2d4526b.mount
  var-lib-kubelet-pods-c32f9473\x2d2222\x2d11e8\x2d9e43\x2d005056bced2e-volumes-kubernetes.io\x7eiscsi-dev\x2dapps\x2dsigpage\x2ddev\x2dtest\x2dabc\x2d6\x2d86xc2v\x2dpostgresql\x2dc32c2.mount
  var-lib-kubelet-pods-f14e704b\x2d2222\x2d11e8\x2d9e43\x2d005056bced2e-volumes-kubernetes.io\x7eiscsi-dev\x2dapps\x2dsigpage\x2ddev\x2dtest\x2dabc\x2d9\x2d71kgni\x2dpostgresql\x2df14b7.mount
systemctl status --no-pager -l 'var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d42.mount'           
● var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d42.mount - /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.1.72.201:3260-iqn.1992-08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13-lun-42
   Loaded: loaded (/proc/self/mountinfo)
   Active: failed (Result: resources)
    Where: /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.1.72.201:3260-iqn.1992-08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13-lun-42

Mar 07 16:32:19 erepnk13 systemd[1]: var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d42.mount: Failed to run 'mount' task: No such file or directory
Mar 07 16:32:19 erepnk13 systemd[1]: var-lib-kubelet-plugins-kubernetes.io-iscsi-iface\x2ddefault-10.1.72.201:3260\x2diqn.1992\x2d08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13\x2dlun\x2d42.mount: Failed with result 'resources'.
Mar 07 16:32:19 erepnk13 systemd[1]: Failed to mount /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.1.72.201:3260-iqn.1992-08.com.netapp:sn.440d6a6ad6a311e79b3500a0981e9162:vs.13-lun-42.

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1688.2.0+jenkins2-build-1688-1225
VERSION_ID=1688.2.0
BUILD_ID=jenkins2-build-1688-1225
PRETTY_NAME="Container Linux by CoreOS 1688.2.0+jenkins2-build-1688-1225 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Baremetal + NetApp ONTAP over iSCSI for PersistentVolumes

Other Information

Looks like https://github.com/kubernetes/kubernetes/issues/57345 which points to https://github.com/systemd/systemd/issues/7798 , maybe consider backporting the patch?

redbaron commented 6 years ago

I just noticed, that all mounts are from iSCSI session, which kubelet logs out from when last PV is removed. Probably it does it too quick, so systemd has no chance to settle, so probably it is entirely different bug, from what I initially thought.

ajeddeloh commented 6 years ago

It looks like that patch is in v237 which is in 1688.2.0, so I'd agree its probably a different bug. I assume the directory it's trying to mount to does in fact exist?

redbaron commented 6 years ago

Something is definitely happen on iSCSI logout :( Tried multiple times, sometimes it ends up in following error in dmesg:

Mar 08 13:22:59 erepnk12 kernel: INFO: task systemd-udevd:7291 blocked for more than 120 seconds.
Mar 08 13:22:59 erepnk12 kernel:       Not tainted 4.14.24-coreos #1
Mar 08 13:22:59 erepnk12 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 08 13:22:59 erepnk12 kernel: systemd-udevd   D    0  7291    819 0x80000186
Mar 08 13:22:59 erepnk12 kernel: Call Trace:
Mar 08 13:22:59 erepnk12 kernel:  ? __schedule+0x294/0x8a0
Mar 08 13:22:59 erepnk12 kernel:  schedule+0x28/0x80
Mar 08 13:22:59 erepnk12 kernel:  io_schedule+0x12/0x40
Mar 08 13:22:59 erepnk12 kernel:  __lock_page+0x103/0x140
Mar 08 13:22:59 erepnk12 kernel:  ? page_cache_tree_insert+0xc0/0xc0
Mar 08 13:22:59 erepnk12 kernel:  truncate_inode_pages_range+0x54c/0x7b0
Mar 08 13:22:59 erepnk12 kernel:  ? smp_call_function_many+0xa4/0x250
Mar 08 13:22:59 erepnk12 kernel:  ? on_each_cpu_mask+0x23/0x60
Mar 08 13:22:59 erepnk12 kernel:  ? __brelse+0x20/0x20
Mar 08 13:22:59 erepnk12 kernel:  ? on_each_cpu_mask+0x23/0x60
Mar 08 13:22:59 erepnk12 kernel:  ? on_each_cpu_cond+0xa0/0xd0
Mar 08 13:22:59 erepnk12 kernel:  __blkdev_put+0x71/0x1f0
Mar 08 13:22:59 erepnk12 kernel:  blkdev_close+0x21/0x30
Mar 08 13:22:59 erepnk12 kernel:  __fput+0xd8/0x220
Mar 08 13:22:59 erepnk12 kernel:  task_work_run+0x8a/0xb0
Mar 08 13:22:59 erepnk12 kernel:  do_exit+0x2e3/0xaf0
Mar 08 13:22:59 erepnk12 kernel:  ? touch_atime+0xc8/0xe0
Mar 08 13:22:59 erepnk12 kernel:  do_group_exit+0x3a/0xa0
Mar 08 13:22:59 erepnk12 kernel:  get_signal+0x269/0x570
Mar 08 13:22:59 erepnk12 kernel:  do_signal+0x36/0x610
Mar 08 13:22:59 erepnk12 kernel:  ? __vfs_read+0xfe/0x150
Mar 08 13:22:59 erepnk12 kernel:  ? __audit_syscall_exit+0x230/0x2b0
Mar 08 13:22:59 erepnk12 kernel:  exit_to_usermode_loop+0x69/0xa0
Mar 08 13:22:59 erepnk12 kernel:  do_syscall_64+0x104/0x120
Mar 08 13:22:59 erepnk12 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2