what other processes are running on the host?
How many CPU cores are available?
@mheon could it be different podman installations (different graphroots) stepping on each other's shm locks?
I don't think so - it shouldn't be possible for two Libpods to allocate the same lock.
More likely, IMO, is that we somehow ended up with more than one container/pod/volume using the same lock (probably 0). Does a podman system renumber help?
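For reference, a minimal sketch of trying the renumber (assuming no other podman processes are touching the storage at that moment; output omitted):
ps -ef | grep [p]odman     # confirm nothing else is running against libpod first
podman system renumber     # reallocate SHM lock assignments
podman ps -a               # containers should still be listed afterwards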
How many CPU cores are available?
[root@overcloud-computesriov-1 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 88
On-line CPU(s) list: 0-87
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
BIOS Model name: Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2799.743
BogoMIPS: 4399.93
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
[root@overcloud-computesriov-1 ~]#
Memory
[root@overcloud-computesriov-1 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          377Gi        33Gi       343Gi       8.0Mi       697Mi       342Gi
Swap:            0B          0B          0B
[root@overcloud-computesriov-1 ~]#
what other processes are running on the host?
It is a fresh deployment of RHOSP 16.2; the crash only happens when qemu is running on the node, otherwise everything works as expected.
Please also check a few of my findings. We have also observed the following logs on our system:
[root@overcloud-computesriov-1 ~]# grep panic /var/log/messages
Dec 27 09:59:59 overcloud-computesriov-1 podman[667957]: panic: operation not permitted
Dec 27 09:59:59 overcloud-computesriov-1 podman[667957]: panic(0x56227278c460, 0xc00053b0f0)
Dec 27 09:59:59 overcloud-computesriov-1 podman[667957]: #011/usr/lib/golang/src/runtime/panic.go:1064 +0x545 fp=0xc0003e3d78 sp=0xc0003e3cb0 pc=0x562270efc005
Dec 27 10:00:00 overcloud-computesriov-1 systemd-coredump[668152]: Process 667957 (podman) of user 0 dumped core.#012#012Stack trace of thread 668015:#012#0 0x0000562270f351e1 runtime.raise (podman)#012#1 0x0000562270f137b1 runtime.sigfwdgo (podman)#012#2 0x0000562270f11e14 runtime.sigtrampgo (podman)#012#3 0x0000562270f35583 runtime.sigtramp (podman)#012#4 0x00007f0e2e454b20 __restore_rt (libpthread.so.0)#012#5 0x0000562270f351e1 runtime.raise (podman)#012#6 0x0000562270efc6cd runtime.fatalpanic (podman)#012#7 0x0000562270efc005 runtime.gopanic (podman)#012#8 0x0000562271b0bd4f github.com/containers/podman/libpod/lock.(*SHMLock).Unlock (podman)#012#9 0x0000562271e58d98 github.com/containers/podman/libpod.(*Container).StopWithTimeout (podman)#012#10 0x000056227204a3f3 github.com/containers/podman/pkg/domain/infra/abi.(*ContainerEngine).ContainerStop.func1 (podman)#012#11 0x0000562271fad890 github.com/containers/podman/pkg/parallel/ctr.ContainerOp.func1 (podman)#012#12 0x0000562271e375d8 github.com/containers/podman/pkg/parallel.Enqueue.func1 (podman)#012#13 0x0000562270f339c1 runtime.goexit (podman)
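As a side note, since systemd-coredump caught the crash, the full backtrace should be retrievable from the stored dump; a rough sketch (PID taken from the log line above):
coredumpctl list podman    # locate the stored core for the crashed podman process
coredumpctl info 667957    # metadata plus the captured stack traces
coredumpctl gdb 667957     # open the dump in gdb for a full backtrace of every thread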
We are also unable to understand why podman restarted the container when we tried to stop it; see the outputs below:
[root@overcloud-computesriov-1 ~]# podman ps | grep libvirt
4bc4078832a7 manager.ctlplane.example.com:8787/rhosp-rhel8/openstack-nova-libvirt:16.2 kolla_start 10 days ago Up 17 hours ago nova_virtlogd
038a939cedc7 manager.ctlplane.example.com:8787/rhosp-rhel8/openstack-nova-libvirt:16.2 kolla_start 10 days ago Up 17 hours ago nova_libvirt
[root@overcloud-computesriov-1 ~]#
[root@overcloud-computesriov-1 ~]#
[root@overcloud-computesriov-1 ~]# podman stop nova_libvirt
ERRO[0000] Failed to remove paths: map[hugetlb:/sys/fs/cgroup/hugetlb/machine.slice/libpod-038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6.scope name=systemd:/sys/fs/cgroup/systemd/machine.slice/libpod-038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6.scope pids:/sys/fs/cgroup/pids/machine.slice/libpod-038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6.scope]
038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6
# podman restarted here
[root@overcloud-computesriov-1 ~]# podman ps | grep libvirt
4bc4078832a7 manager.ctlplane.example.com:8787/rhosp-rhel8/openstack-nova-libvirt:16.2 kolla_start 10 days ago Up 17 hours ago nova_virtlogd
038a939cedc7 manager.ctlplane.example.com:8787/rhosp-rhel8/openstack-nova-libvirt:16.2 kolla_start 10 days ago Up 3 seconds ago nova_libvirt
[root@overcloud-computesriov-1 ~]# podman stop nova_libvirt
038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6
panic: operation not permitted
goroutine 79 [running]:
panic(0x55b0faf20ba0, 0xc00038f810)
/usr/lib/golang/src/runtime/panic.go:1064 +0x545 fp=0xc0003bdd78 sp=0xc0003bdcb0 pc=0x55b0f969eda5
github.com/containers/podman/libpod/lock.(*SHMLock).Unlock(0xc0003cdcd0)
/builddir/build/BUILD/containers-podman-ad1aaba/_build/src/github.com/containers/podman/libpod/lock/shm_lock_manager_linux.go:121 +0x8f fp=0xc0003bdda8 sp=0xc0003bdd78 pc=0x55b0fa2a310f
:
:
:
/usr/lib/golang/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0000137d8 sp=0xc0000137d0 pc=0x55b0f96d66c1
created by github.com/containers/podman/vendor/github.com/cri-o/ocicni/pkg/ocicni.initCNI
/builddir/build/BUILD/containers-podman-ad1aaba/_build/src/github.com/containers/podman/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go:250 +0x3b1
Aborted (core dumped)
[root@overcloud-computesriov-1 ~]# podman ps -a | grep libvirt
4bc4078832a7 manager.ctlplane.example.com:8787/rhosp-rhel8/openstack-nova-libvirt:16.2 kolla_start 10 days ago Up 17 hours ago nova_virtlogd
038a939cedc7 manager.ctlplane.example.com:8787/rhosp-rhel8/openstack-nova-libvirt:16.2 kolla_start 10 days ago stopping nova_libvirt
84e1b8b1aafb manager.ctlplane.example.com:8787/rhosp-rhel8/openstack-nova-libvirt:16.2 /bin/bash -c /usr... 10 days ago Exited (0) 10 days ago nova_libvirt_init_secret
[root@overcloud-computesriov-1 ~]#
After this I can't perform any action on the container; see the output below:
[root@overcloud-computesriov-1 ~]# podman stop nova_libvirt
Error: can only stop created or running containers. 038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6 is in state stopping: container state improper
[root@overcloud-computesriov-1 ~]#
[root@overcloud-computesriov-1 ~]# podman start nova_libvirt
Error: unable to start container "038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6": container 038a939cedc7eccd0c05e0372a84a238a823fddc9558e92f0b004aea673368d6 must be in Created or Stopped state to be started: container state improper
[root@overcloud-computesriov-1 ~]#
For now, a reboot is the only way to bring the container back to a proper state.
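Before resorting to a reboot, a couple of hedged options that might be worth trying first (no guarantee they succeed while the container is wedged in the stopping state):
podman rm -f nova_libvirt     # force-remove the stuck container; it would then need to be recreated by the deployment tooling
podman system renumber        # reallocate the SHM locks, as suggested earlier in the thread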
Are you sure it's Podman restarting the container? Openstack makes extensive use of systemd-managed containers and I view it as likely that this is one of them, so systemd may have decided to restart the container on detecting it stopping.
The Stopping state issue should be fixed in newer Podmans and backported to the 3.0 stream used by Openstack, but I'm not sure when that fix will be arriving.
Regardless, I strongly recommend you open a BZ about this. There are a lot more moving parts here than just Podman (in an Openstack environment, Podman is heavily orchestrated by systemd) and I think we'll need debugging assistance from the OSP team on this.
A friendly reminder that this issue had no activity for 30 days.
It is due to the cgroup not being cleaned up properly when the user stops the container. So when the nova_libvirt container is started again, it tries to create a new cgroup and conflicts with the previous one. The issue only occurs when a VM is running on that compute node.
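This lines up with the "Failed to remove paths" error shown earlier. A quick, hedged way to check whether the old libpod scope directories are still present after a stop (container ID taken from that error message):
ls -d /sys/fs/cgroup/*/machine.slice/libpod-038a939cedc7*.scope 2>/dev/null
systemd-cgls machine.slice    # list the cgroup tree systemd still tracks under machine.slice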
systemd-container (the package that ships systemd-machined): in RHOSP 16.1, systemd-machined is running by default, while in RHOSP 16.2 systemd-machined is not running, so it needs to be started manually.
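If systemd-machined is indeed the missing piece on 16.2, starting it by hand is straightforward (whether to enable it persistently is a deployment decision):
systemctl status systemd-machined   # check whether it is running
systemctl start systemd-machined    # start it for the current boot
systemctl enable systemd-machined   # optionally, start it on every boot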
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
Steps to reproduce the issue:
NOTE: The issue only occurs when a VM is in the Active state on the compute node; otherwise there is no issue. The issue is also reproducible after a podman update.
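A minimal reproduction sketch pieced together from the behaviour described in this thread (the report itself does not list explicit steps):
# on the compute node, with at least one VM in Active state:
podman stop nova_libvirt    # first stop logs the cgroup "Failed to remove paths" error, then systemd restarts the container
podman stop nova_libvirt    # second stop hits the SHMLock.Unlock panic and dumps core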
Describe the results you received:
Crash logs
Describe the results you expected:
Podman should not crash and should work properly.
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
Output of podman info --debug:
Package info (e.g. output of rpm -q podman or apt list podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)
Yes, and the same issue occurs.
Additional environment details (AWS, VirtualBox, physical, etc.):
RHOSP 16.2, baremetal environment