cri-o / cri-o

Open Container Initiative-based implementation of Kubernetes Container Runtime Interface
https://cri-o.io
Apache License 2.0

crio scope failure on teardown with shortlived / pods in error state #3959

Closed · jdconti closed this issue 2 years ago

jdconti commented 4 years ago

Description: A certain workload is triggering a bug in crio where a short-lived / errored pod is timing out on teardown:

[root@kubeletl055 ~]# systemctl --failed
  UNIT                                                                        LOAD   ACTIVE SUB    DESCRIPTION
● crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope loaded failed failed libcontainer container 00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06
● crio-00d42a28bf820424f25c827879fe47b30e6ede432cc899281390f99b356fd177.scope loaded failed failed libcontainer container 00d42a28bf820424f25c827879fe47b30e6ede432cc899281390f99b356fd177
● crio-128f3ce10b417acd55036ebadf858607bbc66cffb3f1bf9fdcd8b1e0e8687aa0.scope loaded failed failed libcontainer container 128f3ce10b417acd55036ebadf858607bbc66cffb3f1bf9fdcd8b1e0e8687aa0
● crio-12e4978ef3ba876fc398cb57be71325da31d54f3f8ad90a2bd9f2d4a67972430.scope loaded failed failed libcontainer container 12e4978ef3ba876fc398cb57be71325da31d54f3f8ad90a2bd9f2d4a67972430
● crio-2ae2535100ba3ce0f9ad5957ad54fa82303feedea7f2b4df66fdfe322f138795.scope loaded failed failed libcontainer container 2ae2535100ba3ce0f9ad5957ad54fa82303feedea7f2b4df66fdfe322f138795

Here's the high-level overview:

[root@kubeletl055 ~]# systemctl status crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope
● crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope - libcontainer container 00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06
   Loaded: loaded (/run/systemd/system/crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope.d
           └─50-BlockIOAccounting.conf, 50-CPUAccounting.conf, 50-CPUShares.conf, 50-DefaultDependencies.conf, 50-Delegate.conf, 50-Description.conf, 50-MemoryAccounting.conf, 50-Slice.conf
   Active: failed (Result: timeout) since Thu 2020-07-16 08:58:55 CDT; 31min ago

Jul 16 08:57:20 kubeletl055 systemd[1]: Started libcontainer container 00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.
Jul 16 08:57:25 kubeletl055 systemd[1]: Stopping libcontainer container 00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.
Jul 16 08:58:55 kubeletl055 systemd[1]: crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope stopping timed out. Killing.
Jul 16 08:58:55 kubeletl055 systemd[1]: Stopped libcontainer container 00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.
Jul 16 08:58:55 kubeletl055 systemd[1]: Unit crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope entered failed state.

Steps to reproduce the issue:

  1. Using a JFrog pipeline, configure the max number of concurrent agents
  2. Trigger job(s) that would exceed the max number of allowable agents
  3. Observe jfrog error in jfrog-jenkins pod logs:
    2020-07-16 14:15:56.562+0000 [id=32]    INFO    o.c.j.p.k.KubernetesCloud#addProvisionedSlave: Maximum number of concurrently running agent pods (10) reached for Kubernetes Cloud kubernetes, not provisioning: 13 running or pending in namespace interns with Kubernetes labels {jenkins/jfrog-jenkins-jenkins-slave=true}
  4. This results in the newly created pods ending up in Error state (a generic repro sketch follows these steps):
    default-99drp                           0/1     Error     0          16s
    default-g63f3                           0/1     Error     0          16s
    default-nwvj7                           0/1     Error     0          16s
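
For anyone without a JFrog/Jenkins setup handy, the trigger appears to just be pods that exit almost immediately and land in Error state, so a rough generic sketch along these lines should exercise the same teardown path (pod name, image and grace period below are purely illustrative, not part of the original report):

kubectl run shortlived-repro --image=busybox --restart=Never --command -- sh -c 'sleep 1; exit 1'
kubectl delete pod shortlived-repro --grace-period=30
# then, on the node the pod landed on:
systemctl --failed | grep crio-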

Describe the results you received: Relevant logs from the start of teardown:

Jul 16 08:55:43 kubeletl055 systemd: Stopping libcontainer container e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78.
Jul 16 08:55:43 kubeletl055 kubelet: W0716 08:55:43.683436  141354 pod_container_deletor.go:75] Container "e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" not found in pod's containers
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.490 [INFO][221914] plugin.go 496: Extracted identifiers ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" Node="kubeletl055" Orchestrator="k8s" WorkloadEndpoint="kubeletl055-k8s-default--jpd9t-eth0"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.491 [WARNING][221914] k8s.go 444: WorkloadEndpoint does not exist in the datastore, moving forward with the clean up ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" WorkloadEndpoint="kubeletl055-k8s-default--jpd9t-eth0"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.491 [INFO][221914] k8s.go 474: Cleaning up netns ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.491 [INFO][221914] network_linux.go 477: veth does not exist, no need to clean up. ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" ifName="eth0"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.491 [INFO][221914] k8s.go 481: Releasing IP address(es) ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.491 [INFO][221914] utils.go 168: Calico CNI releasing IP address ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.505 [INFO][221995] ipam_plugin.go 302: Releasing address using handleID ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" HandleID="k8s-pod-network.e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" Workload="kubeletl055-k8s-default--jpd9t-eth0"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.505 [INFO][221995] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78'
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.519 [WARNING][221995] ipam.go 1234: CAS error for block, retry #0: update conflict: IPAMBlock(7-150-215-0-24) cidr=7.150.215.0/24 handle="k8s-pod-network.e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.525 [WARNING][221995] ipam.go 1234: CAS error for block, retry #1: update conflict: IPAMBlock(7-150-215-0-24) cidr=7.150.215.0/24 handle="k8s-pod-network.e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.545 [INFO][221995] ipam_plugin.go 314: Released address using handleID ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" HandleID="k8s-pod-network.e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" Workload="kubeletl055-k8s-default--jpd9t-eth0"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.545 [INFO][221995] ipam_plugin.go 323: Releasing address using workloadID ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" HandleID="k8s-pod-network.e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78" Workload="kubeletl055-k8s-default--jpd9t-eth0"
Jul 16 08:55:48 kubeletl055 crio: 2020-07-16 08:55:47.547 [INFO][221914] k8s.go 487: Teardown processing complete. ContainerID="e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78"
Jul 16 08:56:38 kubeletl055 kubelet: E0716 08:56:38.167548  141354 fsHandler.go:118] failed to collect filesystem stats - rootDiskErr: could not stat "/var/lib/containers/storage/overlay/f8dd6459b121c7b7cc33abe6eaddee699b7294753d49e2d1c628c3bb7279453d/diff" to get inode usage: stat /var/lib/containers/storage/overlay/f8dd6459b121c7b7cc33abe6eaddee699b7294753d49e2d1c628c3bb7279453d/diff: no such file or directory, extraDiskErr: could not stat "/var/log/pods/interns_default-jpd9t_3f7835e7-e046-45b7-9bb1-17eeca1d0381/e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78.log" to get inode usage: stat /var/log/pods/interns_default-jpd9t_3f7835e7-e046-45b7-9bb1-17eeca1d0381/e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78.log: no such file or directory
Jul 16 08:57:13 kubeletl055 systemd: crio-e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78.scope stopping timed out. Killing.
Jul 16 08:57:13 kubeletl055 systemd: Stopped libcontainer container e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78.
Jul 16 08:57:13 kubeletl055 systemd: Unit crio-e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78.scope entered failed state.

Describe the results you expected: No scope failures

Additional information you deem important (e.g. issue happens only occasionally):

Output of crio --version:

[root@kubeletl055 ~]# crio --version
crio version 1.17.4
commit: "16a05aafdf5d8fb84942ca60497818949b711cbc"

Additional environment details (AWS, VirtualBox, physical, etc.): Physical - k8s 1.17.4

haircommander commented 4 years ago

hey @jdconti thanks for opening the issue

so you're saying you expect the pod to fail, but it's not failing gracefully? does the application ignore SIGTERM? also, what's the full output of systemctl --failed? I think it'd be helpful to have the full description.

jdconti commented 4 years ago

@haircommander it's surely an incorrect configuration on the jfrog-jenkins instance that's resulting in this weird failure mode. However, it's clearly tickling a bug in the crio teardown. I've posted the full output of systemctl --failed above but here it is again with many more scope failures since initially posting:

[root@kubeletl055 ~]# systemctl --failed
  UNIT                                                                        LOAD   ACTIVE SUB    DESCRIPTION
● crio-00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06.scope loaded failed failed libcontainer container 00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06
● crio-00d42a28bf820424f25c827879fe47b30e6ede432cc899281390f99b356fd177.scope loaded failed failed libcontainer container 00d42a28bf820424f25c827879fe47b30e6ede432cc899281390f99b356fd177
● crio-0a845712016e7c101ed6db152c301f09c7b5654298602eb63ac5cc7938bf887a.scope loaded failed failed libcontainer container 0a845712016e7c101ed6db152c301f09c7b5654298602eb63ac5cc7938bf887a
● crio-0bed9b81c14d64a949b9573144e6cb4899f5c5cb270d12feb7e6c09c834d8d17.scope loaded failed failed libcontainer container 0bed9b81c14d64a949b9573144e6cb4899f5c5cb270d12feb7e6c09c834d8d17
● crio-128f3ce10b417acd55036ebadf858607bbc66cffb3f1bf9fdcd8b1e0e8687aa0.scope loaded failed failed libcontainer container 128f3ce10b417acd55036ebadf858607bbc66cffb3f1bf9fdcd8b1e0e8687aa0
● crio-12e4978ef3ba876fc398cb57be71325da31d54f3f8ad90a2bd9f2d4a67972430.scope loaded failed failed libcontainer container 12e4978ef3ba876fc398cb57be71325da31d54f3f8ad90a2bd9f2d4a67972430
● crio-152d82fd5fa781dfe42855f74d7c340d4d29d62dd08b06fda54617b2b2ec2c0a.scope loaded failed failed libcontainer container 152d82fd5fa781dfe42855f74d7c340d4d29d62dd08b06fda54617b2b2ec2c0a
● crio-27881b2ee679ee77671fde47a51c63ae2dcc9892696f0edaa5201f89754d9122.scope loaded failed failed libcontainer container 27881b2ee679ee77671fde47a51c63ae2dcc9892696f0edaa5201f89754d9122
● crio-2ae2535100ba3ce0f9ad5957ad54fa82303feedea7f2b4df66fdfe322f138795.scope loaded failed failed libcontainer container 2ae2535100ba3ce0f9ad5957ad54fa82303feedea7f2b4df66fdfe322f138795
● crio-2baf6d43d81c2ee885a42c9966852c3ef5d9722715127e0bbbc547e3df605c17.scope loaded failed failed libcontainer container 2baf6d43d81c2ee885a42c9966852c3ef5d9722715127e0bbbc547e3df605c17
● crio-30882963f0e4238ab38f07ea71ecd824e52ff24552a170987146a77fd9365fe5.scope loaded failed failed libcontainer container 30882963f0e4238ab38f07ea71ecd824e52ff24552a170987146a77fd9365fe5
● crio-32610f32974488f1eea10b98de9546d583e116cfca44772d00522ba18b628346.scope loaded failed failed libcontainer container 32610f32974488f1eea10b98de9546d583e116cfca44772d00522ba18b628346
● crio-32bbf85f07652a711f37fe20a121495c4ba1f87beda7da417d0ce1a4154d85a3.scope loaded failed failed libcontainer container 32bbf85f07652a711f37fe20a121495c4ba1f87beda7da417d0ce1a4154d85a3
● crio-3a988aa3b1e213652ab01d3120a366412f30f41c2dbf276ed1859203d9cb30f0.scope loaded failed failed libcontainer container 3a988aa3b1e213652ab01d3120a366412f30f41c2dbf276ed1859203d9cb30f0
● crio-3bf2707908a84f6da733624f18cd93ca189cec18c3599c5c8bcd23d6a595f5f2.scope loaded failed failed libcontainer container 3bf2707908a84f6da733624f18cd93ca189cec18c3599c5c8bcd23d6a595f5f2
● crio-3fa42574da81f33390831815f4418b032c473c77fc36b7a2d62a8e421e77f08a.scope loaded failed failed libcontainer container 3fa42574da81f33390831815f4418b032c473c77fc36b7a2d62a8e421e77f08a
● crio-468dde7358cfd91e0420e674b47014629880fb37c86ebf6e09ac5a7db5861550.scope loaded failed failed libcontainer container 468dde7358cfd91e0420e674b47014629880fb37c86ebf6e09ac5a7db5861550
● crio-4b70641664cc4b2f35aca5b1723bf5702ffd6c32550eb8a6c88fb56aa933fe6a.scope loaded failed failed libcontainer container 4b70641664cc4b2f35aca5b1723bf5702ffd6c32550eb8a6c88fb56aa933fe6a
● crio-4dde8aa6a5439cc989add180d2d07fe0532c50e8d40cdbf9934da00b2108da2d.scope loaded failed failed libcontainer container 4dde8aa6a5439cc989add180d2d07fe0532c50e8d40cdbf9934da00b2108da2d
● crio-545fa64ec56f2ee9f0cb7743e431c60174c43d4d9b1b250fb643c98ee8bf42f8.scope loaded failed failed libcontainer container 545fa64ec56f2ee9f0cb7743e431c60174c43d4d9b1b250fb643c98ee8bf42f8
● crio-54780c76e49fa37d0ad7de87c886157539c7fbac197aa07bd0890c2c873fa6f4.scope loaded failed failed libcontainer container 54780c76e49fa37d0ad7de87c886157539c7fbac197aa07bd0890c2c873fa6f4
● crio-55b835e5211a5b22f2f89da9773095e34615454ac19891f06cedc76e066d6d7c.scope loaded failed failed libcontainer container 55b835e5211a5b22f2f89da9773095e34615454ac19891f06cedc76e066d6d7c
● crio-57e23bff07ad696e4b4ab5e346ad5615bdd9886f05b8e7413068dc6880d180ba.scope loaded failed failed libcontainer container 57e23bff07ad696e4b4ab5e346ad5615bdd9886f05b8e7413068dc6880d180ba
● crio-5c6d21772815f494c5097ad534384c029d6b073ad85eae71756a306138391963.scope loaded failed failed libcontainer container 5c6d21772815f494c5097ad534384c029d6b073ad85eae71756a306138391963
● crio-647194665f57e6e7399987243e7d4e2a884db6085adaa8fe9413e2bac777d6d9.scope loaded failed failed libcontainer container 647194665f57e6e7399987243e7d4e2a884db6085adaa8fe9413e2bac777d6d9
● crio-65e6ff4c5ab68a2f0c0b3d313aa242db37cd71cfec23e6362068b4ba399965c8.scope loaded failed failed libcontainer container 65e6ff4c5ab68a2f0c0b3d313aa242db37cd71cfec23e6362068b4ba399965c8
● crio-66b117d316e3fabc2927f3378a2c21f5c70a0fd404a2d2714d9ed7db5cc5ecec.scope loaded failed failed libcontainer container 66b117d316e3fabc2927f3378a2c21f5c70a0fd404a2d2714d9ed7db5cc5ecec
● crio-7b720c4119f9a398f9aa8109502532b6fd158c20a810aa7d282bc036f8aef625.scope loaded failed failed libcontainer container 7b720c4119f9a398f9aa8109502532b6fd158c20a810aa7d282bc036f8aef625
● crio-7c038d51c6b66c568aa15497f274cc958e6ad7d32776654fe01ce44001a8905d.scope loaded failed failed libcontainer container 7c038d51c6b66c568aa15497f274cc958e6ad7d32776654fe01ce44001a8905d
● crio-7f2c73f91acb59cc19f8d0745ceca3c946f29b144073cd7aebac4aeb70263bf1.scope loaded failed failed libcontainer container 7f2c73f91acb59cc19f8d0745ceca3c946f29b144073cd7aebac4aeb70263bf1
● crio-7fdfb2d6ecd5ab0c109c452c331991bb5bffd1516778de1145deea6a2d88b546.scope loaded failed failed libcontainer container 7fdfb2d6ecd5ab0c109c452c331991bb5bffd1516778de1145deea6a2d88b546
● crio-84cc6ff88176cfe7fe09efd409823b6cc2f4ef3ba06a40d2a922bffaa9f3e7a7.scope loaded failed failed libcontainer container 84cc6ff88176cfe7fe09efd409823b6cc2f4ef3ba06a40d2a922bffaa9f3e7a7
● crio-86807e54ea18fd386848a009d104817535c95ad59556fa49e626cbc3ec8622e8.scope loaded failed failed libcontainer container 86807e54ea18fd386848a009d104817535c95ad59556fa49e626cbc3ec8622e8
● crio-8b15a7bf5daee9c39c24a9112b53b3325a22944a0110154590eecce93a221f24.scope loaded failed failed libcontainer container 8b15a7bf5daee9c39c24a9112b53b3325a22944a0110154590eecce93a221f24
● crio-8bfd04fed560e1e2d34429f0dfd20b3e4776393b9b38c26099f88b4631c6ce4b.scope loaded failed failed libcontainer container 8bfd04fed560e1e2d34429f0dfd20b3e4776393b9b38c26099f88b4631c6ce4b
● crio-8ef931bc01343256e2c95ddd471c4a427e2ffcfa16b8d8a6a91119e4ef9b0bc2.scope loaded failed failed libcontainer container 8ef931bc01343256e2c95ddd471c4a427e2ffcfa16b8d8a6a91119e4ef9b0bc2
● crio-91a21e26ca641ab7bd7016d97b82adf611cd9b85e4d8bdaf6bbd253dc212e3a4.scope loaded failed failed libcontainer container 91a21e26ca641ab7bd7016d97b82adf611cd9b85e4d8bdaf6bbd253dc212e3a4
● crio-91af1a3c00f2728c7142f0e9455ee8a4c6526b316e85fc3d4393ac382dff3692.scope loaded failed failed libcontainer container 91af1a3c00f2728c7142f0e9455ee8a4c6526b316e85fc3d4393ac382dff3692
● crio-9d264ba64893ece3c8fea17d328fdac8c283cfb680b8f3b63269320ecffbfc0c.scope loaded failed failed libcontainer container 9d264ba64893ece3c8fea17d328fdac8c283cfb680b8f3b63269320ecffbfc0c
● crio-9eb5b25a1ec3fa749bc0761fca4d8c1fd5c90b980648b36e0cccb7f60804af42.scope loaded failed failed libcontainer container 9eb5b25a1ec3fa749bc0761fca4d8c1fd5c90b980648b36e0cccb7f60804af42
● crio-a7a63aa400d5cddab8ef529bdb4632ebbe3c2949c899936d021a135080d000b5.scope loaded failed failed libcontainer container a7a63aa400d5cddab8ef529bdb4632ebbe3c2949c899936d021a135080d000b5
● crio-a7b34b013b6dfc1bd8a7911a389035818f41ffba453ac93569f8ff32aa8cdfac.scope loaded failed failed libcontainer container a7b34b013b6dfc1bd8a7911a389035818f41ffba453ac93569f8ff32aa8cdfac
● crio-aa2dfd59abfb2eb0f5bef0783b918f869aeb6294c49dadb891545a738321a9ca.scope loaded failed failed libcontainer container aa2dfd59abfb2eb0f5bef0783b918f869aeb6294c49dadb891545a738321a9ca
● crio-af380e1e340a97b7d6d6445ffd0f5884e6e0b7999b3fc1cfae6a148a5057a891.scope loaded failed failed libcontainer container af380e1e340a97b7d6d6445ffd0f5884e6e0b7999b3fc1cfae6a148a5057a891
● crio-b1460fe88ef8f5cc37e31de06734972d68ae259fd0d61b93050e422a8e89cba1.scope loaded failed failed libcontainer container b1460fe88ef8f5cc37e31de06734972d68ae259fd0d61b93050e422a8e89cba1
● crio-b63233ce05bc4d472935faf5632747312b1d5ef0398aefeb6bdcc3c662bde5bf.scope loaded failed failed libcontainer container b63233ce05bc4d472935faf5632747312b1d5ef0398aefeb6bdcc3c662bde5bf
● crio-b98ed22055cf3179ecd8ce4d6d1bb522d6ce2f28f1517e7be4e1166a00531051.scope loaded failed failed libcontainer container b98ed22055cf3179ecd8ce4d6d1bb522d6ce2f28f1517e7be4e1166a00531051
● crio-bb2dcb484369c53b71e98a2fbba07abbd987b50a9df89b2ae1c88551721c5a17.scope loaded failed failed libcontainer container bb2dcb484369c53b71e98a2fbba07abbd987b50a9df89b2ae1c88551721c5a17
● crio-c0a10215ba418f5943e016b3282cf08ebdc84dd12f42018945d36cb47d7571d6.scope loaded failed failed libcontainer container c0a10215ba418f5943e016b3282cf08ebdc84dd12f42018945d36cb47d7571d6
● crio-c0cc3723033d707cc74da598fe86f88b19518b579a3abeb575f1e0b9b34747ad.scope loaded failed failed libcontainer container c0cc3723033d707cc74da598fe86f88b19518b579a3abeb575f1e0b9b34747ad
● crio-c8ac8a186754f4123b24d871593faeb3c9427d9de15a532a42943f171ae172db.scope loaded failed failed libcontainer container c8ac8a186754f4123b24d871593faeb3c9427d9de15a532a42943f171ae172db
● crio-c8ef63f023d016687c9b53738a8f9aea0a6b80604a6dfa5e62181f73996016fc.scope loaded failed failed libcontainer container c8ef63f023d016687c9b53738a8f9aea0a6b80604a6dfa5e62181f73996016fc
● crio-dac02a8ef6525f59acbd975466184a44638f47e42b9b4f630ee4a12222c2f490.scope loaded failed failed libcontainer container dac02a8ef6525f59acbd975466184a44638f47e42b9b4f630ee4a12222c2f490
● crio-dbdb6b85ce17986500acedc751266d34259bf3efe4750b86fdd5796b3bca6536.scope loaded failed failed libcontainer container dbdb6b85ce17986500acedc751266d34259bf3efe4750b86fdd5796b3bca6536
● crio-e239973426458746255dffc8ecc759854daf3d7170f2e76aed1054dbaf87c098.scope loaded failed failed libcontainer container e239973426458746255dffc8ecc759854daf3d7170f2e76aed1054dbaf87c098
● crio-e24b1fffa08caaffa5b8fdc37c8588a2622eb78caad133be6a8c6d69113b5422.scope loaded failed failed libcontainer container e24b1fffa08caaffa5b8fdc37c8588a2622eb78caad133be6a8c6d69113b5422
● crio-e72e9e634596e4a543722fbc3681a53ef639ed2377833a6d6f0faa14831e4520.scope loaded failed failed libcontainer container e72e9e634596e4a543722fbc3681a53ef639ed2377833a6d6f0faa14831e4520
● crio-e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78.scope loaded failed failed libcontainer container e7dce5adaf9a5f7a51f33eab0b549590def1050a07182385f68e39c3c0e38c78
● crio-e801fac4e365ce94ded2182a79502df19a78534ad0aeb40bde2676767f400a9c.scope loaded failed failed libcontainer container e801fac4e365ce94ded2182a79502df19a78534ad0aeb40bde2676767f400a9c
● crio-eb5c1eed9ce3e219290b0cb8fac7ac6d8e8e169dc558126ac298acd339ebe5ae.scope loaded failed failed libcontainer container eb5c1eed9ce3e219290b0cb8fac7ac6d8e8e169dc558126ac298acd339ebe5ae
● crio-eb8792cce358372daac51faedc96d77c7ef7b45a1a2e919eb6e1cc015d16fbf0.scope loaded failed failed libcontainer container eb8792cce358372daac51faedc96d77c7ef7b45a1a2e919eb6e1cc015d16fbf0
● crio-ef103422953ab54060dec311ccc5ff6a8d54fdedb33268b63fc9d05fecd13640.scope loaded failed failed libcontainer container ef103422953ab54060dec311ccc5ff6a8d54fdedb33268b63fc9d05fecd13640
● crio-f4c155669e04f7da2ce782dce82c188b21fa29f13500d945dd539cb21b8e094a.scope loaded failed failed libcontainer container f4c155669e04f7da2ce782dce82c188b21fa29f13500d945dd539cb21b8e094a
● crio-feb7bc4dc069661a03398bbd764eef8d0a97fe31c08c3c7362c09ba789afe792.scope loaded failed failed libcontainer container feb7bc4dc069661a03398bbd764eef8d0a97fe31c08c3c7362c09ba789afe792

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

64 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
[root@kubeletl055 ~]#

This is running the jenkins/jnlp-slave:3.27-1 image; here's the events output from the cluster:

<unknown>   Normal    Scheduled       pod/default-zxpnc   Successfully assigned interns/default-zxpnc to fpig-kubeletl055
96s         Normal    Pulled          pod/default-zxpnc   Container image "jenkins/jnlp-slave:3.27-1" already present on machine
96s         Normal    Created         pod/default-zxpnc   Created container jnlp
96s         Normal    Started         pod/default-zxpnc   Started container jnlp
93s         Normal    Killing         pod/default-zxpnc   Stopping container jnlp

What we suspect is happening (unproven theory): the application terminates almost instantly on startup because the maximum number of concurrent instances has already been reached, so there's nothing left to TERM by the time crio gets around to the teardown.
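
If that theory is right, the app container should already show as exited (CONTAINER_EXITED, with a FinishedAt timestamp) well before the scope's 90-second stop timeout fires. A rough way to eyeball that while the container record still exists (assuming crictl is available on the node; the ID is a placeholder for one of the failed containers):

ctr=<failed-container-id>
crictl inspect "$ctr" | grep -E '"state"|finishedAt'
journalctl -u "crio-${ctr}.scope" --no-pager | grep -E 'Stopping|timed out|Stopped'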

haircommander commented 4 years ago

can you run with log-level debug, and get me the output of journalctl -u crio around the time it happened, plus journalctl -u crio-conmon-$ctrID.scope, where $ctrID can be any of the affected containers?
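
For example (substituting one of the failed container IDs from the list above and a time window around the failure; debug logging is assumed to come from log_level = "debug" in /etc/crio/crio.conf or the --log-level flag):

ctrID=00bdaf258e7b1849d99cd47c739ec576d92de259836956e0b612cd28ec0bca06
journalctl -u crio --since "2020-07-16 08:50:00" --until "2020-07-16 09:05:00"
journalctl -u crio-conmon-${ctrID}.scope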

wrigtim commented 4 years ago

Thanks @haircommander - have log-level="debug" running now and awaiting reproduction...

wrigtim commented 4 years ago

Now have a failure with log-level set to debug...

Jul 17 04:20:09 kubeletl051 systemd[1]: Started libcontainer container e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.
Jul 17 04:20:30 kubeletl051 systemd[1]: Stopping libcontainer container e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.
Jul 17 04:22:00 kubeletl051 systemd[1]: crio-e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.scope stopping timed out. Killing.

crio-conmon-e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.scope journal:

# journalctl -u crio-conmon-e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.scope
-- Logs begin at Thu 2020-07-16 02:02:17 CDT, end at Fri 2020-07-17 05:23:37 CDT. --
Jul 17 04:20:08 kubeletl051 systemd[1]: Started crio-conmon-e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.scope.

The debug logs are huge and I'm just making sure we can sanitize them appropriately (there's a load of other pods running here, including some internal apps). Here are a few excerpts from teardown - I notice the last entry in the log for this ID is the PodSandboxStatusResponse showing State:SANDBOX_NOTREADY?

time="2020-07-17 04:20:28.874746179-05:00" level=debug msg="request: &StopPodSandboxRequest{PodSandboxId:e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c,}" file="go-grpc-middleware/chain.go:25" id=6847ec6f-97db-4e88-99f7-2ea397a85e8b
time="2020-07-17 04:20:28.907582572-05:00" level=info msg="Got pod network &{Name:jcid-tracer-1594977600-9txl5 Namespace:network ID:e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c NetNS:/proc/642680/ns/net Networks:[{Name:k8s-pod-network Ifname:eth0}] RuntimeConfig:map[k8s-pod-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}" file="ocicni/ocicni.go:797"
time="2020-07-17 04:20:30.132808880-05:00" level=debug msg="request: &StopPodSandboxRequest{PodSandboxId:e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c,}" file="go-grpc-middleware/chain.go:25" id=25c3b586-f9aa-44bf-a9c8-7bddb0dbc3d4
time="2020-07-17 04:20:30.227968162-05:00" level=debug msg="event: \"/var/run/crio/exits/e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.9XRZN0\": CREATE" file="server/server.go:622"
time="2020-07-17 04:20:30.228037009-05:00" level=debug msg="container or sandbox exited: e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.9XRZN0" file="server/server.go:625"
time="2020-07-17 04:20:30.228108608-05:00" level=debug msg="event: \"/var/run/crio/exits/e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c.9XRZN0\": RENAME" file="server/server.go:622"
time="2020-07-17 04:20:30.228186189-05:00" level=debug msg="event: \"/var/run/crio/exits/e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c\": CREATE" file="server/server.go:622"
time="2020-07-17 04:20:30.228247318-05:00" level=debug msg="container or sandbox exited: e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c" file="server/server.go:625"
time="2020-07-17 04:20:30.228320379-05:00" level=debug msg="sandbox exited and found: e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c" file="server/server.go:643"

<SNIP>

time="2020-07-17 04:20:30.917789338-05:00" level=debug msg="response: &PodSandboxStatusResponse{Status:&PodSandboxStatus{Id:e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c,Metadata:&PodSandboxMetadata{Name:jcid-tracer-1594977600-9txl5,Uid:f4965485-efd8-4c5b-a989-f605b059b653,Namespace:network,Attempt:0,},State:SANDBOX_NOTREADY,CreatedAt:1594977609244094215,Network:&PodSandboxNetworkStatus{Ip:7.150.144.205,AdditionalIps:[]*PodIP{},},Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,},},},Labels:map[string]string{controller-uid: 2b49e725-40bf-4a92-8132-c2df2128e679,io.kubernetes.container.name: POD,io.kubernetes.pod.name: jcid-tracer-1594977600-9txl5,io.kubernetes.pod.namespace: network,io.kubernetes.pod.uid: f4965485-efd8-4c5b-a989-f605b059b653,job-name: jcid-tracer-1594977600,},Annotations:map[string]string{kubernetes.io/config.seen: 2020-07-17T04:20:08.433080636-05:00,kubernetes.io/config.source: api,kubernetes.io/limit-ranger: LimitRanger plugin set: ephemeral-storage request for container config-tracer; ephemeral-storage limit for container config-tracer,},RuntimeHandler:,},Info:map[string]string{},}" file="go-grpc-middleware/chain.go:25" id=bc715cff-ad5b-4602-aa91-b2746b3e558c

haircommander commented 4 years ago

yes, that is working correctly. The pod gets a stop request, which trickles down to the infra container, and we find that the infra container has exited, meaning the sandbox is no longer ready.

Unfortunately, this doesn't help me yet. I need the logs for the container. You can find those logs by first correlating pod id to pod name (here we have: e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c==jcid-tracer-1594977600-9txl5), then searching for the associated container ID with that pod name on the create container call. THEN I need the conmon logs of the container ID.

Basically, the pod infra container is not the one that quickly exits, it's your jnlp-slave container, and I'm trying to figure out if conmon and cri-o are properly communicating about that exit :)
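
Concretely, with the pod from your logs, the correlation is roughly (grep patterns are illustrative; the container ID is whatever shows up in the matched "Created container" line):

journalctl -u crio | grep 'Created container' | grep jcid-tracer-1594977600-9txl5
journalctl -u crio-conmon-<container-id-from-the-line-above>.scope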

wrigtim commented 4 years ago

Gotcha. So looks like we have:

level=info msg="Created container 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349: network/jcid-tracer-1594977600-9txl5/smartnic-config-tracer"

And conmon executed with args:

time="2020-07-17 04:20:09.496848392-05:00" level=debug msg="running conmon: /usr/local/bin/conmon" args="[-s -c 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349 -n k8s_smartnic-config-tracer_jcid-tracer-1594977600-9txl5_network_f4965485-efd8-4c5b-a989-f605b059b653_0 -u 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349 -r /usr/bin/runc -b /var/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata --persist-dir /var/lib/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata -p /var/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata/pidfile -P /var/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata/conmon-pidfile -l /var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log --exit-dir /var/run/crio/exits --socket-dir-path /var/run/crio --log-level debug --runtime-arg --root=/run/runc --log-size-max 2000000]" file="oci/runtime_oci.go:129"
time="2020-07-17 04:20:09.497137033-05:00" level=debug msg="Running conmon under custom slice system.slice and unitName crio-conmon-021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349.scope" file="oci/oci_linux.go:67"

That container log (start/finish) shows:

2020-07-17T04:20:10.222180316-05:00 stdout F [INFO] Script is starting
<SNIP>
2020-07-17T04:20:30.134689631-05:00 stderr F Shutting down, got signal: Terminated

Then the crio debug log (for that ContainerId) shows:

time="2020-07-17 04:20:09.472961902-05:00" level=debug msg="created container \"021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349\"" file="storage/runtime.go:283"
time="2020-07-17 04:20:09.485780470-05:00" level=debug msg="container \"021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349\" has work directory \"/var/lib/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata\"" file="storage/runtime.go:323"
time="2020-07-17 04:20:09.485936958-05:00" level=debug msg="container \"021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349\" has run directory \"/var/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata\"" file="storage/runtime.go:333"
time="2020-07-17 04:20:09.492214816-05:00" level=warning msg="requested logPath for ctr id 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349 is a relative path: smartnic-config-tracer/0.log" file="server/container_create_linux.go:459" id=686cdd37-5631-4eeb-a52e-f94afda10c1c
time="2020-07-17 04:20:09.494840260-05:00" level=debug msg="mounted container \"021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349\" at \"/var/lib/containers/storage/overlay/f2ee9871cfa3195d17b26c20c48eb291fd262647814ddf320115117c99897710/merged\"" file="storage/runtime.go:426"
time="2020-07-17 04:20:09.496848392-05:00" level=debug msg="running conmon: /usr/local/bin/conmon" args="[-s -c 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349 -n k8s_smartnic-config-tracer_jcid-tracer-1594977600-9txl5_network_f4965485-efd8-4c5b-a989-f605b059b653_0 -u 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349 -r /usr/bin/runc -b /var/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata --persist-dir /var/lib/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata -p /var/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata/pidfile -P /var/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata/conmon-pidfile -l /var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log --exit-dir /var/run/crio/exits --socket-dir-path /var/run/crio --log-level debug --runtime-arg --root=/run/runc --log-size-max 2000000]" file="oci/runtime_oci.go:129"
time="2020-07-17 04:20:09.497137033-05:00" level=debug msg="Running conmon under custom slice system.slice and unitName crio-conmon-021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349.scope" file="oci/oci_linux.go:67"
time="2020-07-17 04:20:09.829410428-05:00" level=info msg="Created container 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349: network/jcid-tracer-1594977600-9txl5/smartnic-config-tracer" file="server/container_create.go:620" id=686cdd37-5631-4eeb-a52e-f94afda10c1c
time="2020-07-17 04:20:09.829536001-05:00" level=debug msg="response: &CreateContainerResponse{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,}" file="go-grpc-middleware/chain.go:25" id=686cdd37-5631-4eeb-a52e-f94afda10c1c
time="2020-07-17 04:20:09.829906862-05:00" level=debug msg="request: &StartContainerRequest{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,}" file="go-grpc-middleware/chain.go:25" id=0b636d91-1ea8-42a4-9d81-f7b5d06aa830
time="2020-07-17 04:20:09.841821120-05:00" level=info msg="Started container 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349: network/jcid-tracer-1594977600-9txl5/smartnic-config-tracer" file="server/container_start.go:41" id=0b636d91-1ea8-42a4-9d81-f7b5d06aa830
time="2020-07-17 04:20:09.854305086-05:00" level=debug msg="request: &ExecSyncRequest{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Cmd:[/usr/bin/ssh-keyscan git.localdomain >> /root/.ssh/known_hosts],Timeout:0,}" file="go-grpc-middleware/chain.go:25" id=d5cd5135-a2e4-43e8-9103-34e56be7901d
time="2020-07-17 04:20:10.537701735-05:00" level=debug msg="request: &ContainerStatusRequest{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Verbose:false,}" file="go-grpc-middleware/chain.go:25" id=7c5ec166-f20a-4bc1-8bc6-d8b8ae6b2661
time="2020-07-17 04:20:10.537842605-05:00" level=debug msg="response: &ContainerStatusResponse{Status:&ContainerStatus{Id:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Metadata:&ContainerMetadata{Name:smartnic-config-tracer,Attempt:0,},State:CONTAINER_RUNNING,CreatedAt:1594977609812701821,StartedAt:1594977609841812882,FinishedAt:0,ExitCode:0,Image:&ImageSpec{Image:quay.localdomain/network/jtnet_smartnic_config_tracers:release-2020-07-16-17-39-43,},ImageRef:quay.localdomain/network/jtnet_smartnic_config_tracers@sha256:f7b6c9c956d1c941203ce4d2a36bff6d60717dc3598d5ebf866866a3480f9468,Reason:,Message:,Labels:map[string]string{io.kubernetes.container.name: smartnic-config-tracer,io.kubernetes.pod.name: jcid-tracer-1594977600-9txl5,io.kubernetes.pod.namespace: network,io.kubernetes.pod.uid: f4965485-efd8-4c5b-a989-f605b059b653,},Annotations:map[string]string{io.kubernetes.container.hash: 7d165558,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 30,},Mounts:[]*Mount{&Mount{ContainerPath:/root/.ssh,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/github-key,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/etc/hosts,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/etc-hosts,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/dev/termination-log,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/containers/smartnic-config-tracer/6c1d6dbe,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/var/run/secrets/kubernetes.io/serviceaccount,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/default-token-supersecret,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},},LogPath:/var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log,},Info:map[string]string{},}" file="go-grpc-middleware/chain.go:25" id=7c5ec166-f20a-4bc1-8bc6-d8b8ae6b2661
time="2020-07-17 04:20:14.116520524-05:00" level=debug msg="request: &ContainerStatusRequest{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Verbose:false,}" file="go-grpc-middleware/chain.go:25" id=951ea6cd-e6ea-46af-a09c-f74703eaef76
time="2020-07-17 04:20:14.116645180-05:00" level=debug msg="response: &ContainerStatusResponse{Status:&ContainerStatus{Id:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Metadata:&ContainerMetadata{Name:smartnic-config-tracer,Attempt:0,},State:CONTAINER_RUNNING,CreatedAt:1594977609812701821,StartedAt:1594977609841812882,FinishedAt:0,ExitCode:0,Image:&ImageSpec{Image:quay.localdomain/network/jtnet_smartnic_config_tracers:release-2020-07-16-17-39-43,},ImageRef:quay.localdomain/network/jtnet_smartnic_config_tracers@sha256:f7b6c9c956d1c941203ce4d2a36bff6d60717dc3598d5ebf866866a3480f9468,Reason:,Message:,Labels:map[string]string{io.kubernetes.container.name: smartnic-config-tracer,io.kubernetes.pod.name: jcid-tracer-1594977600-9txl5,io.kubernetes.pod.namespace: network,io.kubernetes.pod.uid: f4965485-efd8-4c5b-a989-f605b059b653,},Annotations:map[string]string{io.kubernetes.container.hash: 7d165558,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 30,},Mounts:[]*Mount{&Mount{ContainerPath:/root/.ssh,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/github-key,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/etc/hosts,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/etc-hosts,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/dev/termination-log,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/containers/smartnic-config-tracer/6c1d6dbe,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/var/run/secrets/kubernetes.io/serviceaccount,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/default-token-supersecret,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},},LogPath:/var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log,},Info:map[string]string{},}" file="go-grpc-middleware/chain.go:25" id=951ea6cd-e6ea-46af-a09c-f74703eaef76
time="2020-07-17 04:20:24.149543853-05:00" level=debug msg="request: &ContainerStatusRequest{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Verbose:false,}" file="go-grpc-middleware/chain.go:25" id=41e398d0-364f-4575-9a3c-0a16377f074d
time="2020-07-17 04:20:24.149688068-05:00" level=debug msg="response: &ContainerStatusResponse{Status:&ContainerStatus{Id:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Metadata:&ContainerMetadata{Name:smartnic-config-tracer,Attempt:0,},State:CONTAINER_RUNNING,CreatedAt:1594977609812701821,StartedAt:1594977609841812882,FinishedAt:0,ExitCode:0,Image:&ImageSpec{Image:quay.localdomain/network/jtnet_smartnic_config_tracers:release-2020-07-16-17-39-43,},ImageRef:quay.localdomain/network/jtnet_smartnic_config_tracers@sha256:f7b6c9c956d1c941203ce4d2a36bff6d60717dc3598d5ebf866866a3480f9468,Reason:,Message:,Labels:map[string]string{io.kubernetes.container.name: smartnic-config-tracer,io.kubernetes.pod.name: jcid-tracer-1594977600-9txl5,io.kubernetes.pod.namespace: network,io.kubernetes.pod.uid: f4965485-efd8-4c5b-a989-f605b059b653,},Annotations:map[string]string{io.kubernetes.container.hash: 7d165558,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 30,},Mounts:[]*Mount{&Mount{ContainerPath:/root/.ssh,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/github-key,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/etc/hosts,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/etc-hosts,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/dev/termination-log,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/containers/smartnic-config-tracer/6c1d6dbe,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/var/run/secrets/kubernetes.io/serviceaccount,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/default-token-supersecret,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},},LogPath:/var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log,},Info:map[string]string{},}" file="go-grpc-middleware/chain.go:25" id=41e398d0-364f-4575-9a3c-0a16377f074d
time="2020-07-17 04:20:28.382137428-05:00" level=debug msg="event: \"/var/run/crio/exits/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349.TKT0N0\": RENAME" file="server/server.go:622"
time="2020-07-17 04:20:28.382298175-05:00" level=debug msg="event: \"/var/run/crio/exits/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349\": CREATE" file="server/server.go:622"
time="2020-07-17 04:20:28.382393718-05:00" level=debug msg="container or sandbox exited: 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349" file="server/server.go:625"
time="2020-07-17 04:20:28.382536434-05:00" level=debug msg="container exited and found: 021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349" file="server/server.go:628"
time="2020-07-17 04:20:28.395907764-05:00" level=debug msg="Skipping status update for: &{State:{Version:1.0.0 ID:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349 Status:stopped Pid:0 Bundle:/run/containers/storage/overlay-containers/021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349/userdata Annotations:map[io.container.manager:cri-o io.kubernetes.container.hash:7d165558 io.kubernetes.container.name:smartnic-config-tracer io.kubernetes.container.restartCount:0 io.kubernetes.container.terminationMessagePath:/dev/termination-log io.kubernetes.container.terminationMessagePolicy:File io.kubernetes.cri-o.Annotations:{\"io.kubernetes.container.hash\":\"7d165558\",\"io.kubernetes.container.restartCount\":\"0\",\"io.kubernetes.container.terminationMessagePath\":\"/dev/termination-log\",\"io.kubernetes.container.terminationMessagePolicy\":\"File\",\"io.kubernetes.pod.terminationGracePeriod\":\"30\"} io.kubernetes.cri-o.ContainerID:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349 io.kubernetes.cri-o.ContainerType:container io.kubernetes.cri-o.Created:2020-07-17T04:20:09.49306136-05:00 io.kubernetes.cri-o.IP.0:7.150.144.205 io.kubernetes.cri-o.Image:274807d3d7694970bf1d1126d7f98390e5f67a4c0619723186e4954a5c605f8d io.kubernetes.cri-o.ImageName:quay.localdomain/network/jtnet_smartnic_config_tracers:release-2020-07-16-17-39-43 io.kubernetes.cri-o.ImageRef:quay.localdomain/network/jtnet_smartnic_config_tracers@sha256:f7b6c9c956d1c941203ce4d2a36bff6d60717dc3598d5ebf866866a3480f9468 io.kubernetes.cri-o.Labels:{\"io.kubernetes.container.name\":\"smartnic-config-tracer\",\"io.kubernetes.pod.name\":\"jcid-tracer-1594977600-9txl5\",\"io.kubernetes.pod.namespace\":\"network\",\"io.kubernetes.pod.uid\":\"f4965485-efd8-4c5b-a989-f605b059b653\"} io.kubernetes.cri-o.LogPath:/var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log io.kubernetes.cri-o.Metadata:{\"name\":\"smartnic-config-tracer\"} io.kubernetes.cri-o.MountPoint:/var/lib/containers/storage/overlay/f2ee9871cfa3195d17b26c20c48eb291fd262647814ddf320115117c99897710/merged io.kubernetes.cri-o.Name:k8s_smartnic-config-tracer_jcid-tracer-1594977600-9txl5_network_f4965485-efd8-4c5b-a989-f605b059b653_0 io.kubernetes.cri-o.ResolvPath:/var/run/containers/storage/overlay-containers/e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c/userdata/resolv.conf io.kubernetes.cri-o.SandboxID:e30f488f11547b1532ca7d66101d7e2dd417284e58ddb62884527d66a5c6458c io.kubernetes.cri-o.SandboxName:k8s_POD_jcid-tracer-1594977600-9txl5_network_f4965485-efd8-4c5b-a989-f605b059b653_0 io.kubernetes.cri-o.SeccompProfilePath: io.kubernetes.cri-o.Stdin:false io.kubernetes.cri-o.StdinOnce:false io.kubernetes.cri-o.TTY:false io.kubernetes.cri-o.Volumes:[{\"container_path\":\"/root/.ssh\",\"host_path\":\"/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/github-key\",\"readonly\":true},{\"container_path\":\"/etc/hosts\",\"host_path\":\"/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/etc-hosts\",\"readonly\":false},{\"container_path\":\"/dev/termination-log\",\"host_path\":\"/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/containers/smartnic-config-tracer/6c1d6dbe\",\"readonly\":false},{\"container_path\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"host_path\":\"/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/default-token-supersecret\",\"readonly\":true}] 
io.kubernetes.pod.name:jcid-tracer-1594977600-9txl5 io.kubernetes.pod.namespace:network io.kubernetes.pod.terminationGracePeriod:30 io.kubernetes.pod.uid:f4965485-efd8-4c5b-a989-f605b059b653]} Created:2020-07-17 09:20:09.812701821 +0000 UTC Started:2020-07-17 04:20:09.841812882 -0500 CDT m=+5457.725152290 Finished:2020-07-17 04:20:28.381021531 -0500 CDT ExitCode:0xc0012e8f08 OOMKilled:false Error:}" file="oci/runtime_oci.go:691"
time="2020-07-17 04:20:28.874001119-05:00" level=debug msg="request: &ContainerStatusRequest{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Verbose:false,}" file="go-grpc-middleware/chain.go:25" id=6083f8e3-4a37-47c3-8356-0a01fa694e7c
time="2020-07-17 04:20:28.874152310-05:00" level=debug msg="response: &ContainerStatusResponse{Status:&ContainerStatus{Id:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Metadata:&ContainerMetadata{Name:smartnic-config-tracer,Attempt:0,},State:CONTAINER_EXITED,CreatedAt:1594977609812701821,StartedAt:1594977609841812882,FinishedAt:1594977628381021531,ExitCode:0,Image:&ImageSpec{Image:quay.localdomain/network/jtnet_smartnic_config_tracers:release-2020-07-16-17-39-43,},ImageRef:quay.localdomain/network/jtnet_smartnic_config_tracers@sha256:f7b6c9c956d1c941203ce4d2a36bff6d60717dc3598d5ebf866866a3480f9468,Reason:Completed,Message:,Labels:map[string]string{io.kubernetes.container.name: smartnic-config-tracer,io.kubernetes.pod.name: jcid-tracer-1594977600-9txl5,io.kubernetes.pod.namespace: network,io.kubernetes.pod.uid: f4965485-efd8-4c5b-a989-f605b059b653,},Annotations:map[string]string{io.kubernetes.container.hash: 7d165558,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 30,},Mounts:[]*Mount{&Mount{ContainerPath:/root/.ssh,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/github-key,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/etc/hosts,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/etc-hosts,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/dev/termination-log,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/containers/smartnic-config-tracer/6c1d6dbe,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/var/run/secrets/kubernetes.io/serviceaccount,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/default-token-supersecret,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},},LogPath:/var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log,},Info:map[string]string{},}" file="go-grpc-middleware/chain.go:25" id=6083f8e3-4a37-47c3-8356-0a01fa694e7c
time="2020-07-17 04:20:30.918666327-05:00" level=debug msg="request: &ContainerStatusRequest{ContainerId:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Verbose:false,}" file="go-grpc-middleware/chain.go:25" id=fe766314-ba15-4c3a-9b7c-0d6875e8f022
time="2020-07-17 04:20:30.918832497-05:00" level=debug msg="response: &ContainerStatusResponse{Status:&ContainerStatus{Id:021476377f620a6dcf137bd9f859867069bdc52b444ec2f55ac56a80b80b9349,Metadata:&ContainerMetadata{Name:smartnic-config-tracer,Attempt:0,},State:CONTAINER_EXITED,CreatedAt:1594977609812701821,StartedAt:1594977609841812882,FinishedAt:1594977628381021531,ExitCode:0,Image:&ImageSpec{Image:quay.localdomain/network/jtnet_smartnic_config_tracers:release-2020-07-16-17-39-43,},ImageRef:quay.localdomain/network/jtnet_smartnic_config_tracers@sha256:f7b6c9c956d1c941203ce4d2a36bff6d60717dc3598d5ebf866866a3480f9468,Reason:Completed,Message:,Labels:map[string]string{io.kubernetes.container.name: smartnic-config-tracer,io.kubernetes.pod.name: jcid-tracer-1594977600-9txl5,io.kubernetes.pod.namespace: network,io.kubernetes.pod.uid: f4965485-efd8-4c5b-a989-f605b059b653,},Annotations:map[string]string{io.kubernetes.container.hash: 7d165558,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 30,},Mounts:[]*Mount{&Mount{ContainerPath:/root/.ssh,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/github-key,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/etc/hosts,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/etc-hosts,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/dev/termination-log,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/containers/smartnic-config-tracer/6c1d6dbe,Readonly:false,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},&Mount{ContainerPath:/var/run/secrets/kubernetes.io/serviceaccount,HostPath:/var/lib/kubelet/pods/f4965485-efd8-4c5b-a989-f605b059b653/volumes/kubernetes.io~secret/default-token-supersecret,Readonly:true,SelinuxRelabel:false,Propagation:PROPAGATION_PRIVATE,},},LogPath:/var/log/pods/network_jcid-tracer-1594977600-9txl5_f4965485-efd8-4c5b-a989-f605b059b653/smartnic-config-tracer/0.log,},Info:map[string]string{},}" file="go-grpc-middleware/chain.go:25" id=fe766314-ba15-4c3a-9b7c-0d6875e8f022

wrigtim commented 4 years ago

@haircommander Does that help at all? Happy to collect more debug data - given the pod turnover and the volume of logs on these nodes, are there particular messages we should be searching for (all logs are captured and indexed) to expose the issue?

haircommander commented 4 years ago

sorry, this fell off my radar

The good news is I think everything is working normally, minus the leaking systemd units. That has been fixed upstream in https://github.com/cri-o/cri-o/pull/3859, which won't be available until cri-o 1.19.

That said, if your operating system has systemd >= 236 and runc of at least 1.0.0-rc91, you can specify the container annotation org.systemd.property.CollectMode=inactive-or-failed

that will tell systemd to clean up the failed units
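
Roughly, that could look like the following on the pod spec (a sketch only: it assumes the annotation is passed through to the OCI spec on your versions, and that the value needs runc's systemd-property string quoting; verify both before relying on it):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: collectmode-example            # hypothetical name
  annotations:
    org.systemd.property.CollectMode: "'inactive-or-failed'"
spec:
  restartPolicy: Never
  containers:
  - name: jnlp
    image: jenkins/jnlp-slave:3.27-1
EOF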

Is that solution sufficient?

wrigtim commented 4 years ago

Thanks @haircommander. This cluster is currently on RHEL7.7, and so we don't have CollectMode available...

We can work around this with some hackery for now. I'm still curious why it enters failed state (needing the CollectMode adjustment in #3859) rather than terminating "correctly"?
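
For reference, one blunt stopgap (a sketch only, not necessarily the workaround used here) is to periodically clear the failed units, e.g. from cron:

systemctl reset-failed 'crio-*.scope'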

haircommander commented 4 years ago

maybe it's the application not handling sigterm nicely? https://serverfault.com/questions/695849/services-remain-in-failed-state-after-stopped-with-systemctl
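
If it does turn out to be the entrypoint swallowing SIGTERM, a minimal illustration of a wrapper that forwards the signal (purely hypothetical, not taken from the jnlp image) would be:

#!/bin/sh
# forward SIGTERM to the child so the container exits before systemd's 90s stop timeout
term() { kill -TERM "$child" 2>/dev/null; wait "$child"; }
trap term TERM
"$@" &
child=$!
wait "$child"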

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 2 years ago

Closing this issue since it had no activity in the past 90 days.