argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

[Bug] pns executor cannot retrieve correct containerId with runtime cri-containerd #4302

Closed cy-zheng closed 4 years ago

cy-zheng commented 4 years ago

Summary

What happened/what you expected to happen?

When I use the pns executor with the containerd runtime, argo stop cannot stop the workflow pod. The root cause is that containerd's cgroup path layout differs from Docker's, so the pns executor fails to parse the container ID from the /proc/{pid}/cgroup file.

Diagnostics

What Kubernetes provider are you using?

bare metal v1.13.12 with containerd at v1.3.6

What version of Argo Workflows are you running?

v2.11.3

/proc/{pid}/cgroup in docker container

root@workflow-4fd9de820af248518e1a168fc79e29c6:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:22 ?        00:00:00 /pause
root         6     0  0 09:22 ?        00:00:00 argoexec wait
root        19     0  0 09:22 ?        00:00:00 /bin/sh -c sleep 86400
root        24    19  0 09:22 ?        00:00:00 sleep 86400
root        27     0  0 09:22 pts/0    00:00:00 bash
root        36    27  0 09:23 pts/0    00:00:00 ps -ef
root@workflow-4fd9de820af248518e1a168fc79e29c6:/# 
root@workflow-4fd9de820af248518e1a168fc79e29c6:/# cat /proc/19/cgroup 
11:blkio:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
10:cpuset:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
9:freezer:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
8:memory:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
7:pids:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
6:devices:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
5:net_cls,net_prio:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
4:hugetlb:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
3:cpu,cpuacct:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
2:perf_event:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679
1:name=systemd:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679

/proc/{pid}/cgroup in containerd

root@workflow-c1b7ee13dda64548820efb49f14d64e2:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:30 ?        00:00:00 /pause
root         6     0  0 09:30 ?        00:00:00 argoexec wait
root        30     0  0 09:30 ?        00:00:00 /bin/sh -c sleep 86400
root        35    30  0 09:30 ?        00:00:00 sleep 86400
root        37     0  0 09:30 pts/0    00:00:00 bash
root        42    37  0 09:30 pts/0    00:00:00 ps -ef
root@workflow-c1b7ee13dda64548820efb49f14d64e2:/# 
root@workflow-c1b7ee13dda64548820efb49f14d64e2:/# cat /proc/30/cgroup 
11:blkio:/system.slice/containerd.service/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
10:memory:/system.slice/containerd.service/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
9:cpuset:/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
8:freezer:/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
7:net_cls,net_prio:/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
6:hugetlb:/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
5:devices:/system.slice/containerd.service/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
4:perf_event:/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
3:pids:/system.slice/containerd.service/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
2:cpu,cpuacct:/system.slice/containerd.service/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620
1:name=systemd:/system.slice/containerd.service/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620

Paste the logs from the pns executor:

root@vai-adsimulator-k8s-autoscaling-stage-master1:/home/CORP/chenyu.zheng# kubectl logs workflow-9fff344428454a82b821ccf0b5b2090e wait -n workflow-test

time="2020-10-16T06:21:28.653Z" level=info msg="Starting Workflow Executor" version=v2.11.3
time="2020-10-16T06:21:28.656Z" level=info msg="Creating PNS executor (namespace: workflow-test, pod: workflow-9fff344428454a82b821ccf0b5b2090e, pid: 6, hasOutputs: false)"
time="2020-10-16T06:21:28.656Z" level=info msg="Executor (version: v2.11.3, build_date: 2020-10-07T22:55:41Z) initialized (pod: workflow-test/workflow-9fff344428454a82b821ccf0b5b2090e) with template:\n{\"name\":\"main\",\"arguments\":{},\"inputs\":{},\"outputs\":{},\"metadata\":{\"annotations\":{\"cluster-autoscaler.kubernetes.io/safe-to-evict\":\"false\",\"instance-name\":\"las-mesos-agent-s075_39eae123b9ab458f85ad60cc06d6e798\"},\"labels\":{\"bucket-id\":\"58\",\"capos_id\":\"workflow-9fff344428454a82b821ccf0b5b2090e\",\"pod-relates-workflow\":\"true\"}},\"container\":{\"name\":\"main\",\"image\":\"docker.io/zcy19941015/pytest:v0.2\",\"command\":[\"/bin/sh\",\"-c\",\"sleep 86400\"],\"env\":[{\"name\":\"CAPOS_ID\",\"value\":\"workflow-9fff344428454a82b821ccf0b5b2090e\"},{\"name\":\"NAMESPACE\",\"value\":\"workflow-test\"}],\"resources\":{}}}"
time="2020-10-16T06:21:28.656Z" level=info msg="Waiting on main container"
time="2020-10-16T06:21:29.627Z" level=info msg="main container started with container ID: 8e5e9b55b790d3f797c7d7e0519e5b0e500cbbc03c5f85b1c196957fa8d47f5e"
time="2020-10-16T06:21:29.627Z" level=info msg="Starting annotations monitor"
time="2020-10-16T06:21:29.630Z" level=info msg="containerID kubepods-besteffort-podd2c0bbdb_0f77_11eb_b36d_02623cf324c8.slice:cri-containerd:8e5e9b55b790d3f797c7d7e0519e5b0e500cbbc03c5f85b1c196957fa8d47f5e mapped to pid 27"
time="2020-10-16T06:21:29.630Z" level=warning msg="Ignoring wait failure: Failed to determine pid for containerID 8e5e9b55b790d3f797c7d7e0519e5b0e500cbbc03c5f85b1c196957fa8d47f5e: container may have exited too quickly. Process assumed to have completed"
time="2020-10-16T06:21:29.630Z" level=info msg="Main container completed"
time="2020-10-16T06:21:29.630Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-10-16T06:21:29.630Z" level=info msg="Capturing script exit code"
time="2020-10-16T06:21:29.630Z" level=info msg="Getting exit code of 8e5e9b55b790d3f797c7d7e0519e5b0e500cbbc03c5f85b1c196957fa8d47f5e"
time="2020-10-16T06:21:29.630Z" level=info msg="Starting deadline monitor"
time="2020-10-16T06:21:29.630Z" level=info msg="Deadline monitor stopped"
time="2020-10-16T06:21:29.630Z" level=info msg="/argo/podmetadata/annotations updated"
time="2020-10-16T06:21:29.633Z" level=info msg="No output parameters"
time="2020-10-16T06:21:29.633Z" level=info msg="No output artifacts"
time="2020-10-16T06:21:29.633Z" level=info msg="Killing sidecars"
time="2020-10-16T06:21:29.635Z" level=info msg="Alloc=6720 TotalAlloc=14913 Sys=70592 NumGC=5 Goroutines=8"
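
The mismatch in the log above can be reproduced in isolation. The following is a minimal sketch, not the executor's actual code: it assumes the container ID is taken as the text after the last "/" of a cgroup line, which is consistent with the "mapped to pid 27" log entry. The helper name lastPathSegment is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// lastPathSegment is a hypothetical helper that returns the text after the
// final "/" in a /proc/<pid>/cgroup line. It stands in for the assumed
// parsing step; it is not copied from pns.go.
func lastPathSegment(cgroupLine string) string {
	parts := strings.Split(cgroupLine, "/")
	return parts[len(parts)-1]
}

func main() {
	// Lines taken from the docker and containerd dumps in this issue.
	dockerLine := "1:name=systemd:/kubepods/besteffort/pod11f05807-0f91-11eb-b66a-023a093e2aa4/1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679"
	containerdLine := "1:name=systemd:/system.slice/containerd.service/kubepods-besteffort-pod30556cce_0f92_11eb_b36d_02623cf324c8.slice:cri-containerd:c688c856b21cfb29c1dbf6c14793435e44a1299dfc12add33283239bffed2620"

	// With docker the last segment is the bare container ID.
	fmt.Println(lastPathSegment(dockerLine))
	// With containerd the last segment still carries the
	// "<pod slice>:cri-containerd:" prefix, so it will not equal the bare ID.
	fmt.Println(lastPathSegment(containerdLine))
}
```

Under that assumption, the containerd result never equals the bare ID the executor is waiting on, which would explain the "Failed to determine pid for containerID" warning.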

Proposal

Add the code below at https://github.com/argoproj/argo/blob/v2.11.3/workflow/executor/pns/pns.go#L407

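// containerd cgroup paths end in "<pod slice>:cri-containerd:<container id>";
// keep only the part after the last ":" to get the bare container ID.
if strings.Contains(containerID, "cri-containerd") {
    strList := strings.Split(containerID, ":")
    containerID = strList[len(strList)-1]
}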
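
For trying the proposal outside the executor, here is a self-contained sketch that wraps the same logic in a function. The name normalizeContainerID and the surrounding main program are assumptions made for illustration; only the body of the if statement comes from the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeContainerID is a hypothetical wrapper around the proposed fix.
// If the parsed value contains "cri-containerd", it keeps only the text
// after the last ":"; otherwise it returns the value unchanged.
func normalizeContainerID(containerID string) string {
	if strings.Contains(containerID, "cri-containerd") {
		strList := strings.Split(containerID, ":")
		return strList[len(strList)-1]
	}
	return containerID
}

func main() {
	// Value as it appears in the "mapped to pid 27" log line of this issue.
	fromContainerd := "kubepods-besteffort-podd2c0bbdb_0f77_11eb_b36d_02623cf324c8.slice:cri-containerd:8e5e9b55b790d3f797c7d7e0519e5b0e500cbbc03c5f85b1c196957fa8d47f5e"
	// Docker-style value from the first cgroup dump; it has no prefix to strip.
	fromDocker := "1b3872262e6e9ad77b5fabe3139023d10c58f370499055765f39b908d07ec679"

	fmt.Println(normalizeContainerID(fromContainerd))
	fmt.Println(normalizeContainerID(fromDocker))
}
```

The check is deliberately narrow: it only touches values that contain the "cri-containerd" marker, so docker-style IDs pass through untouched. Other runtimes may use different separators and would need their own handling.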

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

cy-zheng commented 4 years ago

BTW, do you have any plans to implement a CRI executor that talks to the runtime through the standard CRI API? That would be more graceful and compatible with different CRI backends.

alexec commented 4 years ago

Would you like to submit a PR to fix?

jessesuen commented 4 years ago

@cy-zheng looks like you have the right idea for the fix. We'd love your assistance with this!

cy-zheng commented 4 years ago

> @cy-zheng looks like you have the right idea for the fix. We'd love your assistance with this!

OK, I will create a PR for this issue :)