deislabs / containerd-wasm-shims

containerd shims for running WebAssembly workloads in Kubernetes
Apache License 2.0
311 stars 48 forks source link

Spin Pod stuck at the Terminating state #207

Closed rgl closed 5 months ago

rgl commented 8 months ago

I've created my own independent example tinygo spin application at https://github.com/rgl/spin-http-go-example and updated my k3s playground at https://github.com/rgl/k3s-vagrant to deploy the spin-http-go-example example.

Creating and using the Pod works fine, except deleting the Pod. Although the delete request is accepted by kubernetes, the Pod is stuck at the Terminating state.

It seems something is broken, but I'm not sure what.

I've added support for containerd-shim-spin-v2 0.10.0 (spin 2.0.1 webassembly runtime) in https://github.com/rgl/k3s-vagrant/commit/e5df418f6289634830389a1c770707f1239c24e0

Can you please shed some light on how to troubleshoot this? kubectl describe on the pod, does not seem to show anything useful.

Mossaka commented 8 months ago

It might related to https://github.com/containerd/runwasi/issues/418

Can you please shed some light on how to troubleshoot this?

The best way I am aware of is to examine the containerd logs for the spin shim.

If you are using the spin shim in k3s, you can refer to this guide on how to find the containerd logs

rgl commented 8 months ago

Thank you for the pointers!

I'm starting the pod as described at https://github.com/rgl/k3s-vagrant/blob/master/example-spin.yml

Although I can submit kubectl delete for the pod, the following command never seems to return:

$ kubectl delete pod/example-spin-654dccd659-8dskm 
pod "example-spin-654dccd659-8dskm" deleted

Here's what the tail -f of the containerd logs looks like:

# tail -f /var/lib/rancher/k3s/agent/containerd/containerd.log
time="2024-02-06T09:22:42.068337280Z" level=info msg="StopContainer for \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\" with timeout 30 (s)"
time="2024-02-06T09:22:42.411348488Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:example-spin-654dccd659-8tltx,Uid:d479a9b2-e1b6-4509-b35d-8f89aea4f16e,Namespace:default,Attempt:0,}"

map[string]interface {}{"cniVersion":"1.0.0", "forceAddress":true, "hairpinMode":true, "ipMasq":false, "ipam":map[string]interface {}{"ranges":[][]map[string]interface {}{[]map[string]interface {}{map[string]interface {}{"subnet":"10.12.1.0/24"}}}, "routes":[]types.Route{types.Route{Dst:net.IPNet{IP:net.IP{0xa, 0xc, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0x0, 0x0}}, GW:net.IP(nil)}}, "type":"host-local"}, "isDefaultGateway":true, "isGateway":true, "mtu":(*uint)(0xc0000a8738), "name":"cbr0", "type":"bridge"}
delegateAdd: netconf sent to delegate plugin:
{"cniVersion":"1.0.0","forceAddress":true,"hairpinMode":true,"ipMasq":false,"ipam":{"ranges":[[{"subnet":"10.12.1.0/24"}]],"routes":[{"dst":"10.12.0.0/16"}],"type":"host-local"},"isDefaultGateway":true,"isGateway":true,"mtu":1500,"name":"cbr0","type":"bridge"}time="2024-02-06T09:22:42.488650624Z" level=info msg="server listen started"

time="2024-02-06T09:22:42.488707028Z" level=info msg="server started"

time="2024-02-06T09:22:42.4887162Z" level=info msg="Shim successfully started, waiting for exit signal..."

time="2024-02-06T09:22:42.494708048Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:example-spin-654dccd659-8tltx,Uid:d479a9b2-e1b6-4509-b35d-8f89aea4f16e,Namespace:default,Attempt:0,} returns sandbox id \"54a9538246e79366cc0e00f2986df53cd91adc6b225a517473ae13bd25e6ece2\""
time="2024-02-06T09:22:42.496768712Z" level=info msg="CreateContainer within sandbox \"54a9538246e79366cc0e00f2986df53cd91adc6b225a517473ae13bd25e6ece2\" for container &ContainerMetadata{Name:example,Attempt:0,}"
time="2024-02-06T09:22:42.507516899Z" level=info msg="CreateContainer within sandbox \"54a9538246e79366cc0e00f2986df53cd91adc6b225a517473ae13bd25e6ece2\" for &ContainerMetadata{Name:example,Attempt:0,} returns container id \"dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141\""
time="2024-02-06T09:22:42.508273480Z" level=info msg="StartContainer for \"dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141\""
time="2024-02-06T09:22:42.517526823Z" level=info msg="manifest is not in WASM OCI image format"

time="2024-02-06T09:22:42.518658869Z" level=info msg="cgroup manager V2 will be used"

time="2024-02-06T09:22:42.518766618Z" level=warn msg="Controller rdma is not yet implemented."

time="2024-02-06T09:22:42.518784762Z" level=warn msg="Controller misc is not yet implemented."

time="2024-02-06T09:22:42.539281631Z" level=warn msg="Controller rdma is not yet implemented."

time="2024-02-06T09:22:42.539340343Z" level=warn msg="Controller misc is not yet implemented."

time="2024-02-06T09:22:42.545786429Z" level=info msg="close_range; preserve_fds=0"

time="2024-02-06T09:22:42.972849735Z" level=warn msg="intermediate process already reaped"

time="2024-02-06T09:22:42.976602376Z" level=info msg="starting instance: dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141"

time="2024-02-06T09:22:42.977461079Z" level=info msg="calling start function"

time="2024-02-06T09:22:42.977593832Z" level=info msg="setting up wasi"

time="2024-02-06T09:22:42.983485892Z" level=info msg="StartContainer for \"dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141\" returns successfully"
time="2024-02-06T09:22:44.069400677Z" level=error msg="get state for 0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919" error="context deadline exceeded: unknown"
time="2024-02-06T09:22:44.069484079Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:22:44.069543062Z" level=info msg="Stop container \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\" with signal terminated"
time="2024-02-06T09:22:46.200462576Z" level=info msg="stats: StatsRequest { id: "dfd08c5d510679cffdb312f1455d1c9f140675b5cb170e9d7d5f840b4f30f141", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }"

time="2024-02-06T09:22:58.77219967Z" level=info msg="instantiate_pre;"

time="2024-02-06T09:22:58.773132879Z" level=info msg=" >>> running spin trigger"

time="2024-02-06T09:22:58.773738621Z" level=info msg=" >>> notifying main thread we are about to start"

time="2024-02-06T09:22:58.774493624Z" level=info msg="Serving http://0.0.0.0:80"

time="2024-02-06T09:23:40.806373573Z" level=error msg="StopContainer for \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\" failed" error="failed to kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\": context canceled: unknown"
time="2024-02-06T09:23:40.808160649Z" level=info msg="StopPodSandbox for \"e23b2fe57bcd33134ac414a091e42b25c1aeefd0887a3753720ba640939f9fd9\""
time="2024-02-06T09:23:42.809070520Z" level=error msg="get state for ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f" error="context deadline exceeded: unknown"
time="2024-02-06T09:23:42.809150816Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:23:42.809185282Z" level=info msg="Kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\""
time="2024-02-06T09:25:12.068440580Z" level=error msg="StopContainer for \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\" failed" error="failed to stop container \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\": context deadline exceeded: unknown"
time="2024-02-06T09:25:12.069488406Z" level=info msg="StopPodSandbox for \"cc4cd0cb2be413f03b06ae3d52c18dbcfd4992672f6a10af8edf4ca91ff25474\""
time="2024-02-06T09:25:14.070067749Z" level=error msg="get state for 0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919" error="context deadline exceeded: unknown"
time="2024-02-06T09:25:14.070151733Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:25:14.070191783Z" level=info msg="Kill container \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\""
time="2024-02-06T09:25:40.809165923Z" level=error msg="StopPodSandbox for \"e23b2fe57bcd33134ac414a091e42b25c1aeefd0887a3753720ba640939f9fd9\" failed" error="failed to stop container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\": failed to kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\": context deadline exceeded: unknown"
time="2024-02-06T09:25:40.809863923Z" level=info msg="StopPodSandbox for \"91ce1fc7c9ae5f571bab103e695afea48228ed99ee6188c767de368cdd8bca31\""
time="2024-02-06T09:25:40.810007290Z" level=info msg="Container to stop \"cfeb2a380219c92f3d0fab2d57cad5542812736e185433888a2e5f1ad4adba2b\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2024-02-06T09:25:40.840434562Z" level=info msg="TearDown network for sandbox \"91ce1fc7c9ae5f571bab103e695afea48228ed99ee6188c767de368cdd8bca31\" successfully"
time="2024-02-06T09:25:40.840480615Z" level=info msg="StopPodSandbox for \"91ce1fc7c9ae5f571bab103e695afea48228ed99ee6188c767de368cdd8bca31\" returns successfully"
time="2024-02-06T09:25:41.758165380Z" level=info msg="StopContainer for \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\" with timeout 30 (s)"
time="2024-02-06T09:25:43.759072133Z" level=error msg="get state for ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f" error="context deadline exceeded: unknown"
time="2024-02-06T09:25:43.759133691Z" level=warning msg="unknown status" status=0
time="2024-02-06T09:25:43.759198951Z" level=info msg="Skipping the sending of signal terminated to container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\" because a prior stop with timeout>0 request already sent the signal"
time="2024-02-06T09:26:13.759386968Z" level=info msg="Kill container \"ffd2781bd75f19e5157ac4762314693a4f95dbe6cc43128c28b50f117658fb5f\""

After that the logs all look the same, it looping trying/failing to kill the container.

Mossaka commented 8 months ago

time="2024-02-06T09:25:12.068440580Z" level=error msg="StopContainer for \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\" failed" error="failed to stop container \"0c2c8eabdfa48729d488297075ac089ba3e5d88580801cf6a0431316351cb919\": context deadline exceeded: unknown"

This log looks pretty interesting. What version of containerd are you using?

rgl commented 8 months ago

I'm using the version bundled with k3s from https://github.com/k3s-io/k3s/releases/tag/v1.28.5+k3s1. At that page, there's a Embedded Component Versions section, which points to Containerd v1.7.11-k3s2.

alexis-langlet commented 5 months ago

Hello, I'm experiencing the same issue, is there a fix planned?

rgl commented 5 months ago

I think containerd-shim-spin was moved out of this repo into https://github.com/spinkube/containerd-shim-spin.

I also think this issue was fixed in https://github.com/spinkube/containerd-shim-spin/pull/23 but I didn't yet have a chance to try it in k3s.

I did have a chance to try https://github.com/spinkube/containerd-shim-spin/releases/tag/v0.14.1 in talos at https://github.com/rgl/terraform-libvirt-talos/commit/1c32bb7e43331abdf2ab22254372968e40889921 and there, the issue reported here no longer happens! :-)

alexis-langlet commented 5 months ago

Ok nice, thank you

rgl commented 5 months ago

I've just upgraded to https://github.com/spinkube/containerd-shim-spin/releases/tag/v0.14.1 in k3s at https://github.com/rgl/k3s-vagrant/commit/5b850810710b3fcedb4b762e5efde9b721b12a80 and the issue reported here no longer happens! :-)