containers / gvisor-tap-vsock

Networking issues #296

Open cfergeau opened 1 year ago

cfergeau commented 1 year ago

Sometimes, after a while, podman machine networking or crc networking stops working. There is no clear reproducer, but this has been hit by people working on podman-desktop, by some crc users, ... The latest such issue is https://github.com/containers/podman/issues/20639. The common symptom is that ssh access to the VM does not work. In #20639, `modprobe -r virtio-net && modprobe virtio-net` gets the network back up.

I'm currently working with Florent, who filed #20639 and can reproduce it several times per week, to get some traces through dlv and see if they give a hint as to what's going on. This could be a gvproxy bug as much as a kernel or qemu bug.

Regarding the other similar bugs which have been filed/mentioned in the past, they may have the same root cause, or not. They happened on Windows + hyperv, on macos + vfkit, and I think even on linux + libvirt/qemu.

#20639 was macos + qemu. This means it happens both with gvproxy and with crc daemon + vm process running in the VM.

There were hints of a crc daemon crash/restart in the linux + qemu case, but not in #20639, which is why I'm thinking there could be different issues.

cfergeau commented 1 year ago

Regarding #20639, I asked Florent to collect traces with `dlv`:

(dlv) trace /github.com\/containers\/gvisor-tap-vsock\/*/

When the tracing is done, it's possible to detach `dlv` from the process by pressing `ctrl+c` and answering 'no' when delve asks if the process should be killed.

cfergeau commented 1 year ago

Regarding https://github.com/containers/podman/issues/20639, one suggestion from @n1hility was to try using vm/gvforwarder in the VM and send the network traffic over vsock rather than directly over virtio-net, to see if the bug can still be reproduced.