cfergeau commented 1 year ago

Sometimes, after a while, podman machine networking or crc networking stops working. There is no clear reproducer, but this was hit by people working on podman-desktop, by some crc users, ... The latest such issue is https://github.com/containers/podman/issues/20639. The common symptom is that ssh access to the VM does not work. In #20639, `modprobe -r virtio-net && modprobe virtio-net` gets the network back up.
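For reference, the workaround from #20639 can be wrapped in a small helper. This is only a sketch: it assumes you still have some way into the guest other than ssh (for example a console), that you run it as root there, and the `DRY_RUN` variable and the `run`/`reload_virtio_net` names are made up for this example.

```shell
#!/bin/sh
# Sketch of the #20639 workaround: reload the virtio-net driver in the guest.
# DRY_RUN is a hypothetical switch for this example: when set, commands are
# printed instead of executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Unload and reload virtio-net; the reload only happens if the unload worked.
reload_virtio_net() {
    run modprobe -r virtio-net && run modprobe virtio-net
}

# Usage (inside the VM, as root):
#   reload_virtio_net
```

Setting `DRY_RUN=1` before calling `reload_virtio_net` only prints the two `modprobe` commands, which is handy to check what would be run.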

I'm currently working with Florent, who filed #20639 and can reproduce it several times per week, to get some traces through dlv and see if they give a hint as to what's going on. This could be a gvproxy bug as much as a kernel or qemu bug.

Regarding the other similar bugs which have been filed/mentioned in the past, they may or may not have the same root cause. They happened on Windows + Hyper-V, on macOS + vfkit, and I think even on Linux + libvirt/qemu.

#20639 was macOS + qemu. This means this happens both with `gvproxy` and with `crc daemon` + the `vm` process running in the VM.

There were hints of a crc daemon crash/restart in the linux + qemu case, but not in #20639, which is why I'm thinking there could be different issues.

cfergeau commented 1 year ago

Regarding #20639, I asked Florent to:

- upgrade gvproxy to the latest released version, as the one shipped by podman 4.7.2 is old (0.5.0 vs 0.7.1)
- extract the binary for his platform, as delve does not support universal macOS binaries: `lipo -extract arm64 -output gvproxy-darwin-arm64 ./gvproxy-darwin`
- replace the gvproxy binary used by podman with this `gvproxy-darwin-arm64` binary
- install delve: `brew install delve`
- get some traces when the issue occurs:
```
$ dlv attach $(pgrep gvproxy)
(dlv) trace /github.com\/containers\/gvisor-tap-vsock\/*/
```

When the tracing is done, it's possible to detach `dlv` from the process by pressing `ctrl+c` and answering 'no' when delve asks if the process should be killed.
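The preparation steps above could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes the universal `gvproxy-darwin` release binary has already been downloaded, the location of the gvproxy binary used by podman is not hardcoded because it depends on the install (the caller passes it in), and `DRY_RUN`, `run` and `prepare_gvproxy_for_dlv` are names made up for this example.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical switch for this example: when set, commands are
# printed instead of executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: universal gvproxy binary from the release (e.g. ./gvproxy-darwin)
# $2: path of the gvproxy binary used by podman on this machine
#     (install-specific, so it is supplied by the caller)
prepare_gvproxy_for_dlv() {
    universal_bin="$1"
    podman_gvproxy="$2"
    # delve does not support universal binaries, so keep only the arm64 slice
    run lipo -extract arm64 -output gvproxy-darwin-arm64 "$universal_bin" &&
    # swap in the single-architecture binary
    run cp gvproxy-darwin-arm64 "$podman_gvproxy" &&
    run brew install delve
}

# Usage:
#   prepare_gvproxy_for_dlv ./gvproxy-darwin /path/to/podman/gvproxy
```

With `DRY_RUN=1` set, the function only prints the three commands, so the destination path can be double-checked before overwriting anything. Attaching with `dlv` and setting the trace is left as the manual step shown above.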

cfergeau commented 1 year ago

Regarding https://github.com/containers/podman/issues/20639, one suggestion from @n1hility was to try using vm/gvforwarder in the VM and send the network traffic over vsock rather than directly over virtio-net, to see if the bug can still be reproduced.

containers / gvisor-tap-vsock

Networking issues #296
