Open basvdlei opened 8 years ago
I was able to reproduce this failure with the given script on AWS with a t2.micro instance in us-west-2 (ami-6f1eb80f).
I tested this against both the 4.7.3 and 4.9.0 kernels and it failed in the same way. It appears as though the panic is happening on the deletion of the netns (sudo ip netns del ${ns}
).
Adding in delays around the namespace deletion prevents the panic, so this looks like a race condition rather than resource exhaustion. I'm in the process of further whittling down the script to figure out a minimal repro case.
This appears to be the minimal case:
#!/bin/bash
set -x
START=${1-0}
END=${2-0}
INTERFACE=${3-ens192}
while true; do
for i in $(seq $START $END); do
ns=test${i}
vxlanif=vxlan${i}
sudo ip netns add $ns
sudo ip link add $vxlanif mtu 1450 type vxlan id $((256+$i)) dstport 4789 learning proxy l2miss l3miss dev $INTERFACE
sudo ip link set dev $vxlanif netns $ns
done
sleep 5
for i in $(seq $START $END); do
ns=test${i}
sudo ip netns del $ns
done
done
Run with:
./crash-with-iproute2.sh 0 49 eth0 &
sleep 0.1
./crash-with-iproute2.sh 50 99 eth0 &
I'm not sure at this point whether or not there is any significance to the vxlan device being a part of the namespace, or if it merely slows down the teardown enough to allow two deletions to run concurrently.
Nice, this script really hits the bug consistently/fast. It also rules out any bridge involvement. My initial suspicion that it was VMware only was also wrong, I've now confirmed panics on KVM and VMWare.
I kept wondering why we can hit this bug so reliable, when I can't seem find any other reported cases. So I tried reproducing the panic on Debian Jessie with kernel 3.18 and 4.7.0 (from backports) without any luck. But an older CoreOS version v1122.3.0 (also with kernel 4.7.0) crashes immediately
Then I got the idea to disable systemd-networkd and that seems to make the bug not hit any more!
sudo systemctl stop systemd-networkd.service
sudo systemctl mask systemd-networkd.service
Then I got the idea to disable systemd-networkd and that seems to make the bug not hit any more!
Very interesting.
1262.0.0 added the ability to tell networkd not to manage certain interfaces (on previous versions, networkd would attempt to manage every interface, even if there was no configuration). I attempted to reproduce this failure on that release with the following config, but was unable to do so (since networkd is no longer managing those interfaces).
[Match]
Name=vxlan*
[Link]
Unmanaged=yes
Could you give that release a shot and see if it works around the kernel bug for you as well?
Tested the minimal case script on a node on VMWare upgraded to 1262.0.0 with the above network unit installed and that didn't cause a panic any more. So I got back to the initial test case with Docker overlay (crash-with-docker.sh), that did still cause a panic.
Docker does rename the vxlan interfaces (to 'vx-') though, so I've changed the config to match on the driver name instead:
[Match]
Driver=vxlan
[Link]
Unmanaged=yes
Running the crash-with-docker.sh test without panics for about an hour now. I'll keep it running for a while, but I'm hopeful about this workaround.
Got the same issue with either stable or alpha by having a docker service failing in loop. So container get spawned/died back to back. I can hit the kp in few secs.
Note that I'm running vbox. I will try the mitigation above.
In my case the
[Match]
Driver=vxlan
[Link]
Unmanaged=yes
Didn't helped but disabling systemd-networkd
and maks it I no longer have the issue.
Otherwise I have this https://github.com/coreos/bugs/files/600438/20161114-Panic-3.txt kernel panic. I will try to disable management of bridges as well to see if it helps.
Looks like I had to restart the service first... This looks good for now.
Issue Report
Bug
We are getting kernel panics when experimenting with Docker's overlay network on some of our CoreOS machines running on VMWare.
CoreOS Version
Environment
VMWare ESXi
Expected Behavior
Running/starting/stopping docker containers using overlay network to not crash the system.
Actual Behavior
Kernel panic's like: -
BUG: unable to handle kernel NULL pointer dereference at 0000000000000045
-BUG: unable to handle kernel paging request at ffffffff00000158
Full kernel panic trace::
Reproduction Steps
Reproduce using Docker
/etc/systemd/system/docker.service.d/docker-ops.conf
crash-with-docker.sh:
Reproduce with iproute2
But it can also be reproduces by just using the iproute2 tools where I try to reproduce some of the steps that Docker's libnetwork uses to setup a overlay network:
crash-with-iproute2.sh:
Other Information
I've not had much luck trying to reproduce this in Vagrant (kvm or virtualbox), which suggest the problem is VMware specific.