Fun times ahead. ^^"
Thanks. It's pretty safe to assume that it crashed here. The line is saying "throw a panic if len is less than data_len". The fact that data_len was nonzero (as both fields are unsigned) implies that the packet was paged. Whether a packet is paged or not depends on random factors, which unfortunately means that this is not likely going to be as deterministic as 242. But at least the bug can be traced back to very few packet fields.
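For reference, the check in question lives in the kernel's skbuff.h. Paraphrased from the vanilla 4.4 sources (the Ubuntu -generic build might differ slightly), it looks roughly like this:

/* Paraphrased from include/linux/skbuff.h (vanilla Linux 4.4); not verbatim.
 * skb->len is the total packet length, and skb->data_len is the portion of
 * it that lives in paged fragments rather than in the linear buffer. */
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
	skb->len -= len;
	BUG_ON(skb->len < skb->data_len);	/* the line that panics */
	return skb->data += len;
}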
When did you update to 3.5.2?
I upgraded on the 6th of December 2016 (to fix #232). Both servers had been running stable since then.
Sorry; I'm still completely empty-handed. I didn't find anything strange by reviewing the code, and the new unit test reports nothing but happiness.
I'm wondering if you might be willing to run a custom version of 3.5.2 that would, instead of crashing, dump a bunch of additional information in the log and drop the packet. The reason why I'm hesitant to commit this enhancement to 3.5.4 is because the changes needed to pull this off are kernel-aware, and might therefore introduce new issues in other kernels.
(I'm probably being a crybaby. This kernel code hasn't changed at all for many years, and likely won't either for a long time...)
Did you update Jool to 3.5.3, and the kernel to the latest generic after the crash? If so, would you rather branch these changes off 3.5.3 instead of 3.5.2?
Aside:
When you say "within 24 hours of each other", could this by any chance be roughly the same amount of time that separated their activation? i.e., did you start the first BR, and then start the second one 24 hours later?
Are you using the atomic configuration feature?
I'd be happy to try running a version with extra debugging. Just give me a git revision to check out and I'll upgrade in my next available maintenance window.
I did not upgrade Jool following the crash (still running 3.5.2), but the kernel was upgraded to 4.4.0-79-generic after the crash. It will be upgraded to 4.4.0-81-generic (or anything newer that might come out in the mean time) on the next boot, unless you'd rather I didn't.
On December 6th the two nodes were booted five minutes after one another, they did not have the exact same uptime when crashing. To be specific, node A was booted on 2016-12-06 01:24 while node B was booted 01:29. Then node B was first to crash on 2017-06-10 16:52, followed by node A on 2017-06-11 00:15.
For what it's worth there's also a third SIIT-DC BR in my network, running virtualised in KVM. This one was also rebooted on December 6th (01:11) in order to upgrade to 4.4.0-51-generic and Jool 3.5.2. Same versions and same configuration. It has not crashed yet.
Considering that I was the one to initially suggest it, I am a little bit embarrassed to say that I still haven't gotten around to updating my automation scripts to use the atomic configuration feature...
Thanks. Working on the custom build right now. I still need to test it, so I wouldn't recommend you to download it yet.
> I did not upgrade Jool following the crash (still running 3.5.2), but the kernel was upgraded to 4.4.0-79-generic after the crash. It will be upgraded to 4.4.0-81-generic (or anything newer that might come out in the mean time) on the next boot, unless you'd rather I didn't.
:/
I can't rule out the possibility that this might be a kernel bug (which at this point feels likely), in which case this update might prevent us from seeing the crash again. But I feel that I can't ask you to downgrade to 4.4.0-51-generic since that could net you unnecessary security issues.
Incidentally, is this bug preventing you from updating Jool to 3.5.3? Or do you have some other reason to stick to 3.5.2? Because if you updated the kernel anyway, we could just move over to the other extreme and just try to debug in the latest everything.
> Considering that I was the one to initially suggest it, I am a little bit embarrassed to say that I still haven't gotten around to updating my automation scripts to use the atomic configuration feature...
Ok, no problem. That simplifies this.
> On December 6th the two nodes were booted five minutes after one another, they did not have the exact same uptime when crashing. To be specific, node A was booted on 2016-12-06 01:24 while node B was booted 01:29. Then node B was first to crash on 2017-06-10 16:52, followed by node A on 2017-06-11 00:15.
> For what it's worth there's also a third SIIT-DC BR in my network, running virtualised in KVM. This one was also rebooted on December 6th (01:11) in order to upgrade to 4.4.0-51-generic and Jool 3.5.2. Same versions and same configuration. It has not crashed yet.
Ok, thanks.
I don't recall any security fixes in the kernel since -51-generic that would be applicable to these nodes (no local users or network-exposed services), so I can downgrade if you want (provided I can still find the deb packages for -51-generic somewhere).
The reason why I haven't upgraded to 3.5.3 is simply that the changes since 3.5.2 didn't seem relevant to me. Unlike Jool (which is tied to a specific git-rev), the kernel is updated automatically on-disk by the apt packaging system, so when the node crashed and was rebooted it activated the most recent version - it wasn't a conscious decision to update the kernel but not Jool, in case I gave you that impression.
In any case, I'll run whatever versions you'd want me to. Just let me know what you prefer.
Okay, that's perfect.
(You sure you need to redownload the kernel? My Ubuntu Server doesn't uninstall old kernels by default.)
I just finished testing it. The commit is the same: 21b97b1e4fd29676ae3dd6800dc3286c70dd0690.
Please, go back to 4.4.0-51-generic to maximize the probability of getting the error again, and look out for output that looks somewhat like this:
$ dmesg
[10131.934230] Bug #247 happened!
[10131.934834] Page size: 4096
[10131.935386] Page shift: 12
[10131.935962] Inner packet l4-protocol: 6
[10131.936577] initial len: 1120
[10131.937140] initial data_len: 1000
[10131.937695] initial nr_frags: 1
[10131.938241] initial frag 0: 1000
[10131.938795] mid len: 1120
[10131.939342] mid data_len: 1000
[10131.939886] mid nr_frags: 1
[10131.940438] mid frag 0: 1000
[10131.940983] current len: 999
[10131.941515] current data_len: 1000
[10131.942031] current nr_frags: 1
[10131.942529] current frag 0: 1000
[10131.943030] Dropping packet.
It should not crash. The strings "Bug #247 happened!" and "Dropping packet." are constant, which should help with grepping.
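Conceptually, the custom build replaces the crash with something like the following (a minimal sketch of the idea, not the actual Jool patch; the helper name is made up):

#include <linux/printk.h>
#include <linux/skbuff.h>

/*
 * Minimal sketch (not the actual Jool patch): verify the invariant that the
 * kernel's pull helpers BUG_ON() on, log the skb state for this issue, and
 * tell the caller to drop the packet instead of letting the machine panic.
 */
static bool bug247_skb_sane(struct sk_buff *skb)
{
	if (likely(skb->len >= skb->data_len))
		return true; /* invariant holds; translate normally */

	pr_err("Bug #247 happened!\n");
	pr_err("current len: %u\n", skb->len);
	pr_err("current data_len: %u\n", skb->data_len);
	pr_err("current nr_frags: %u\n",
	       (unsigned int)skb_shinfo(skb)->nr_frags);
	pr_err("Dropping packet.\n");
	return false; /* caller kfree_skb()s the packet and bails out */
}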
And I guess we'll have to wait another several months.
All right. Now both the BRs that crashed earlier are running Jool 21b97b1e4fd29676ae3dd6800dc3286c70dd0690 on 4.4.0-51-generic. Then we wait...
To clarify:
To make sure that this bug does not affect more people, I have merged these tweaks (validation, debugging messages and packet drop) on the latest master. Hopefully, this will improve the probability of someone else reporting the output as well.
If you want to download Jool 3.5.4, don't let me stop you.
Downgrading to non-critical because it shouldn't crash anymore.
New crash last night with Jool v3.5.6. The events seem very similar to the one originally reported last summer. This time, the line it complains about is skbuff.h:1826, so off by one, but I'm guessing that's simply because the kernel had been upgraded (from 4.4.0-51-generic to 4.4.0-116-generic).
My two physical boxes crashed within a minute of each other, while my third virtual one stayed up and saved the day. The three nodes are anycasted, so if there was some sort of incoming stream of «packets of death» from somewhere, it'd get rerouted from node to node as each individual one crashed, causing a domino effect. If that was the case, though, it would appear that the virtual instance is immune.
Here are the traces that appeared on the serial console as they crashed:
NOOOOOOOOOOOOOOOOOOOOO, I MESSED UP THE FAILURE DETECTION CONDITIONAL orz
(╯°□°)╯︵ ┻━┻ WHAT WAS I THINKIIIIIIIIIIIIIIIIING
All right, it's worth a fresh review anyway. I'm going to be looking into it again next week.
Update to the latest commit (68b19eaf3dece04cf37aa745e768feb728d69b21) so it doesn't crash violently anymore. Hopefully I got it right this time.
Okay, I managed to design a nuke packet that triggers an error similar to the one you got.
I think that publishing the specs of the packet would be a bad idea since no one, aside from attackers, would benefit from the information. But I think it's safe to say that it's a very, very forced packet. You can't just put it on the network and expect Jool to crash; the kernel also needs to append some impressive random miracles on its own. I wouldn't say that there is a bug in the kernel since the packet technically does not break any contracts, but damn that's some convoluted black magic there.
The bug has been fixed since commit 5c90ead8b0e593154bdae1b795650ea531118d6a.
I can only conjecture as to why you're getting the crashes so close to each other, which is why I'd like to leave the data collection and validations in place for at least another year, just in case.
With the release of Jool 3.5.7 I'm going to close this issue (if you don't close it earlier, that is), but feel free to reopen if the crash finds you again, as usual.
Thanks, Tore. I'm only 99% sure that we fixed the bug, but I'm glad we got that particular crack all patched up.
Great job! :+1:
I'll upgrade to 3.5.7 at first opportunity. I'm quite okay with the details of the nuke packet not being made public, although I certainly am curious... Well, you have my e-mail address, if you're willing to share more details privately. ;-)
> I'll upgrade to 3.5.7 at first opportunity.
Ok.
Just to clarify: Remember that you don't need to wait for the release; it's just a formality so people can see an incrementing number and not install autoconf. The patches are already collapsed into master, and I run the entire test suite before committing to that.
> Well, you have my e-mail address, if you're willing to share more details privately. ;-)
I think that's fine, but I will release 3.5.7 first.
I got the following in the kernel output on one of my boxes today, running Jool 3.5.7 on kernel 4.4.0-121-generic. It's been up since the 8th of May, so basically right after we last had communication on this issue.
Even though 3.5.7 is deprecated, I thought you might be interested.
[22423863.069554] ----- JOOL OUTPUT -----
[22423863.073275] Bug #247 happened!
[22423863.074889] xlator: 1 3.5.7.0
[22423863.076155] Page size: 4096
[22423863.081809] Page shift: 12
[22423863.083446] protocols: 1 2 49
[22423863.085045] initial len: 0
[22423863.101164] initial data_len: 0
[22423863.102725] initial nr_frags: 0
[22423863.104591] mid len: 0
[22423863.106398] mid data_len: 0
[22423863.121783] mid nr_frags: 0
[22423863.123043] current len: 413
[22423863.124302] current data_len: 393
[22423863.141166] current nr_frags: 1
[22423863.142738] current frag 0: 393
[22423863.144027] skb head:ffff880403baad80 data:ffff880403baadea tail:ffff880403baadfe end:ffff880403baaec0
[22423863.163664] skb l3-hdr:ffff880403baadea l4-hdr:ffff880403baadfe payload:ffff880403baae06
[22423863.182503] packet content: a3 9c 68 7b b7 38 ce dc 94 20 dd 64 a7 ed 91 b7 a4 70 d4 b6 b7 ff e9 7c a4 a1 b8 ab 6 4 a2 90 8e 5d bf 24 b9 6e ac 5b a2 8f 23 e8 74 b0 92 a8 ac e9 70 1a 13 26 df bf 5c 90 d1 13 5 42 aa 3d 0 21 28 6a eb 4 a6 51 bf 91 ec d3 8 0 45 c0 1 b9 f1 84 0 0 37 1 55 39 8a c9 1b b5 57 ee 3c 5a 3 2 fc fd 0 0 0 0 45 0 1 9d dd 9 0 0 f5 3a ac 56 57 ee 3c 5a 8a c9 1b b5 3 0 28 0 0 0 0 0 0 0 0 5b 22 76 23 f7 16 31 91 8b 63 b7 75 95 84 4f b7 ae c8 ac bc a3 8f 72 41 65 0 64 eb d4 e7 2f e3 56 e9 7a 3a fa 26 54 25 2f 45 35 df 55 a4 ba 68 1d b3 ff 36 3c 56 6e a5 27 8e a7 ce c1 25 ea 61 17 7a 28 c9 14 3 3 0 1 1 16 3 3 0 28 0 0 0 0 0 0 0 0 5d 15 7a 38 b7 74 22 d5 13 23 62 35 24 a5 3f 5f 7 58 ae 2e 8f 64 c4 5e 3e 71 b 60 45 f9 66 5b 7b 0 43 98 65 7b 9b cb 78 c9 2f 7b 2b 91 a9 17 21 b4 bd bd ef 28 d3 41 59 3f 86 ca fd 16 3 3 0 4 e 0 0 0 74 79 16 3 3 0 4 e 0 0 0 55 57 f8 4c d2 31 ec 0 7d 2c 22 68 69 74 73
[22423863.310785] Dropping packet.
[22423863.311824] -----------------------
> Even though 3.5.7 is deprecated
It's not, though. I do not intend to work on new features for it, but we need to keep it around until 4.0 feels time-tested.
Thanks. I will work on this on the weekend.
Just... wow.
Here's a theory: The original packet wasn't IPv4, but rather, an IPv6 packet hairpinned into IPv4.
Pros:
Cons:
- alloc_skb(). This function very clearly does not page data at any point. (Objection: I'm reading the 4.4 code, which might not be the same as the 4.4.0-121-generic code. I need to look into this.)
@toreanderson: What is eam-hairpin-mode set to?
Also: The offending packet was supposedly sent from 87.238.60.90 to 138.201.27.181. According to your configuration, are these both IPv6 nodes?
TODOs:
- alloc_skb(). If we ever get a paged packet out of it, the source of the bug is clear.
138.201.27.181 is some random node on the IPv4 Internet. Belongs to a hosting provider in Germany, if whois is to be believed.
87.238.60.90 is part of an EAM (2a02:c0:202:a:18:59ff:fe3a:389c/128 - 87.238.60.90/32).
2a02:c0:202:a:18:59ff:fe3a:389c is assigned to an IPv6 server in one of our data centres.
So the packet must have originally been sent from 2a02:c0:202:a:18:59ff:fe3a:389c, reached Jool, and undergone translation to IPv4, thus getting the new source address 87.238.60.90. The resulting packet should then have been routed out across the IPv4 Internet to 138.201.27.181. There is no reason for it to be hairpinned back into an IPv6 packet.
That said, hairpinning is enabled:
$ sudo jool_siit | grep hairpin
--eam-hairpin-mode: 2 (intrinsic)
Are you planning to stick to 3.5? I added more debugging messages, but I'm not sure if you want to keep monitoring this.
In case you want to continue:
Since the bug seems to require several coincidences to line up before showing tangible symptoms, I conjecture that it might be affecting more packets than we're noticing. I've placed broader detection conditionals (thanks to the new information), and hopefully the next output will show up relatively soon.
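To give an idea of what "broader" means here, these are roughly the conditions now being watched for, one per sample output below (my paraphrase, not the actual Jool code; the helper names are made up):

#include <linux/skbuff.h>

/* "Bug #247 happened!": the len/data_len invariant is violated. */
static bool bug247_bad_invariant(struct sk_buff *skb)
{
	return skb->len < skb->data_len;
}

/* "alloc_skb() returned a paged skb!": a freshly allocated skb is expected
 * to be purely linear, so any paged data at that point is suspicious. */
static bool bug247_unexpectedly_paged(struct sk_buff *skb)
{
	return skb->data_len != 0 || skb_shinfo(skb)->nr_frags != 0;
}

/* "pskb_may_pull() failed to pull!": the kernel refused to linearize the
 * headers the translator needs to read. */
static bool bug247_pull_failed(struct sk_buff *skb, unsigned int needed)
{
	return !pskb_may_pull(skb, needed);
}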
Sample output 1:
----- JOOL OUTPUT -----
(Please report this at https://github.com/NICMx/Jool/issues/247)
Bug #247 happened!
xlator: 1 3.5.7.0
page 4096 12
protos: 1 2 1
initial len:413 data_len:393 nr_frags:1 proto:1 flags:0 frag0:393
mid len:413 data_len:357 nr_frags:1 proto:1 flags:0 frag0:357
current len:413 data_len:357 nr_frags:1 frag0:357
skb head:00000000515420d7 data:00000000d0fe8ee4 tail:00000000c2c053a2 end:00000000d89fb510
skb l3-hdr:00000000d0fe8ee4 l4-hdr:000000004e741b2c payload:0000000072c8c37e
head: <Big packet dump here>
1st frag: <Big packet dump here>
CPU: 0 PID: 11335 Comm: insmod Tainted: G C OE 4.17.0-041700-generic #201806041953
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
<Messy stack trace here>
Dropping packet.
-----------------------
Sample output 2:
----- JOOL OUTPUT -----
(Please report this at https://github.com/NICMx/Jool/issues/247)
alloc_skb() returned a paged skb!
s1:136 s2:453 s3:0
len:0 data_len:0 nr_frags:0
CPU: 0 PID: 11578 Comm: insmod Tainted: G C OE 4.17.0-041700-generic #201806041953
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
<Messy Stack Trace here>
Dropping packet.
-----------------------
Sample output 3:
----- JOOL OUTPUT -----
(Please report this at https://github.com/NICMx/Jool/issues/247)
pskb_may_pull() failed to pull!
old head:000000000ed3f0ac data:000000007de64abf tail:000000005ab342fd end:00000000cf266419 len:413 data_len:393 nr_frags:1
new head:000000000ed3f0ac data:000000007de64abf tail:000000005ab342fd end:00000000cf266419 len:413 data_len:393 nr_frags:1
CPU: 0 PID: 11719 Comm: insmod Tainted: G C OE 4.17.0-041700-generic #201806041953
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
<Messy Stack Trace here>
Dropping packet.
-----------------------
Just FYI, the <Messy Stack Trace here>s are large gibberish blocks not unlike the output that you get on kernel panics. But despite the mess, the kernel should stay afloat.
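In case it helps when reading those blocks: the trace presumably comes from something like dump_stack(), which prints a call trace much like a panic would but lets execution continue (a sketch under that assumption, not the literal Jool code; the helper name is made up):

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/skbuff.h>

/*
 * Sketch of the "loud but non-fatal" report: dump_stack() emits a call trace
 * much like a panic would, but execution continues afterwards, so the kernel
 * stays afloat and the offending packet can simply be dropped.
 */
static void bug247_report(const char *reason, struct sk_buff *skb)
{
	pr_err("----- JOOL OUTPUT -----\n");
	pr_err("(Please report this at https://github.com/NICMx/Jool/issues/247)\n");
	pr_err("%s\n", reason);
	pr_err("len:%u data_len:%u nr_frags:%u\n", skb->len, skb->data_len,
	       (unsigned int)skb_shinfo(skb)->nr_frags);
	dump_stack();	/* the "messy stack trace", minus the crash */
	pr_err("Dropping packet.\n");
	pr_err("-----------------------\n");
}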
Is it OK to stick to the new commit or would you like me to release 3.5.8?
I plan on upgrading to v4 eventually, but it probably will take a while before I get around to it. I am happy to run c4629dd06d42c566f589398907b586da3758b59e until then - you don't have to make a new release just for my sake.
Ok, the crash hasn't shown up in more than 2 years.
I don't have the stomach to remove the detection conditional yet, but I think it's high time to unclutter the issue list.
Feel free to reopen if the bug shows up again.
Within 24 hours of each other, two of our SIIT-DC BRs running SIIT Jool v3.5.2 on Ubuntu 14.04.5 with kernel 4.4.0-51-generic panicked in a very similar fashion. They had both been running stable for over half a year, so the incidents are highly suspect. Possibly another «packet of death» issue akin to issue #232?
The only information I have about it is the traces that were output to the serial console as they crashed:
Trace from SIIT-DC BR 1
Trace from SIIT-DC BR 2
Tore