Fun times ahead. ^^"
Thanks. It's pretty safe to assume that it crashed here. The line is saying "throw a panic if len is less than data_len". The fact that data_len was nonzero (as both fields are unsigned) implies that the packet was paged. Whether a packet is paged or not depends on random factors, which unfortunately means that this is not likely going to be as deterministic as 242. But at least the bug can be traced back to very few packet fields.
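For reference, the check in question lives in the kernel's skbuff.h. Paraphrased from the vanilla 4.4 sources (the Ubuntu -generic build might differ slightly), it looks roughly like this:

/* Paraphrased from include/linux/skbuff.h (vanilla Linux 4.4); not verbatim.
 * skb->len is the total packet length, and skb->data_len is the portion of
 * it that lives in paged fragments rather than in the linear buffer. */
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
	skb->len -= len;
	BUG_ON(skb->len < skb->data_len);	/* the line that panics */
	return skb->data += len;
}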
When did you update to 3.5.2?
I upgraded on the 6th of December 2016 (to fix #232). Both servers had been running stable since then.
Sorry; I'm still completely empty-handed. I didn't find anything strange by reviewing the code, and the new unit test reports nothing but happiness.
I'm wondering if you might be willing to run a custom version of 3.5.2 that would, instead of crashing, dump a bunch of additional information in the log and drop the packet. The reason why I'm hesitant to commit this enhancement to 3.5.4 is because the changes needed to pull this off are kernel-aware, and might therefore introduce new issues in other kernels.
(I'm probably being a crybaby. This kernel code hasn't changed at all for many years, and likely won't either for a long time...)
Did you update Jool to 3.5.3, and the kernel to the latest generic after the crash? If so, would you rather branch these changes off 3.5.3 instead of 3.5.2?
Aside:
When you say "within 24 hours of each other", could this by any chance be roughly the same amount of time that separated their activation? i.e., did you start the first BR, and then start the second one 24 hours later?
Are you using the atomic configuration feature?
I'd be happy to try running a version with extra debugging. Just give me a git revision to check out and I'll upgrade in my next available maintenance window.
I did not upgrade Jool following the crash (still running 3.5.2), but the kernel was upgraded to 4.4.0-79-generic after the crash. It will be upgraded to 4.4.0-81-generic (or anything newer that might come out in the mean time) on the next boot, unless you'd rather I didn't.
On December 6th the two nodes were booted five minutes after one another, they did not have the exact same uptime when crashing. To be specific, node A was booted on 2016-12-06 01:24 while node B was booted 01:29. Then node B was first to crash on 2017-06-10 16:52, followed by node A on 2017-06-11 00:15.
For what it's worth there's also a third SIIT-DC BR in my network, running virtualised in KVM. This one was also rebooted on December 6th (01:11) in order to upgrade to 4.4.0-51-generic and Jool 3.5.2. Same versions and same configuration. It has not crashed yet.
Considering that I was the one to initially suggest it, I am a little bit embarrassed to say that I still haven't gotten around to updating my automation scripts to use the atomic configuration feature...
Thanks. Working on the custom build right now. I still need to test it, so I wouldn't recommend you to download it yet.
> I did not upgrade Jool following the crash (still running 3.5.2), but the kernel was upgraded to 4.4.0-79-generic after the crash. It will be upgraded to 4.4.0-81-generic (or anything newer that might come out in the mean time) on the next boot, unless you'd rather I didn't.
:/
I can't rule out the possibility that this might be a kernel bug (which at this point feels likely), in which case this update might prevent us from seeing the crash again. But I feel that I can't ask you to downgrade to 4.4.0-51-generic since that could net you unnecessary security issues.
Incidentally, is this bug preventing you from updating Jool to 3.5.3? Or do you have some other reason to stick to 3.5.2? Because if you updated the kernel anyway, we could just move over to the other extreme and just try to debug in the latest everything.
> Considering that I was the one to initially suggest it, I am a little bit embarrassed to say that I still haven't gotten around to updating my automation scripts to use the atomic configuration feature...
Ok, no problem. That simplifies this.
> On December 6th the two nodes were booted five minutes after one another, they did not have the exact same uptime when crashing. To be specific, node A was booted on 2016-12-06 01:24 while node B was booted 01:29. Then node B was first to crash on 2017-06-10 16:52, followed by node A on 2017-06-11 00:15.
> For what it's worth there's also a third SIIT-DC BR in my network, running virtualised in KVM. This one was also rebooted on December 6th (01:11) in order to upgrade to 4.4.0-51-generic and Jool 3.5.2. Same versions and same configuration. It has not crashed yet.
Ok, thanks.
I don't recall any security fixes in the kernel since -51-generic that would be applicable to these nodes (no local users or network-exposed services), so I can downgrade if you want (provided I can still find the deb packages for -51-generic somewhere).
The reason why I haven't upgraded to 3.5.3 is simply that the changes since 3.5.2 didn't seem relevant to me. Unlike Jool (which is tied to a specific git-rev), the kernel is updated automatically on-disk by the apt packaging system, so when the node crashed and was rebooted it activated the most recent version - it wasn't a conscious decision to update the kernel but not Jool, in case I gave you that impression.
In any case, I'll run whatever versions you'd want me to. Just let me know what you prefer.
Okay, that's perfect.
(You sure you need to redownload the kernel? My Ubuntu Server doesn't uninstall old kernels by default.)
I just finished testing it. The commit is the same: 21b97b1e4fd29676ae3dd6800dc3286c70dd0690.
Please, go back to 4.4.0-51-generic to maximize the probability of getting the error again, and look out for output that looks somewhat like this:
$ dmesg
[10131.934230] Bug #247 happened!
[10131.934834] Page size: 4096
[10131.935386] Page shift: 12
[10131.935962] Inner packet l4-protocol: 6
[10131.936577] initial len: 1120
[10131.937140] initial data_len: 1000
[10131.937695] initial nr_frags: 1
[10131.938241] initial frag 0: 1000
[10131.938795] mid len: 1120
[10131.939342] mid data_len: 1000
[10131.939886] mid nr_frags: 1
[10131.940438] mid frag 0: 1000
[10131.940983] current len: 999
[10131.941515] current data_len: 1000
[10131.942031] current nr_frags: 1
[10131.942529] current frag 0: 1000
[10131.943030] Dropping packet.
It should not crash. The strings "Bug #247 happened!" and "Dropping packet." are constant, which should help with grepping.
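Conceptually, the custom build replaces the crash with something like the following (a minimal sketch of the idea, not the actual Jool patch; the helper name is made up):

#include <linux/printk.h>
#include <linux/skbuff.h>

/*
 * Minimal sketch (not the actual Jool patch): verify the invariant that the
 * kernel's pull helpers BUG_ON() on, log the skb state for this issue, and
 * tell the caller to drop the packet instead of letting the machine panic.
 */
static bool bug247_skb_sane(struct sk_buff *skb)
{
	if (likely(skb->len >= skb->data_len))
		return true; /* invariant holds; translate normally */

	pr_err("Bug #247 happened!\n");
	pr_err("current len: %u\n", skb->len);
	pr_err("current data_len: %u\n", skb->data_len);
	pr_err("current nr_frags: %u\n",
	       (unsigned int)skb_shinfo(skb)->nr_frags);
	pr_err("Dropping packet.\n");
	return false; /* caller kfree_skb()s the packet and bails out */
}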
And I guess we'll have to wait another several months.
All right. Now both the BRs that crashed earlier are running Jool 21b97b1e4fd29676ae3dd6800dc3286c70dd0690 on 4.4.0-51-generic. Then we wait...
To clarify:
To make sure that this bug does not affect more people, I have merged these tweaks (validation, debugging messages and packet drop) on the latest master. Hopefully, this will improve the probability of someone else reporting the output as well.
If you want to download Jool 3.5.4, don't let me stop you.
Downgrading to non-critical because it shouldn't crash anymore.
New crash last night with Jool v3.5.6. The events seem very similar to the one originally reported last summer. This time, the line it complains about is skbuff.h:1826, so off by one, but I'm guessing that's simply because the kernel had been upgraded (from 4.4.0-51-generic to 4.4.0-116-generic).
My two physical boxes crashed within a minute of each other, while my third virtual one stayed up and saved the day. The three nodes are anycasted, so if there was some sort of incoming stream of «packets of death» from somewhere, it'd get rerouted from node to node as each individual one crashed, causing a domino effect. If that was the case, though, it would appear that the virtual instance is immune.
Here are the traces that appeared on the serial console as they crashed:
NOOOOOOOOOOOOOOOOOOOOO, I MESSED UP THE FAILURE DETECTION CONDITIONAL orz
(╯°□°)╯︵ ┻━┻ WHAT WAS I THINKIIIIIIIIIIIIIIIIING
All right, it's worth a fresh review anyway. I'm going to be looking into it again next week.
Update to the latest commit (68b19eaf3dece04cf37aa745e768feb728d69b21) so it doesn't crash violently anymore. Hopefully I got it right this time.
Okay, I managed to design a nuke packet that triggers an error similar to the one you got.
I think that publishing the specs of the packet would be a bad idea since no one, aside from attackers, would benefit from the information. But I think it's safe to say that it's a very, very forced packet. You can't just put it on the network and expect Jool to crash; the kernel also needs to append some impressive random miracles on its own. I wouldn't say that there is a bug in the kernel since the packet technically does not break any contracts, but damn that's some convoluted black magic there.
The bug has been fixed since commit 5c90ead8b0e593154bdae1b795650ea531118d6a.
I can only conjecture as to why you're getting the crashes so close to each other, which is why I'd like to leave the data collection and validations in place for at least another year, just in case.
With the release of Jool 3.5.7 I'm going to close this issue (if you don't close it earlier, that is), but feel free to reopen if the crash finds you again, as usual.
Thanks, Tore. I'm only 99% sure that we fixed the bug, but I'm glad we got that particular crack all patched up.
Great job! :+1:
I'll upgrade to 3.5.7 at first opportunity. I'm quite okay with the details of the nuke packet not being made public, although I certainly am curious... Well, you have my e-mail address, if you're willing to share more details privately. ;-)
> I'll upgrade to 3.5.7 at first opportunity.
Ok.
Just to clarify: Remember that you don't need to wait for the release; it's just a formality so people can see an incrementing number and not install autoconf. The patches are already collapsed into master, and I run the entire test suite before committing to that.
> Well, you have my e-mail address, if you're willing to share more details privately. ;-)
I think that's fine, but I will release 3.5.7 first.
I got the following in the kernel output on one of my boxes today, running Jool 3.5.7 on kernel 4.4.0-121-generic. It's been up since the 8th of May, so basically right after we last had communication on this issue.
Even though 3.5.7 is deprecated, I thought you might be interested.
[22423863.069554] ----- JOOL OUTPUT -----
[22423863.073275] Bug #247 happened!
[22423863.074889] xlator: 1 3.5.7.0
[22423863.076155] Page size: 4096
[22423863.081809] Page shift: 12
[22423863.083446] protocols: 1 2 49
[22423863.085045] initial len: 0
[22423863.101164] initial data_len: 0
[22423863.102725] initial nr_frags: 0
[22423863.104591] mid len: 0
[22423863.106398] mid data_len: 0
[22423863.121783] mid nr_frags: 0
[22423863.123043] current len: 413
[22423863.124302] current data_len: 393
[22423863.141166] current nr_frags: 1
[22423863.142738] current frag 0: 393
[22423863.144027] skb head:ffff880403baad80 data:ffff880403baadea tail:ffff880403baadfe end:ffff880403baaec0
[22423863.163664] skb l3-hdr:ffff880403baadea l4-hdr:ffff880403baadfe payload:ffff880403baae06
[22423863.182503] packet content: a3 9c 68 7b b7 38 ce dc 94 20 dd 64 a7 ed 91 b7 a4 70 d4 b6 b7 ff e9 7c a4 a1 b8 ab 6 4 a2 90 8e 5d bf 24 b9 6e ac 5b a2 8f 23 e8 74 b0 92 a8 ac e9 70 1a 13 26 df bf 5c 90 d1 13 5 42 aa 3d 0 21 28 6a eb 4 a6 51 bf 91 ec d3 8 0 45 c0 1 b9 f1 84 0 0 37 1 55 39 8a c9 1b b5 57 ee 3c 5a 3 2 fc fd 0 0 0 0 45 0 1 9d dd 9 0 0 f5 3a ac 56 57 ee 3c 5a 8a c9 1b b5 3 0 28 0 0 0 0 0 0 0 0 5b 22 76 23 f7 16 31 91 8b 63 b7 75 95 84 4f b7 ae c8 ac bc a3 8f 72 41 65 0 64 eb d4 e7 2f e3 56 e9 7a 3a fa 26 54 25 2f 45 35 df 55 a4 ba 68 1d b3 ff 36 3c 56 6e a5 27 8e a7 ce c1 25 ea 61 17 7a 28 c9 14 3 3 0 1 1 16 3 3 0 28 0 0 0 0 0 0 0 0 5d 15 7a 38 b7 74 22 d5 13 23 62 35 24 a5 3f 5f 7 58 ae 2e 8f 64 c4 5e 3e 71 b 60 45 f9 66 5b 7b 0 43 98 65 7b 9b cb 78 c9 2f 7b 2b 91 a9 17 21 b4 bd bd ef 28 d3 41 59 3f 86 ca fd 16 3 3 0 4 e 0 0 0 74 79 16 3 3 0 4 e 0 0 0 55 57 f8 4c d2 31 ec 0 7d 2c 22 68 69 74 73
[22423863.310785] Dropping packet.
[22423863.311824] -----------------------
> Even though 3.5.7 is deprecated
It's not, though. I do not intend to work on new features for it, but we need to keep it around until 4.0 feels time-tested.
Thanks. I will work on this on the weekend.
Just... wow.
Here's a theory: The original packet wasn't IPv4, but rather, an IPv6 packet hairpinned into IPv4.
Pros:
Cons:
- alloc_skb(). This function very clearly does not page data at any point. (Objection: I'm reading the 4.4 code, which might not be the same as the 4.4.0-121-generic code. I need to look into this.)
@toreanderson: What is eam-hairpin-mode set to?
Also: The offending packet was supposedly sent from 87.238.60.90 to 138.201.27.181. According to your configuration, are these both IPv6 nodes?
TODOs:
- alloc_skb(). If we ever get a paged packet out of it, the source of the bug is clear.
138.201.27.181 is some random node on the IPv4 Internet. Belongs to a hosting provider in Germany, if whois is to be believed.
87.238.60.90 is part of an EAM (2a02:c0:202:a:18:59ff:fe3a:389c/128 - 87.238.60.90/32).
2a02:c0:202:a:18:59ff:fe3a:389c is assigned to an IPv6 server in one of our data centres.
So the packet must have originally been sent from 2a02:c0:202:a:18:59ff:fe3a:389c, reached Jool, and undergone translation to IPv4, thus getting the new source address 87.238.60.90. The resulting packet should then have been routed out across the IPv4 Internet to 138.201.27.181. There is no reason for it to be hairpinned back into an IPv6 packet.
That said, hairpinning is enabled:
$ sudo jool_siit | grep hairpin
--eam-hairpin-mode: 2 (intrinsic)
Are you planning to stick to 3.5? I added more debugging messages, but I'm not sure if you want to keep monitoring this.
In case you want to continue:
Since the bug seems to require several coincidences to line up before showing tangible symptoms, I conjecture that it might be affecting more packets than we're noticing. I've placed broader detection conditionals (thanks to the new information), and hopefully the next output will show up relatively soon.
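To give an idea of what "broader" means here, these are roughly the conditions now being watched for, one per sample output below (my paraphrase, not the actual Jool code; the helper names are made up):

#include <linux/skbuff.h>

/* "Bug #247 happened!": the len/data_len invariant is violated. */
static bool bug247_bad_invariant(struct sk_buff *skb)
{
	return skb->len < skb->data_len;
}

/* "alloc_skb() returned a paged skb!": a freshly allocated skb is expected
 * to be purely linear, so any paged data at that point is suspicious. */
static bool bug247_unexpectedly_paged(struct sk_buff *skb)
{
	return skb->data_len != 0 || skb_shinfo(skb)->nr_frags != 0;
}

/* "pskb_may_pull() failed to pull!": the kernel refused to linearize the
 * headers the translator needs to read. */
static bool bug247_pull_failed(struct sk_buff *skb, unsigned int needed)
{
	return !pskb_may_pull(skb, needed);
}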
Sample output 1:
----- JOOL OUTPUT -----
(Please report this at https://github.com/NICMx/Jool/issues/247)
Bug #247 happened!
xlator: 1 3.5.7.0
page 4096 12
protos: 1 2 1
initial len:413 data_len:393 nr_frags:1 proto:1 flags:0 frag0:393
mid len:413 data_len:357 nr_frags:1 proto:1 flags:0 frag0:357
current len:413 data_len:357 nr_frags:1 frag0:357
skb head:00000000515420d7 data:00000000d0fe8ee4 tail:00000000c2c053a2 end:00000000d89fb510
skb l3-hdr:00000000d0fe8ee4 l4-hdr:000000004e741b2c payload:0000000072c8c37e
head: <Big packet dump here>
1st frag: <Big packet dump here>
CPU: 0 PID: 11335 Comm: insmod Tainted: G C OE 4.17.0-041700-generic #201806041953
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
<Messy stack trace here>
Dropping packet.
-----------------------
Sample output 2:
----- JOOL OUTPUT -----
(Please report this at https://github.com/NICMx/Jool/issues/247)
alloc_skb() returned a paged skb!
s1:136 s2:453 s3:0
len:0 data_len:0 nr_frags:0
CPU: 0 PID: 11578 Comm: insmod Tainted: G C OE 4.17.0-041700-generic #201806041953
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
<Messy Stack Trace here>
Dropping packet.
-----------------------
Sample output 3:
----- JOOL OUTPUT -----
(Please report this at https://github.com/NICMx/Jool/issues/247)
pskb_may_pull() failed to pull!
old head:000000000ed3f0ac data:000000007de64abf tail:000000005ab342fd end:00000000cf266419 len:413 data_len:393 nr_frags:1
new head:000000000ed3f0ac data:000000007de64abf tail:000000005ab342fd end:00000000cf266419 len:413 data_len:393 nr_frags:1
CPU: 0 PID: 11719 Comm: insmod Tainted: G C OE 4.17.0-041700-generic #201806041953
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
<Messy Stack Trace here>
Dropping packet.
-----------------------
Just FYI, the <Messy Stack Trace here>s are large gibberish blocks not unlike the output that you get on kernel panics. But despite the mess, the kernel should stay afloat.
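In case it helps when reading those blocks: the trace presumably comes from something like dump_stack(), which prints a call trace much like a panic would but lets execution continue (a sketch under that assumption, not the literal Jool code; the helper name is made up):

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/skbuff.h>

/*
 * Sketch of the "loud but non-fatal" report: dump_stack() emits a call trace
 * much like a panic would, but execution continues afterwards, so the kernel
 * stays afloat and the offending packet can simply be dropped.
 */
static void bug247_report(const char *reason, struct sk_buff *skb)
{
	pr_err("----- JOOL OUTPUT -----\n");
	pr_err("(Please report this at https://github.com/NICMx/Jool/issues/247)\n");
	pr_err("%s\n", reason);
	pr_err("len:%u data_len:%u nr_frags:%u\n", skb->len, skb->data_len,
	       (unsigned int)skb_shinfo(skb)->nr_frags);
	dump_stack();	/* the "messy stack trace", minus the crash */
	pr_err("Dropping packet.\n");
	pr_err("-----------------------\n");
}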
Is it OK to stick to the new commit or would you like me to release 3.5.8?
I plan on upgrading to v4 eventually, but it probably will take a while before I get around to it. I am happy to run c4629dd06d42c566f589398907b586da3758b59e until then - you don't have to make a new release just for my sake.
Ok, the crash hasn't shown up in more than 2 years.
I don't have the stomach to remove the detection conditional yet, but I think it's high time to unclutter the issue list.
Feel free to reopen if the bug shows up again.
Within 24 hours of each other, two of our SIIT-DC BRs running SIIT Jool v3.5.2 on Ubuntu 14.04.5 with kernel 4.4.0-51-generic panicked in a very similar fashion. They had both been running stable for over half a year, so the incidents are highly suspect. Possibly another «packet of death» issue akin to issue #232?
The only information I have about it is the traces that were output to the serial console as they crashed:
Trace from SIIT-DC BR 1
Trace from SIIT-DC BR 2
Tore