NICMx / Jool

SIIT and NAT64 for Linux
GNU General Public License v2.0

Incorrect fragment handling on higher kernels #231

Closed ydahhrk closed 5 years ago

ydahhrk commented 7 years ago

After updating to a newer kernel (4.4.1), I found this several times in the log:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 3260 at /home/aleiva/Jool-3.5.0/mod/stateful/fragment_db.c:364 fragdb_handle+0x45/0x50 [jool]()
This code is supposed to be unreachable in kernels 3.13+! Please report.
Modules linked in: jool(OE) nf_defrag_ipv6 nf_defrag_ipv4 coretemp crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper vmw_balloon joydev cryptd input_leds vmwgfx serio_raw ttm drm_kms_helper drm fb_sys_fops bnep syscopyarea rfcomm sysfillrect sysimgblt bluetooth vmw_vmci shpchp 8250_fintek pata_acpi mac_hid i2c_piix4 parport_pc ppdev lp parport psmouse vmxnet3 mptspi mptscsih mptbase scsi_transport_spi floppy fjes
CPU: 0 PID: 3260 Comm: ping6 Tainted: G        W  OE   4.4.1-040401-generic #201601311534
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
 0000000000000000 00000000b6bb53ca ffff88013fc03a98 ffffffff813c8e14
 ffff88013fc03ae0 ffff88013fc03ad0 ffffffff8107dba2 ffff8800b3a02000
 ffff88013fc03b98 0000000000000018 ffff8800b47ff200 000000000000dd86
Call Trace:
 <IRQ>  [<ffffffff813c8e14>] dump_stack+0x44/0x60
 [<ffffffff8107dba2>] warn_slowpath_common+0x82/0xc0
 [<ffffffff8107dc3c>] warn_slowpath_fmt+0x5c/0x80
 [<ffffffffc02f10a5>] fragdb_handle+0x45/0x50 [jool]
 [<ffffffffc02fd2ce>] core_6to4+0xbe/0x170 [jool]
 [<ffffffff810bf802>] ? __wake_up_common+0x52/0x90
 [<ffffffffc02dd5d0>] ? nf_ct_frag6_gather+0xd0/0xde0 [nf_defrag_ipv6]
 [<ffffffff81647f2f>] ? evdev_pass_values+0x1af/0x220
 [<ffffffffc02dd0dc>] ? ipv6_defrag+0xcc/0x1b0 [nf_defrag_ipv6]
 [<ffffffff8164842e>] ? evdev_events+0xae/0xd0
 [<ffffffffc02f2805>] hook_ipv6+0x15/0x20 [jool]
 [<ffffffff8172b282>] nf_iterate+0x62/0x80
 [<ffffffff8172b313>] nf_hook_slow+0x73/0xd0
 [<ffffffff817a263e>] ipv6_rcv+0x41e/0x4c0
 [<ffffffff817a1cd0>] ? ip6_make_skb+0x1e0/0x1e0
 [<ffffffff816f4536>] __netif_receive_skb_core+0x6e6/0xa40
 [<ffffffff810ad900>] ? sched_clock_init+0x60/0x90
 [<ffffffff810a74e4>] ? check_preempt_curr+0x54/0x90
 [<ffffffff810a7539>] ? ttwu_do_wakeup+0x19/0xc0
 [<ffffffff816f48a8>] __netif_receive_skb+0x18/0x60
 [<ffffffff816f5568>] process_backlog+0xa8/0x150
 [<ffffffff816f4dd0>] net_rx_action+0x210/0x320
 [<ffffffff81082446>] __do_softirq+0xf6/0x250
 [<ffffffff817ff8cc>] do_softirq_own_stack+0x1c/0x30
 <EOI>  [<ffffffff81081f28>] do_softirq.part.19+0x38/0x40
 [<ffffffff81081fad>] __local_bh_enable_ip+0x7d/0x80
 [<ffffffff8179ea67>] ip6_finish_output2+0x1a7/0x4d0
 [<ffffffff817beb8c>] ? raw6_getfrag+0xac/0x100
 [<ffffffff817a1386>] ip6_finish_output+0xa6/0x100
 [<ffffffff817a1433>] ip6_output+0x53/0x110
 [<ffffffff817d9e27>] ? __ip6_local_out+0xb7/0xd0
 [<ffffffff817d9e75>] ip6_local_out+0x35/0x40
 [<ffffffff817a1a53>] ip6_send_skb+0x23/0x70
 [<ffffffff817a1aed>] ip6_push_pending_frames+0x4d/0x50
 [<ffffffff817bff01>] rawv6_sendmsg+0xa41/0xcd0
 [<ffffffff8121a3d0>] ? poll_select_copy_remaining+0x140/0x140
 [<ffffffff816da570>] ? sock_common_recvmsg+0x40/0x70
 [<ffffffff816d7bcb>] ? sock_recvmsg+0x3b/0x50
 [<ffffffff8176e4e5>] inet_sendmsg+0x65/0xa0
 [<ffffffff816d80f8>] sock_sendmsg+0x38/0x50
 [<ffffffff816d86e1>] SYSC_sendto+0x101/0x190
 [<ffffffff816d9781>] ? __sys_recvmsg+0x51/0x90
 [<ffffffff810f06c5>] ? ktime_get_ts64+0x45/0xf0
 [<ffffffff816d920e>] SyS_sendto+0xe/0x10
 [<ffffffff817fdbb6>] entry_SYSCALL_64_fastpath+0x16/0x75
---[ end trace 79082aedf6dc6af0 ]---

This is a warning, not a panic. It affects NAT64 only. Tested on Jool 3.5. The relevant and surrounding code did not change between 3.4 and 3.5, so this is likely a bug in both Jool series. Will confirm later.

Affects fragmented packets. I don't know what happens to them; they probably get dropped. While trying to replicate it, I seem to have triggered it once by querying either Steam or the PlayStation Store for the first time in a while. If my traffic really caused it, the end nodes managed to stabilize the connection automatically; I didn't notice any disruptions.

ydahhrk commented 7 years ago

Totally a 3.4 bug as well.

It seems to happen when the packets are both paged and fragmented. This seems to prevent nf_defrag_ipv6 from purging the fragment header. Indeed, the packet is getting dropped silently. I guess TCP is noticing this, hence the automatic fix.
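
The condition described above (a packet that still carries a Fragment header after the defragmenter has seen it) can be illustrated in userspace. This is only a rough sketch, not Jool's or the kernel's code: `find_fragment_header`, `FragInfo` and `build_ipv6` are hypothetical names invented for the example, and it only walks the Hop-by-Hop, Routing and Destination Options headers before looking for the Fragment header (next header value 44).

```python
import struct
from typing import NamedTuple, Optional

# IPv6 "next header" values relevant to this sketch.
NEXTHDR_HOP = 0
NEXTHDR_ROUTING = 43
NEXTHDR_FRAGMENT = 44
NEXTHDR_DEST = 60
IPV6_HDR_LEN = 40


class FragInfo(NamedTuple):
    offset: int            # fragment offset, in bytes
    more_fragments: bool   # the M flag
    identification: int


def find_fragment_header(packet: bytes) -> Optional[FragInfo]:
    """Walk the extension header chain; return the Fragment header, if any."""
    if len(packet) < IPV6_HDR_LEN or packet[0] >> 4 != 6:
        raise ValueError("not an IPv6 packet")
    next_hdr = packet[6]
    pos = IPV6_HDR_LEN
    while next_hdr in (NEXTHDR_HOP, NEXTHDR_ROUTING,
                       NEXTHDR_DEST, NEXTHDR_FRAGMENT):
        if pos + 8 > len(packet):
            raise ValueError("truncated extension header")
        if next_hdr == NEXTHDR_FRAGMENT:
            # Layout: next header, reserved, offset (13 bits) + flags, id.
            _, _, off_flags, ident = struct.unpack_from("!BBHI", packet, pos)
            # The offset field counts 8-byte units, so masking out the
            # three low bits yields the offset in bytes directly.
            return FragInfo(off_flags & 0xFFF8, bool(off_flags & 1), ident)
        # Other headers: length is in 8-byte units, excluding the first 8.
        next_hdr = packet[pos]
        pos += (packet[pos + 1] + 1) * 8
    return None


def build_ipv6(next_hdr: int, payload: bytes) -> bytes:
    """Hypothetical helper: minimal IPv6 header with all-zero addresses."""
    header = struct.pack("!IHBB", 6 << 28, len(payload), next_hdr, 64)
    return header + bytes(32) + payload


# A fragment at byte offset 1448 with the M flag set, carrying ICMPv6 (58).
frag_hdr = struct.pack("!BBHI", 58, 0, 1448 | 1, 0xDD86)
fragment = build_ipv6(NEXTHDR_FRAGMENT, frag_hdr + bytes(8))
reassembled = build_ipv6(58, bytes(8))

print(find_fragment_header(fragment))
print(find_fragment_header(reassembled))
```

A translator that relies on an upstream defragmenter would expect the second case (`None`) for every packet it receives; the warning in the log corresponds to the first case reaching it anyway.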

This changes the way Jool has to cope with nf_defrag_ipv6. I'm going to have to study this [Link was dead] again.

ydahhrk commented 6 years ago

Planning to remove the defrag dependency in Jool 4, so this will be indirectly fixed by #140.

ydahhrk commented 6 years ago

"This code is supposed to be unreachable in kernels 3.13+! Please report."

Turns out that this bug took too long to get fixed, and the kernels that were supposed to reach the stated code are seemingly phasing out of relevance. Right now, the earliest most relevant kernels that I'm aware of are

At this point it seems that the sensible solution is to just drop support for kernels 3.12- and call it a day. This is the current plan. If anyone disagrees, please comment.

ydahhrk commented 5 years ago

Released; closing.