NICMx / Jool

SIIT and NAT64 for Linux
GNU General Public License v2.0

OOM on Jool boxes, possible leak? #410

Closed suffieldacademy closed 1 week ago

suffieldacademy commented 1 year ago

We have two jool boxes in an active/active load-sharing setup that we're testing for our campus. Things have been fine for months in limited testing. This week we added more test clients to the boxes and have been getting several machine lockups requiring a hard reboot. The message on the screen is typically out of memory.

This is jool 4.1.8.0 on Debian Bullseye.

I'm not a kernel expert, but in some of the other memory-related issue reports people mentioned /proc/slabinfo. I sampled it every 2 seconds, and the "jool_joold_nodes" line is increasing constantly (this is on a machine that's been up less than 2 hours; 'jool session display' lists approximately 12,000 sessions):

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3804464 3804464    120   34    1 : tunables    0    0    0 : slabdata 111896 111896      0
session_nodes      14106  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14532  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3804668 3804668    120   34    1 : tunables    0    0    0 : slabdata 111902 111902      0
session_nodes      14129  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14564  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3805620 3805620    120   34    1 : tunables    0    0    0 : slabdata 111930 111930      0
session_nodes      14150  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14557  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3805824 3805824    120   34    1 : tunables    0    0    0 : slabdata 111936 111936      0
session_nodes      14125  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14539  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0
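
(For anyone reproducing this: a loop along these lines yields samples like the above; just a sketch, not necessarily the exact command used.)

# needs root to read /proc/slabinfo
while true; do
    grep -E '^(jool_joold_nodes|session_nodes|bib_nodes|jool_xlations) ' /proc/slabinfo
    echo
    sleep 2
done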

I sampled active sessions in jool and even when those decreased, the slabs continued to increase. Meanwhile, "available" memory (as reported by top/free) has been steadily decreasing (several MiB per minute). Since we've increased the number of users, the machines have needed a reboot in as few as 20 hours.

I don't know enough about the kernel structures to know what jool_joold_nodes represents, but I'm guessing it shouldn't be monotonically increasing. I'm happy to gather any additional data that may be helpful.

One of my two boxes is locked up at the moment, but once I'm back to fully redundant I can try things like restarting jool or unloading the kernel module to see if we can recover memory without rebooting.

suffieldacademy commented 1 year ago

Some additional data:

After 12 hours I'm up to about 3GiB of memory in use.

The machine that most recently locked up showed OOM messages. The processes that were killed were all listed as <2MiB RSS, so my guess is that jool is eating up kernel memory:

[107231.748887] Out of memory: Killed process 2866 (systemd) total-vm:164920kB, anon-rss:0kB, file-rss:124kB, shmem-rss:0kB, UID:0 pgtables:88kB oom_score_adj:0
[107250.573133] Out of memory: Killed process 2865 (cron) total-vm:6684kB, anon-rss:0kB, file-rss:1696kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
[107250.587664] Out of memory: Killed process 970 (joold) total-vm:151412kB, anon-rss:0kB, file-rss:344kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:0

I rebooted the machine but didn't put it back into service immediately so I could get a baseline on memory. It was receiving sync state information, but not forwarding any traffic. I watched /proc/slabinfo for a bit and while session_nodes and bib_nodes were going up, jool_joold_nodes stayed constant.

As soon as I started forwarding traffic through the box, jool_joold_nodes started to increase regularly.

I then failed over and looked at the machine that was running out of memory. As soon as it wasn't forwarding traffic, jool_joold_nodes stayed constant even as session_nodes and bib_nodes fluctuated. So it seems that the memory consumption is specifically from packet processing and not states.

I restarted the joold processes (I have 2 instances in different netns), and as soon as I did that /proc/slabinfo showed the jool_joold_nodes count drop to < 1000. Free memory on the box shot back up to 3.5GiB. I did not need to unload the kernel module; simply restarting the joold instance was enough to reclaim the memory.

Any thoughts on what I should check on next?

gehoern commented 1 year ago

We have a similar issue and are not sure what data you need:

Jool 4.1.10 on kernel 6.1.42, and we are seeing increases of about 1.5 MB/minute.

In slabtop, the kmalloc-256 entry keeps increasing as long as jool is running. If jool is stopped, the growth stops, BUT our memory is not freed.

ydahhrk commented 1 year ago

I believe I fixed it. Patch uploaded to branch issue410.

I noticed two more bugs while reviewing joold, so I will need to work on this further before the release is ready.

For the sake of transparency:

  1. joold advertise doesn't do anything because it also relies on joold->queue, but fixing it is more complicated than the baseline patch above.
  2. Similarly, ss-capacity isn't doing anything. (Which is why you never saw the warning, even though too many sessions are queuing.)
suffieldacademy commented 1 year ago

Thank you so much for looking into this! I'm just trying to understand your bug descriptions and have a couple questions:

To answer your musing in the commit message, I am definitely using Jool! My campus is working toward being IPv6-native internally, which means I need NAT64 to provide connectivity to the legacy internet. My other favorite firewall/router (OpenBSD) doesn't have all the NAT64 features that Jool does, which is why we're using Jool. I realize the project is in bugfix/maintenance mode at this point, but since all the functionality I need is already there, I'm hoping we can keep running it to provide this service.

Thanks!

ydahhrk commented 1 year ago

Is joold advertise not working in the current code?

I didn't actually test it, but the code review says "yes." (By which I mean it doesn't work at all.)

Should I enable ss-flush-asap?

Yes; ss-flush-asap true is mandatory in Active/Active. Notice it's true by default, so you probably already have it.

Your commit message made it sound like that might prevent some of the queuing.

Sure, but the queuing is probably already fixed.

I am definitely using Jool!

I'm glad, but I was specifically wondering about joold.

It's an optional feature that has often been problematic (to the point that I actually regret having implemented it), and the number of bug reports doesn't seem to match the state of the code. I mean, I casually found three bugs in a single review. That's not healthy.

suffieldacademy commented 1 year ago

Yes; ss-flush-asap true is mandatory in Active/Active. Notice it's true by default, so you probably already have it.

We actually had it off (we're doing active/passive, but with two instances as a kind of poor man's load balancing). I've enabled it again for now.
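
For reference, this is just a per-instance global flag, so checking and flipping it looks something like the following (using the namespace and instance names that show up later in this thread; adjust to your own setup):

# ip netns exec sa-jool-ns jool -i nat64-wkp-lower global display | grep ss-flush-asap
# ip netns exec sa-jool-ns jool -i nat64-wkp-lower global update ss-flush-asap true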

I'm glad, but I was specifically wondering about joold.

My apologies for the misunderstanding; I thought it was the project as a whole. In terms of joold, a vote of support for that as well. We picked Jool in part due to the state sync feature as we wanted to run things in a cluster.

Thanks again!

ydahhrk commented 1 year ago

Ok, all three bugs have been fixed in the issue410 branch.

Two notes:

Can you run it from source? It'd be great to get some feedback before I release it.

suffieldacademy commented 1 year ago

Sorry for the delay on this; had some other emergencies that kept getting in the way.

I have checked out the issue410 branch from git and built Jool from source. I changed our script to use ss-max-sessions-per-packet and dialed the value down to 8 (we're using HSR, which has a lower MTU, so I wanted to make sure we weren't going over).
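
(Assuming the patched build exposes it like the other ss-* globals, setting it is presumably just:)

# ip netns exec sa-jool-ns jool -i nat64-wkp-lower global update ss-max-sessions-per-packet 8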

Everything is up and running and I've done some light testing. /proc/slabinfo shows bib entries going up, but no runaway joold_nodes, which is great!

I'll give it a few days to see if memory use climbs, or if we note any regressions. Thanks for the quick patch!

ydahhrk commented 1 year ago

Reopening until release.

suffieldacademy commented 1 year ago

And just to provide an update, memory usage is holding steady on the machines, with no mention of joold_nodes. We're still in light testing (students have not returned to campus), but the indicators seem very promising.

suffieldacademy commented 1 year ago

Update from today. It looks like we might still be leaking memory, but not in the same place (and not nearly as quickly, since I just noticed this after 1+ week of uptime). Would you like me to open a separate issue to track it? Short version: kernel memory is growing, and this time it's all being hogged by hugepages:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
22849668 22849358  99%    0.11K 634713       36   2538852K khugepaged_mm_slot

I will see if unloading the kernel module resets it when I have a minute to try.

suffieldacademy commented 1 year ago

I had a minute. ;-)

Shutting down the jool instance and unloading the kernel modules did not release the memory. I'm rebooting this box now to recover the RAM and we'll keep an eye on it.

So, not certain that it's a jool issue (e.g., could be netfilter or something else), but that's all that this box is doing so it seems connected.

ydahhrk commented 1 year ago

Would you like me to open a separate issue to track?

I personally don't mind either way. I suppose it'd depend on whether gehoern wants to keep getting notifications from this thread.

Short version is kernel memory is growing, and this time it's all being hogged by hugepages:

Do you still get increasing hugepages if you disable joold?

If you run joold in active/passive mode, which instance gets the hugepages?


Most of Jool's allocations are tracked by this thing.

If you enable it, wait for some hugepages then unload the module, maybe Jool will tell us what it's leaking.

$ make CFLAGS_MODULE="-DJKMEMLEAK"
$ sudo make install
$ sudo dmesg -C
$ sudo modprobe jool && sudo modprobe -r jool
$ sudo dmesg -ct
Jool: Core Jool v4.1.10.0 module inserted.
NAT64 Jool v4.1.10.0 module inserted.
NAT64 Jool v4.1.10.0 module removed.
Jool: Memory leaks:
Jool: None.
Jool: Core Jool v4.1.10.0 module removed.
suffieldacademy commented 1 year ago

We're active/active so it's happening on both hosts. I'm trying to turn down journaling to make sure it's not just logging that's chewing up memory.

I recompiled with the requested flag on one of the machines. Starting it back up and watching dmesg for a few minutes, I got:

[Mon Aug 28 21:01:13 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:02:31 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:03:36 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:04:44 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:05:45 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:06:49 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:07:54 2023] Jool WARNING (send_to_userspace_prepare): joold: Too many sessions deferred! I need to drop some; sorry.
[Mon Aug 28 21:08:39 2023] ------------[ cut here ]------------
[Mon Aug 28 21:08:39 2023] We're out of sync: Incoming session entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d891:6d62#3478, 72.10.119.12#55668, 216.145.109.98#3478, UDP] collides with DB entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d891:6d62#3478, 72
[Mon Aug 28 21:08:39 2023] WARNING: CPU: 0 PID: 4753 at /var/lib/dkms/jool/4.1.10.git.v4.1.10.7.g4fcfe184/build/src/mod/common/log.c:53 log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023] Modules linked in: jool_siit(OE) jool(OE) jool_common(OE) nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink hsr(OE) binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common nls_ascii sb_edac nls_cp437 vfat fat x86_pkg_temp_thermal intel_powerclamp coretemp joydev ghash_clmulni_intel sha512_ssse3 dcdbas sha512_generic hid_generic mgag200 usbhid ipmi_si aesni_intel hid crypto_simd drm_shmem_helper cryptd rapl drm_kms_helper intel_cstate evdev i2c_algo_bit wdat_wdt ipmi_devintf intel_uncore watchdog ipmi_msghandler pcspkr acpi_pad button acpi_power_meter mei_me sg mei ioatdma drm loop fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mirror dm_region_hash dm_log dm_mod sd_mod t10_pi sr_mod crc64_rocksoft cdrom crc64 crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32_pclmul ahci crc32c_intel ixgbe libahci ehci_pci libata ehci_hcd lpc_ich megaraid_sas xfrm_algo dca usbcore mdio_devres tg3
[Mon Aug 28 21:08:39 2023]  scsi_mod usb_common mdio libphy scsi_common ptp pps_core wmi [last unloaded: jool_common(OE)]
[Mon Aug 28 21:08:39 2023] CPU: 0 PID: 4753 Comm: joold Tainted: G        W  OE      6.1.0-11-amd64 #1  Debian 6.1.38-4
[Mon Aug 28 21:08:39 2023] Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.7.0 05/23/2018
[Mon Aug 28 21:08:39 2023] RIP: 0010:log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023] Code: 00 48 89 44 24 10 e8 bf 98 db de 65 8b 05 f8 c1 20 3f f6 c4 ff 0f 84 51 74 01 00 48 89 de 48 c7 c7 9f 4b e3 c0 e8 00 90 48 de <0f> 0b 48 89 df e8 36 da 69 de 48 8b 44 24 18 65 48 2b 04 25 28 00
[Mon Aug 28 21:08:39 2023] RSP: 0018:ffffb614c07f7508 EFLAGS: 00010282
[Mon Aug 28 21:08:39 2023] RAX: 0000000000000000 RBX: ffff944235450800 RCX: 0000000000000000
[Mon Aug 28 21:08:39 2023] RDX: 0000000000000203 RSI: ffffffffa054191e RDI: 00000000ffffffff
[Mon Aug 28 21:08:39 2023] RBP: ffffb614c07f7568 R08: 0000000000000000 R09: ffffb614c07f7378
[Mon Aug 28 21:08:39 2023] R10: 0000000000000003 R11: ffff94432ff23e90 R12: ffffffffc0e2fcc8
[Mon Aug 28 21:08:39 2023] R13: ffffffffc0e3594b R14: 000000000000b445 R15: 0000000000000d96
[Mon Aug 28 21:08:39 2023] FS:  00007fbc421876c0(0000) GS:ffff94432b200000(0000) knlGS:0000000000000000
[Mon Aug 28 21:08:39 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Aug 28 21:08:39 2023] CR2: 00007fcc7145ff88 CR3: 0000000044d76003 CR4: 00000000000606f0
[Mon Aug 28 21:08:39 2023] Call Trace:
[Mon Aug 28 21:08:39 2023]  <TASK>
[Mon Aug 28 21:08:39 2023]  ? __warn+0x7d/0xc0
[Mon Aug 28 21:08:39 2023]  ? log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? report_bug+0xe6/0x170
[Mon Aug 28 21:08:39 2023]  ? handle_bug+0x41/0x70
[Mon Aug 28 21:08:39 2023]  ? exc_invalid_op+0x13/0x60
[Mon Aug 28 21:08:39 2023]  ? asm_exc_invalid_op+0x16/0x20
[Mon Aug 28 21:08:39 2023]  ? log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? log_err+0xc0/0x120 [jool_common]
[Mon Aug 28 21:08:39 2023]  collision_cb+0xd5/0xf0 [jool_common]
[Mon Aug 28 21:08:39 2023]  decide_fate+0x67/0x230 [jool_common]
[Mon Aug 28 21:08:39 2023]  bib_add_session+0x1b1/0x320 [jool_common]
[Mon Aug 28 21:08:39 2023]  add_new_session+0xba/0x110 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? add_new_session+0x110/0x110 [jool_common]
[Mon Aug 28 21:08:39 2023]  joold_sync+0xaf/0xe0 [jool_common]
[Mon Aug 28 21:08:39 2023]  handle_joold_add+0x69/0x100 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? handling_hairpinning_siit+0x170/0x170 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? is_hairpin_nat64+0x40/0x40 [jool_common]
[Mon Aug 28 21:08:39 2023]  genl_family_rcv_msg_doit+0x100/0x160
[Mon Aug 28 21:08:39 2023]  genl_rcv_msg+0x122/0x250
[Mon Aug 28 21:08:39 2023]  ? handle_instance_flush+0x70/0x70 [jool_common]
[Mon Aug 28 21:08:39 2023]  ? genl_start+0x160/0x160
[Mon Aug 28 21:08:39 2023]  netlink_rcv_skb+0x51/0x100
[Mon Aug 28 21:08:39 2023]  genl_rcv+0x24/0x40
[Mon Aug 28 21:08:39 2023]  netlink_unicast+0x242/0x390
[Mon Aug 28 21:08:39 2023]  netlink_sendmsg+0x250/0x4c0
[Mon Aug 28 21:08:39 2023]  sock_sendmsg+0x5f/0x70
[Mon Aug 28 21:08:39 2023]  ____sys_sendmsg+0x277/0x2f0
[Mon Aug 28 21:08:39 2023]  ? copy_msghdr_from_user+0x7d/0xc0
[Mon Aug 28 21:08:39 2023]  ___sys_sendmsg+0x9a/0xe0
[Mon Aug 28 21:08:39 2023]  __sys_sendmsg+0x76/0xc0
[Mon Aug 28 21:08:39 2023]  do_syscall_64+0x5b/0xc0
[Mon Aug 28 21:08:39 2023]  ? fpregs_assert_state_consistent+0x22/0x50
[Mon Aug 28 21:08:39 2023]  ? exit_to_user_mode_prepare+0x40/0x1d0
[Mon Aug 28 21:08:39 2023]  ? syscall_exit_to_user_mode+0x17/0x40
[Mon Aug 28 21:08:39 2023]  ? do_syscall_64+0x67/0xc0
[Mon Aug 28 21:08:39 2023]  entry_SYSCALL_64_after_hwframe+0x69/0xd3
[Mon Aug 28 21:08:39 2023] RIP: 0033:0x7fbc42a9594d
[Mon Aug 28 21:08:39 2023] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 7a ad f7 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 ce ad f7 ff 48
[Mon Aug 28 21:08:39 2023] RSP: 002b:00007fbc421864b0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[Mon Aug 28 21:08:39 2023] RAX: ffffffffffffffda RBX: 000055649ab67960 RCX: 00007fbc42a9594d
[Mon Aug 28 21:08:39 2023] RDX: 0000000000000000 RSI: 00007fbc42186510 RDI: 0000000000000004
[Mon Aug 28 21:08:39 2023] RBP: 00007fbc34000da0 R08: 0000000000000000 R09: 0000000000000000
[Mon Aug 28 21:08:39 2023] R10: 0000000000000000 R11: 0000000000000293 R12: 000055649ab67a00
[Mon Aug 28 21:08:39 2023] R13: 00007fbc42186510 R14: 00007fff834a14b0 R15: 00007fbc41987000
[Mon Aug 28 21:08:39 2023]  </TASK>
[Mon Aug 28 21:08:39 2023] ---[ end trace 0000000000000000 ]---
[Mon Aug 28 21:08:39 2023] ------------[ cut here ]------------
[Mon Aug 28 21:08:39 2023] We're out of sync: Incoming session entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d85d:f612#3478, 72.10.119.12#55668, 216.93.246.18#3478, UDP] collides with DB entry [2607:f460:1833:bf:c436:763e:41b3:6064#46149, 64:ff9b::d85d:f612#3478, 72.

... additional stack traces ...

Several traces happened in a short time frame (dozens), but the IP/ports seem to be the same for most (I only count two distinct tuples among all the messages). After that it got quiet again for a bit, with just some more "Too many sessions deferred" messages.

I can send the full dmesg output for this brief test if that's helpful.

ydahhrk commented 1 year ago

Incoming session entry [...] collides with DB entry [...]

Just to be clear: This is not really strange, it just means the two instances did not start at the same time (an advertise wouldn't go amiss), or joold wasn't able to sync fast enough. I should probably reduce the severity of the message.
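
(For reference, an advertise is issued manually from userspace; a sketch using the namespace/instance names from this thread:)

# ip netns exec sa-jool-ns jool -i nat64-wkp-lower joold advertise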

joold: Too many sessions deferred! I need to drop some; sorry.

This is indeed strange.


Also, none of this is a consequence of the -DJKMEMLEAK flag (unless it's because it's slowing everything down). Did it print something when you removed the module?

suffieldacademy commented 1 year ago

Sorry, I didn't realize I had to unload the module. When I did, there wasn't anything in dmesg about memory, just the unloading messages:

[Wed Aug 30 07:50:42 2023] Jool: Deleted instance 'nat64-wkp-lower'.
[Wed Aug 30 07:50:42 2023] Jool: Deleted instance 'siit-dmz'.
[Wed Aug 30 07:50:42 2023] Jool: Deleted instance 'nat64-wkp-upper'.
[Wed Aug 30 07:50:51 2023] NAT64 Jool v4.1.10.0 module removed.
[Wed Aug 30 07:50:51 2023] SIIT Jool v4.1.10.0 module removed.
[Wed Aug 30 07:50:51 2023] Jool: Core Jool v4.1.10.0 module removed.

Is there somewhere else I should check?

ydahhrk commented 1 year ago

Is there somewhere else I should check?

No. Did you uninstall the Debian package? If both the Debian package and a custom Jool are installed, the former takes precedence.

sudo apt remove jool-dkms jool-tools

Note, if you uninstall the Debian package, you also lose the systemd unit file. If you're starting Jool via systemd, here are instructions to restore the unit files. Start with the curls, end with the daemon-reload. Mind your config files.

I uploaded a new commit to issue410. It's optional. It reduces the severity of the session desync message, and it also bumps Jool's version number slightly, so you can more easily tell whether you're running the custom binary or the Debian-packaged Jool:

$ make CFLAGS_MODULE="-DJKMEMLEAK"
$ sudo make install
$ sudo dmesg -C
$ sudo modprobe jool
$ sudo dmesg -ct
Jool: Core Jool v4.1.10.1 module inserted.
NAT64 Jool v4.1.10.1 module inserted.
ydahhrk commented 1 year ago

I'm still worried about you getting so many Too many sessions deferred! messages.

So new commit. Version number bumped to 4.1.10.2.

This one prints a few extra stats that might help us monitor kernel-side joold's state:

$ sudo jool stat display --all | grep JSTAT_JOOLD
JSTAT_JOOLD_EMPTY: 63
JSTAT_JOOLD_TIMEOUT: 1
JSTAT_JOOLD_MISSING_ACK: 3
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 5
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 6
JSTAT_JOOLD_SSS_SENT: 6
JSTAT_JOOLD_SSS_RCVD: 0
JSTAT_JOOLD_PKT_SENT: 6
JSTAT_JOOLD_PKT_RCVD: 0
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 6

You can get their descriptions by adding --explain:

$ sudo jool stat display --all --explain
...

JSTAT_JOOLD_TIMEOUT: 1
Joold packet sent; ss-flush-deadline reached.

JSTAT_JOOLD_MISSING_ACK: 3
Joold packet not sent; still waiting for ACK.

...
suffieldacademy commented 1 year ago

OK, recompiled off that branch with the new commits. Been running for about 16 hours. I took a stats dump and then unloaded the module. Still not seeing any leak info in dmesg.

We have 3 instances (1 SIIT, 2 NAT64) running on the box; here are the stats from all:

# ip netns exec sa-jool-ns jool_siit -i siit-dmz stats display --all
JSTAT_RECEIVED6: 76039
JSTAT_RECEIVED4: 306243
JSTAT_SUCCESS: 381909
JSTAT_BIB_ENTRIES: 0
JSTAT_SESSIONS: 0
JSTAT_ENOMEM: 0
JSTAT_XLATOR_DISABLED: 0
JSTAT_POOL6_UNSET: 0
JSTAT_SKB_SHARED: 0
JSTAT_L3HDR_OFFSET: 0
JSTAT_SKB_TRUNCATED: 0
JSTAT_HDR6: 0
JSTAT_HDR4: 0
JSTAT_UNKNOWN_L4_PROTO: 0
JSTAT_UNKNOWN_ICMP6_TYPE: 0
JSTAT_UNKNOWN_ICMP4_TYPE: 8
JSTAT_DOUBLE_ICMP6_ERROR: 0
JSTAT_DOUBLE_ICMP4_ERROR: 0
JSTAT_UNKNOWN_PROTO_INNER: 0
JSTAT_HAIRPIN_LOOP: 0
JSTAT_POOL6_MISMATCH: 0
JSTAT_POOL4_MISMATCH: 0
JSTAT_ICMP6_FILTER: 0
JSTAT_UNTRANSLATABLE_DST6: 0
JSTAT_UNTRANSLATABLE_DST4: 0
JSTAT_6056_F: 0
JSTAT_MASK_DOMAIN_NOT_FOUND: 0
JSTAT_BIB6_NOT_FOUND: 0
JSTAT_BIB4_NOT_FOUND: 0
JSTAT_SESSION_NOT_FOUND: 0
JSTAT_ADF: 0
JSTAT_V4_SYN: 0
JSTAT_SYN6_EXPECTED: 0
JSTAT_SYN4_EXPECTED: 0
JSTAT_TYPE1PKT: 0
JSTAT_TYPE2PKT: 0
JSTAT_SO_EXISTS: 0
JSTAT_SO_FULL: 0
JSTAT64_SRC: 0
JSTAT64_DST: 0
JSTAT64_PSKB_COPY: 0
JSTAT64_6791_ENOENT: 0
JSTAT64_ICMP_CSUM: 0
JSTAT64_UNTRANSLATABLE_PARAM_PROB_PTR: 0
JSTAT64_TTL: 0
JSTAT64_FRAGMENTED_ICMP: 0
JSTAT64_2XFRAG: 0
JSTAT64_FRAG_THEN_EXT: 0
JSTAT64_SEGMENTS_LEFT: 0
JSTAT46_SRC: 0
JSTAT46_DST: 0
JSTAT46_PSKB_COPY: 0
JSTAT46_6791_ENOENT: 0
JSTAT46_ICMP_CSUM: 0
JSTAT46_UNTRANSLATABLE_PARAM_PROBLEM_PTR: 0
JSTAT46_TTL: 13
JSTAT46_FRAGMENTED_ICMP: 0
JSTAT46_SRC_ROUTE: 0
JSTAT46_FRAGMENTED_ZERO_CSUM: 352
JSTAT46_BAD_MTU: 0
JSTAT_FAILED_ROUTES: 0
JSTAT_PKT_TOO_BIG: 0
JSTAT_DST_OUTPUT: 0
JSTAT_ICMP6ERR_SUCCESS: 0
JSTAT_ICMP6ERR_FAILURE: 0
JSTAT_ICMP4ERR_SUCCESS: 365
JSTAT_ICMP4ERR_FAILURE: 0
JSTAT_ICMPEXT_BIG: 0
JSTAT_JOOLD_EMPTY: 0
JSTAT_JOOLD_TIMEOUT: 0
JSTAT_JOOLD_MISSING_ACK: 0
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 0
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 0
JSTAT_JOOLD_SSS_SENT: 0
JSTAT_JOOLD_SSS_RCVD: 0
JSTAT_JOOLD_PKT_SENT: 0
JSTAT_JOOLD_PKT_RCVD: 0
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 0
JSTAT_UNKNOWN: 0
# ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display --all
JSTAT_RECEIVED6: 22673034
JSTAT_RECEIVED4: 53629379
JSTAT_SUCCESS: 76179608
JSTAT_BIB_ENTRIES: 13646
JSTAT_SESSIONS: 13669
JSTAT_ENOMEM: 0
JSTAT_XLATOR_DISABLED: 0
JSTAT_POOL6_UNSET: 0
JSTAT_SKB_SHARED: 0
JSTAT_L3HDR_OFFSET: 0
JSTAT_SKB_TRUNCATED: 11
JSTAT_HDR6: 0
JSTAT_HDR4: 0
JSTAT_UNKNOWN_L4_PROTO: 26
JSTAT_UNKNOWN_ICMP6_TYPE: 0
JSTAT_UNKNOWN_ICMP4_TYPE: 0
JSTAT_DOUBLE_ICMP6_ERROR: 0
JSTAT_DOUBLE_ICMP4_ERROR: 0
JSTAT_UNKNOWN_PROTO_INNER: 0
JSTAT_HAIRPIN_LOOP: 0
JSTAT_POOL6_MISMATCH: 0
JSTAT_POOL4_MISMATCH: 9527
JSTAT_ICMP6_FILTER: 0
JSTAT_UNTRANSLATABLE_DST6: 0
JSTAT_UNTRANSLATABLE_DST4: 0
JSTAT_6056_F: 0
JSTAT_MASK_DOMAIN_NOT_FOUND: 0
JSTAT_BIB6_NOT_FOUND: 0
JSTAT_BIB4_NOT_FOUND: 19683
JSTAT_SESSION_NOT_FOUND: 194
JSTAT_ADF: 0
JSTAT_V4_SYN: 45656
JSTAT_SYN6_EXPECTED: 45549
JSTAT_SYN4_EXPECTED: 783
JSTAT_TYPE1PKT: 0
JSTAT_TYPE2PKT: 0
JSTAT_SO_EXISTS: 0
JSTAT_SO_FULL: 0
JSTAT64_SRC: 0
JSTAT64_DST: 0
JSTAT64_PSKB_COPY: 0
JSTAT64_6791_ENOENT: 0
JSTAT64_ICMP_CSUM: 0
JSTAT64_UNTRANSLATABLE_PARAM_PROB_PTR: 0
JSTAT64_TTL: 0
JSTAT64_FRAGMENTED_ICMP: 0
JSTAT64_2XFRAG: 0
JSTAT64_FRAG_THEN_EXT: 0
JSTAT64_SEGMENTS_LEFT: 0
JSTAT46_SRC: 0
JSTAT46_DST: 0
JSTAT46_PSKB_COPY: 0
JSTAT46_6791_ENOENT: 0
JSTAT46_ICMP_CSUM: 0
JSTAT46_UNTRANSLATABLE_PARAM_PROBLEM_PTR: 0
JSTAT46_TTL: 120
JSTAT46_FRAGMENTED_ICMP: 0
JSTAT46_SRC_ROUTE: 0
JSTAT46_FRAGMENTED_ZERO_CSUM: 0
JSTAT46_BAD_MTU: 0
JSTAT_FAILED_ROUTES: 1256
JSTAT_PKT_TOO_BIG: 0
JSTAT_DST_OUTPUT: 0
JSTAT_ICMP6ERR_SUCCESS: 0
JSTAT_ICMP6ERR_FAILURE: 0
JSTAT_ICMP4ERR_SUCCESS: 120
JSTAT_ICMP4ERR_FAILURE: 0
JSTAT_ICMPEXT_BIG: 0
JSTAT_JOOLD_EMPTY: 15405972
JSTAT_JOOLD_TIMEOUT: 23
JSTAT_JOOLD_MISSING_ACK: 60790827
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 28703736
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 76164725
JSTAT_JOOLD_SSS_SENT: 74939867
JSTAT_JOOLD_SSS_RCVD: 138148403
JSTAT_JOOLD_PKT_SENT: 28703759
JSTAT_JOOLD_PKT_RCVD: 34168029
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 28705795
JSTAT_UNKNOWN: 0
# ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display --all
JSTAT_RECEIVED6: 2036
JSTAT_RECEIVED4: 0
JSTAT_SUCCESS: 2036
JSTAT_BIB_ENTRIES: 1678
JSTAT_SESSIONS: 1679
JSTAT_ENOMEM: 0
JSTAT_XLATOR_DISABLED: 0
JSTAT_POOL6_UNSET: 0
JSTAT_SKB_SHARED: 0
JSTAT_L3HDR_OFFSET: 0
JSTAT_SKB_TRUNCATED: 0
JSTAT_HDR6: 0
JSTAT_HDR4: 0
JSTAT_UNKNOWN_L4_PROTO: 0
JSTAT_UNKNOWN_ICMP6_TYPE: 0
JSTAT_UNKNOWN_ICMP4_TYPE: 0
JSTAT_DOUBLE_ICMP6_ERROR: 0
JSTAT_DOUBLE_ICMP4_ERROR: 0
JSTAT_UNKNOWN_PROTO_INNER: 0
JSTAT_HAIRPIN_LOOP: 0
JSTAT_POOL6_MISMATCH: 0
JSTAT_POOL4_MISMATCH: 0
JSTAT_ICMP6_FILTER: 0
JSTAT_UNTRANSLATABLE_DST6: 0
JSTAT_UNTRANSLATABLE_DST4: 0
JSTAT_6056_F: 0
JSTAT_MASK_DOMAIN_NOT_FOUND: 0
JSTAT_BIB6_NOT_FOUND: 0
JSTAT_BIB4_NOT_FOUND: 0
JSTAT_SESSION_NOT_FOUND: 0
JSTAT_ADF: 0
JSTAT_V4_SYN: 0
JSTAT_SYN6_EXPECTED: 0
JSTAT_SYN4_EXPECTED: 0
JSTAT_TYPE1PKT: 0
JSTAT_TYPE2PKT: 0
JSTAT_SO_EXISTS: 0
JSTAT_SO_FULL: 0
JSTAT64_SRC: 0
JSTAT64_DST: 0
JSTAT64_PSKB_COPY: 0
JSTAT64_6791_ENOENT: 0
JSTAT64_ICMP_CSUM: 0
JSTAT64_UNTRANSLATABLE_PARAM_PROB_PTR: 0
JSTAT64_TTL: 0
JSTAT64_FRAGMENTED_ICMP: 0
JSTAT64_2XFRAG: 0
JSTAT64_FRAG_THEN_EXT: 0
JSTAT64_SEGMENTS_LEFT: 0
JSTAT46_SRC: 0
JSTAT46_DST: 0
JSTAT46_PSKB_COPY: 0
JSTAT46_6791_ENOENT: 0
JSTAT46_ICMP_CSUM: 0
JSTAT46_UNTRANSLATABLE_PARAM_PROBLEM_PTR: 0
JSTAT46_TTL: 0
JSTAT46_FRAGMENTED_ICMP: 0
JSTAT46_SRC_ROUTE: 0
JSTAT46_FRAGMENTED_ZERO_CSUM: 0
JSTAT46_BAD_MTU: 0
JSTAT_FAILED_ROUTES: 0
JSTAT_PKT_TOO_BIG: 0
JSTAT_DST_OUTPUT: 0
JSTAT_ICMP6ERR_SUCCESS: 0
JSTAT_ICMP6ERR_FAILURE: 0
JSTAT_ICMP4ERR_SUCCESS: 0
JSTAT_ICMP4ERR_FAILURE: 0
JSTAT_ICMPEXT_BIG: 0
JSTAT_JOOLD_EMPTY: 28665501
JSTAT_JOOLD_TIMEOUT: 2035
JSTAT_JOOLD_MISSING_ACK: 0
JSTAT_JOOLD_AD_ONGOING: 0
JSTAT_JOOLD_FLUSH_ASAP: 1
JSTAT_JOOLD_PKT_FULL: 0
JSTAT_JOOLD_QUEUING: 0
JSTAT_JOOLD_SSS_QUEUED: 2036
JSTAT_JOOLD_SSS_SENT: 2036
JSTAT_JOOLD_SSS_RCVD: 138965356
JSTAT_JOOLD_PKT_SENT: 2036
JSTAT_JOOLD_PKT_RCVD: 34271807
JSTAT_JOOLD_ADS: 0
JSTAT_JOOLD_ACKS: 28635463
JSTAT_UNKNOWN: 0

[Thu Aug 31 14:19:44 2023] Jool: Deleted instance 'nat64-wkp-upper'.
[Thu Aug 31 14:19:44 2023] Jool: Deleted instance 'nat64-wkp-lower'.
[Thu Aug 31 14:19:44 2023] Jool: Deleted instance 'siit-dmz'.
[Thu Aug 31 14:19:52 2023] NAT64 Jool v4.1.10.2 module removed.
[Thu Aug 31 14:19:53 2023] SIIT Jool v4.1.10.2 module removed.
[Thu Aug 31 14:19:53 2023] Jool: Core Jool v4.1.10.2 module removed.
[Thu Aug 31 14:19:53 2023] ixgbe 0000:05:00.0: removed PHC on enp5s0f0
[Thu Aug 31 14:19:53 2023] ixgbe 0000:05:00.1: removed PHC on enp5s0f1
[Thu Aug 31 14:19:53 2023] device eno3 left promiscuous mode
[Thu Aug 31 14:19:53 2023] device eno4 left promiscuous mode

However, I'm not sure if the numbers for any of these counters are alarming. Anything I should try to drill down on?

ydahhrk commented 1 year ago

We have 3 instances (1 SIIT, 2 NAT64) running on the box; here are the stats from all:

Waitwaitwaitwaitwait. WhAt iS HaPpEnInG?!?

  1. Why is there an SIIT in your setup and why do you think it's relevant?
  2. What do you mean "on the box"? Are all three instances running on the same machine?

The numbers more or less tell me that both NAT64 instances are receiving ALL the joold packets; even the ones they themselves multicasted. This is backwards.

And it seems to confirm the theory that they're running on the same machine. They're subscribed to literally the exact same Netlink multicast group... in the same kernel.

We're going to have to zoom out for a sec, and scrutinize your setup. Please draw your network, provide your configuration, and explain what you're trying to do. (If you need to obfuscate your IPs, please obfuscate them consistently.)

The entire point of Active/Active is load balancing. You don't get anything out of the load balancing if both artifacts are running on the same processor. If all you want is a backup instance in case one of them crashes, PLEASE do Active/Passive, and also, place the instances in different (at least virtual) machines.

Jool is a kernel module, which means that if one of them crashes, the entire system will go down with it. (And that includes the other Jool, if it's on the same system.)

Still not seeing any leak info in dmesg.

You don't need to wait 16 hours to find out if you did this correctly.

Just compile the module with -DJKMEMLEAK, reinstall it, then quickly insert and remove it, then immediately query dmesg:

$ make CFLAGS_MODULE="-DJKMEMLEAK"
$ sudo make install
$ sudo modprobe jool && sudo modprobe -r jool; sudo dmesg -ct
Jool: Core Jool v4.1.10.2 module inserted.
NAT64 Jool v4.1.10.2 module inserted.
NAT64 Jool v4.1.10.2 module removed.
Jool: Memory leaks:
Jool: None.
Jool: Core Jool v4.1.10.2 module removed.

If you don't see the "Memory leaks:" line right away, something went wrong.

suffieldacademy commented 1 year ago

I realize I haven't explained the architecture and it sounds like I'm crazy. I'll DM you the details on the network as it's more than I can easily fit here (and I don't want to post a full net diagram). It probably is crazy, but hopefully not in a breaking sort of way.

For now, with regard to the module, I think the issue is that I'm installing via dkms. When I do the "make", that must only build userspace (make install doesn't change what's in /lib/modules). How do I pass the MEMLEAK flags to dkms?

ydahhrk commented 12 months ago

How do I pass the MEMLEAK flags to dkms?

One way is to add -DJKMEMLEAK to ccflags-y at src/mod/common/Kbuild:

ccflags-y := -I$(src)/../.. -DJKMEMLEAK
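
After editing the Kbuild, rebuild the module through dkms; roughly like this, where the exact name/version pair comes from dkms status (placeholder below):

$ dkms status
$ sudo dkms build jool/<version> --force
$ sudo dkms install jool/<version> --force
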
suffieldacademy commented 12 months ago

Thank you, that is what we were missing! I did a rebuild with a module load/unload and am now seeing the message:

[Sun Sep 10 14:19:06 2023] NAT64 Jool v4.1.10.3 module removed.
[Sun Sep 10 14:19:06 2023] SIIT Jool v4.1.10.3 module removed.
[Sun Sep 10 14:19:06 2023] Jool: Memory leaks:
[Sun Sep 10 14:19:06 2023] Jool: None.
[Sun Sep 10 14:19:06 2023] Jool: Core Jool v4.1.10.3 module removed.

I will keep gathering stats with the new commits from yesterday.

suffieldacademy commented 11 months ago

Been a busy week... unloaded the Jool modules on the box with the patches to check for memory leaks, and it looks like there are some:

[Fri Sep 15 14:11:25 2023] NAT64 Jool v4.1.10.3 module removed.
[Fri Sep 15 14:11:25 2023] SIIT Jool v4.1.10.3 module removed.
[Fri Sep 15 14:11:26 2023] Jool: Memory leaks:
[Fri Sep 15 14:11:26 2023] - joold session: 18223133
[Fri Sep 15 14:11:26 2023] ------------[ cut here ]------------
[Fri Sep 15 14:11:26 2023] Total leaks: 1
[Fri Sep 15 14:11:26 2023] WARNING: CPU: 0 PID: 8251 at /var/lib/dkms/jool/4.1.10.git.v4.1.10.7.g4fcfe184/build/src/mod/common/wkmalloc.c:111 wkmalloc_print_leaks+0x6f/0x78 [jool_common]
[Fri Sep 15 14:11:26 2023] Modules linked in: nft_compat nf_tables libcrc32c nfnetlink jool_common(OE-) hsr(OE) binfmt_misc ipmi_ssif nls_ascii nls_cp437 vfat intel_rapl_msr intel_rapl_common joydev sb_edac fat x86_pkg_temp_thermal intel_powerclamp coretemp hid_generic ghash_clmulni_intel usbhid sha512_ssse3 sha512_generic mgag200 wdat_wdt hid drm_shmem_helper aesni_intel watchdog drm_kms_helper crypto_simd cryptd dcdbas rapl evdev intel_cstate i2c_algo_bit ipmi_si intel_uncore ipmi_devintf ipmi_msghandler pcspkr acpi_power_meter button acpi_pad mei_me sg mei ioatdma drm efi_pstore loop fuse configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mirror dm_region_hash dm_log dm_mod sd_mod t10_pi sr_mod crc64_rocksoft cdrom crc64 crc_t10dif crct10dif_generic ahci crct10dif_pclmul crct10dif_common crc32_pclmul libahci ehci_pci crc32c_intel ehci_hcd libata ixgbe usbcore lpc_ich megaraid_sas usb_common xfrm_algo dca mdio_devres mdio tg3 scsi_mod libphy scsi_common ptp
[Fri Sep 15 14:11:26 2023]  pps_core wmi [last unloaded: jool_siit(OE)]
[Fri Sep 15 14:11:26 2023] CPU: 0 PID: 8251 Comm: modprobe Tainted: G           OE      6.1.0-11-amd64 #1  Debian 6.1.38-4
[Fri Sep 15 14:11:26 2023] Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.7.0 05/23/2018
[Fri Sep 15 14:11:26 2023] RIP: 0010:wkmalloc_print_leaks+0x6f/0x78 [jool_common]
[Fri Sep 15 14:11:26 2023] Code: e8 e2 0d 08 d3 48 89 c3 eb d5 85 ed 75 0e 5b 48 c7 c7 a2 d7 b4 c0 5d e9 e0 df 09 d3 89 ee 48 c7 c7 b1 d7 b4 c0 e8 3d 7f 75 d2 <0f> 0b 5b 5d c3 cc cc cc cc 48 c7 c7 68 89 b4 c0 e8 bd df 09 d3 48
[Fri Sep 15 14:11:26 2023] RSP: 0018:ffffb08a800a7e18 EFLAGS: 00010282
[Fri Sep 15 14:11:26 2023] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[Fri Sep 15 14:11:26 2023] RDX: 0000000000000002 RSI: ffffffff9454191e RDI: 00000000ffffffff
[Fri Sep 15 14:11:26 2023] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffb08a800a7c88
[Fri Sep 15 14:11:26 2023] R10: 0000000000000003 R11: ffff8a77aff3b118 R12: 0000000000000800
[Fri Sep 15 14:11:26 2023] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[Fri Sep 15 14:11:26 2023] FS:  00007febcea8d040(0000) GS:ffff8a77ab200000(0000) knlGS:0000000000000000
[Fri Sep 15 14:11:26 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Sep 15 14:11:26 2023] CR2: 00007febce467b20 CR3: 0000000105510006 CR4: 00000000000606f0
[Fri Sep 15 14:11:26 2023] Call Trace:
[Fri Sep 15 14:11:26 2023]  <TASK>
[Fri Sep 15 14:11:26 2023]  ? __warn+0x7d/0xc0
[Fri Sep 15 14:11:26 2023]  ? wkmalloc_print_leaks+0x6f/0x78 [jool_common]
[Fri Sep 15 14:11:26 2023]  ? report_bug+0xe6/0x170
[Fri Sep 15 14:11:26 2023]  ? handle_bug+0x41/0x70
[Fri Sep 15 14:11:26 2023]  ? exc_invalid_op+0x13/0x60
[Fri Sep 15 14:11:26 2023]  ? asm_exc_invalid_op+0x16/0x20
[Fri Sep 15 14:11:26 2023]  ? wkmalloc_print_leaks+0x6f/0x78 [jool_common]
[Fri Sep 15 14:11:26 2023]  ? wkmalloc_print_leaks+0x6f/0x78 [jool_common]
[Fri Sep 15 14:11:26 2023]  jool_exit+0x44/0x204 [jool_common]
[Fri Sep 15 14:11:26 2023]  __do_sys_delete_module+0x1d5/0x320
[Fri Sep 15 14:11:26 2023]  do_syscall_64+0x5b/0xc0
[Fri Sep 15 14:11:26 2023]  ? fpregs_assert_state_consistent+0x22/0x50
[Fri Sep 15 14:11:26 2023]  ? exit_to_user_mode_prepare+0x40/0x1d0
[Fri Sep 15 14:11:26 2023]  ? syscall_exit_to_user_mode+0x17/0x40
[Fri Sep 15 14:11:26 2023]  ? do_syscall_64+0x67/0xc0
[Fri Sep 15 14:11:26 2023]  ? do_syscall_64+0x67/0xc0
[Fri Sep 15 14:11:26 2023]  ? exit_to_user_mode_prepare+0x40/0x1d0
[Fri Sep 15 14:11:26 2023]  entry_SYSCALL_64_after_hwframe+0x69/0xd3
[Fri Sep 15 14:11:26 2023] RIP: 0033:0x7febce528887
[Fri Sep 15 14:11:26 2023] Code: 73 01 c3 48 8b 0d 79 85 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 49 85 0c 00 f7 d8 64 89 01 48
[Fri Sep 15 14:11:26 2023] RSP: 002b:00007fff52608b28 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[Fri Sep 15 14:11:26 2023] RAX: ffffffffffffffda RBX: 00005621a719fd20 RCX: 00007febce528887
[Fri Sep 15 14:11:26 2023] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005621a719fd88
[Fri Sep 15 14:11:26 2023] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
[Fri Sep 15 14:11:26 2023] R10: 00007febce59bac0 R11: 0000000000000206 R12: 0000000000000000
[Fri Sep 15 14:11:26 2023] R13: 0000000000000000 R14: 00007fff52608b60 R15: 00007fff5260a018
[Fri Sep 15 14:11:26 2023]  </TASK>
[Fri Sep 15 14:11:26 2023] ---[ end trace 0000000000000000 ]---
[Fri Sep 15 14:11:26 2023] Jool: Core Jool v4.1.10.3 module removed.
[Fri Sep 15 14:11:26 2023] ixgbe 0000:05:00.0: removed PHC on enp5s0f0
[Fri Sep 15 14:11:26 2023] ixgbe 0000:05:00.1: removed PHC on enp5s0f1
[Fri Sep 15 14:11:26 2023] device eno3 left promiscuous mode
[Fri Sep 15 14:11:26 2023] device eno4 left promiscuous mode

Not quite sure how to interpret the numbers there; I'm missing at least 1GB of memory compared to a freshly-rebooted machine. I know some of that is legitimate state data but there must be a fair amount of leak there too.

ydahhrk commented 11 months ago

Thanks, that narrows it down.

New handful of commits in issue410. Hopefully, these will be the last ones.

- joold session: 18223133

Assuming the memory leak I just patched is the only one, all of these sessions are getting dropped because of ss-capacity. I'm assuming you're still getting this warning a lot?

joold: Too many sessions deferred! I need to drop some; sorry.

I hope you still have ss-flush-asap enabled. The number should decrease with some buffering.

Of course you should no longer see the number in the memory leaks list, so I made a new stat counter for it:

$ jool st d --all --explain
...
JSTAT_JOOLD_SSS_ENOSPC: 0
Joold: Total sessions dropped because the queue was full.
...

... That said, those are really a lot of dropped sessions. I'm not sure the buffering will be enough as it is. At the risk of some portability trouble, I compressed the session object further. Please review your ss-max-sessions-per-packet.

suffieldacademy commented 11 months ago

OK, I've applied the patch to both boxes and confirmed that ss-flush-asap is still on (it was). We had a little hiccup, as the change in session format meant my two boxes weren't syncing, but now that I've applied the patch on both, things look normal again.

I've changed ss-max-sessions-per-packet from 8 to 26 (I have a slightly lower MTU of 1494 so playing it safe).

I am still getting Too Many Sessions errors, but not nearly as many as previously. I'll leave this running for a bit, and then you can let me know what stats you'd like me to collect.

suffieldacademy commented 11 months ago

Just some intermediate data after almost 24 hours of live traffic. Looks like a decent number of ENOSPC, but we are running with flush-asap. Let me know if we should make adjustments to that.

# Node A, lower instance (primary)
JSTAT_RECEIVED6: 112358877
JSTAT_RECEIVED4: 322701048
JSTAT_SUCCESS: 433935796
JSTAT_BIB_ENTRIES: 169500
JSTAT_SESSIONS: 171417
JSTAT_SKB_TRUNCATED: 763
JSTAT_UNKNOWN_L4_PROTO: 49
JSTAT_UNKNOWN_ICMP4_TYPE: 1
JSTAT_DOUBLE_ICMP4_ERROR: 2
JSTAT_POOL4_MISMATCH: 15614
JSTAT_BIB4_NOT_FOUND: 273583
JSTAT_SESSION_NOT_FOUND: 603
JSTAT_V4_SYN: 618203
JSTAT_SYN6_EXPECTED: 197485
JSTAT_SYN4_EXPECTED: 13397
JSTAT64_TTL: 20
JSTAT46_TTL: 311
JSTAT_FAILED_ROUTES: 4097
JSTAT_ICMP6ERR_SUCCESS: 20
JSTAT_ICMP4ERR_SUCCESS: 311
JSTAT_ICMPEXT_BIG: 1
JSTAT_JOOLD_EMPTY: 56680639
JSTAT_JOOLD_TIMEOUT: 4
JSTAT_JOOLD_MISSING_ACK: 377197718
JSTAT_JOOLD_AD_ONGOING: 464
JSTAT_JOOLD_FLUSH_ASAP: 121534269
JSTAT_JOOLD_SSS_QUEUED: 433837426
JSTAT_JOOLD_SSS_SENT: 432942110
JSTAT_JOOLD_SSS_RCVD: 894556
JSTAT_JOOLD_SSS_ENOSPC: 906119
JSTAT_JOOLD_PKT_SENT: 121534737
JSTAT_JOOLD_PKT_RCVD: 267727
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 121534735

# Node A, upper instance (secondary)
JSTAT_RECEIVED6: 31239
JSTAT_RECEIVED4: 258163
JSTAT_SUCCESS: 288495
JSTAT_BIB_ENTRIES: 45781
JSTAT_SESSIONS: 47276
JSTAT_SKB_TRUNCATED: 2
JSTAT_POOL4_MISMATCH: 13
JSTAT_BIB4_NOT_FOUND: 91
JSTAT_V4_SYN: 460
JSTAT_SYN6_EXPECTED: 273
JSTAT_SYN4_EXPECTED: 68
JSTAT_JOOLD_EMPTY: 63831
JSTAT_JOOLD_TIMEOUT: 2766
JSTAT_JOOLD_MISSING_ACK: 265569
JSTAT_JOOLD_AD_ONGOING: 123
JSTAT_JOOLD_FLUSH_ASAP: 47156
JSTAT_JOOLD_SSS_QUEUED: 288469
JSTAT_JOOLD_SSS_SENT: 272340
JSTAT_JOOLD_SSS_RCVD: 266024146
JSTAT_JOOLD_SSS_ENOSPC: 19310
JSTAT_JOOLD_PKT_SENT: 50045
JSTAT_JOOLD_PKT_RCVD: 66751438
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 50043

# Node B, lower instance (secondary)
JSTAT_RECEIVED6: 2748
JSTAT_SUCCESS: 2748
JSTAT_BIB_ENTRIES: 35047
JSTAT_SESSIONS: 36533
JSTAT_JOOLD_EMPTY: 43623
JSTAT_JOOLD_TIMEOUT: 2748
JSTAT_JOOLD_SSS_QUEUED: 2748
JSTAT_JOOLD_SSS_SENT: 2748
JSTAT_JOOLD_SSS_RCVD: 815461236
JSTAT_JOOLD_PKT_SENT: 2748
JSTAT_JOOLD_PKT_RCVD: 236777730
JSTAT_JOOLD_ACKS: 2748

# Node B, upper instance (primary)
JSTAT_RECEIVED6: 22783023
JSTAT_RECEIVED4: 110258968
JSTAT_SUCCESS: 132814422
JSTAT_BIB_ENTRIES: 45715
JSTAT_SESSIONS: 47212
JSTAT_SKB_TRUNCATED: 309
JSTAT_UNKNOWN_L4_PROTO: 60
JSTAT_UNKNOWN_ICMP4_TYPE: 20
JSTAT_POOL4_MISMATCH: 17433
JSTAT_BIB4_NOT_FOUND: 24279
JSTAT_SESSION_NOT_FOUND: 3353
JSTAT_V4_SYN: 112661
JSTAT_SYN6_EXPECTED: 67546
JSTAT_SYN4_EXPECTED: 1852
JSTAT_FAILED_ROUTES: 51
JSTAT_ICMPEXT_BIG: 5
JSTAT_JOOLD_EMPTY: 15757770
JSTAT_JOOLD_TIMEOUT: 967
JSTAT_JOOLD_MISSING_ACK: 117064931
JSTAT_JOOLD_FLUSH_ASAP: 33277386
JSTAT_JOOLD_SSS_QUEUED: 132781826
JSTAT_JOOLD_SSS_SENT: 132651181
JSTAT_JOOLD_SSS_RCVD: 410186
JSTAT_JOOLD_SSS_ENOSPC: 130645
JSTAT_JOOLD_PKT_SENT: 33278353
JSTAT_JOOLD_PKT_RCVD: 79288
JSTAT_JOOLD_ACKS: 33278353
ydahhrk commented 11 months ago

Hmmm. Well, some things that jump out at me:

# Node A, lower instance (primary)
JSTAT_JOOLD_SSS_SENT: 432,942,110
JSTAT_JOOLD_SSS_RCVD: 894,556

# Node B, lower instance (secondary)
JSTAT_JOOLD_SSS_SENT: 2,748
JSTAT_JOOLD_SSS_RCVD: 815,461,236

Sent 432m/2k, received 815m/894k? Doesn't add up much.

The other pair is in a similar situation:

# Node B, upper instance (primary)
JSTAT_JOOLD_SSS_SENT: 132,651,181
JSTAT_JOOLD_SSS_RCVD: 410,186

# Node A, upper instance (secondary)
JSTAT_JOOLD_SSS_SENT: 272,340
JSTAT_JOOLD_SSS_RCVD: 266,024,146

Also, Node A, upper instance (secondary) seems to be sending a lot of sessions despite being secondary.

I mean, you did issue an advertise, but that should only account for 123 * 26 = 3198 sessions, and JSTAT_JOOLD_SSS_SENT is 272k...

Some ideas:


Have you tried grepping for sessions that shouldn't be there? If you use --csv, Jool prints each session in a single line:

$ jool se d --tcp --numeric --csv --no-headers
TCP,2001:db8:1::5,1001,64:ff9b::c000:202,1002,192.0.2.2,1000,192.0.2.2,1002,0:03:02.264
TCP,2001:db8:1::5,1001,64:ff9b::c000:205,4000,192.0.2.2,1000,192.0.2.5,4000,0:03:02
TCP,2001:db8::5,1003,64:ff9b::c000:202,1000,192.0.2.2,1002,192.0.2.2,1000,0:03:02.264
TCP,2001:db8::5,2000,64:ff9b::c000:205,2000,192.0.2.2,2000,192.0.2.5,2000,0:03:01.856
TCP,2001:db8::5,2000,64:ff9b::c000:205,4000,192.0.2.2,2000,192.0.2.5,4000,0:03:19.580
TCP,2001:db8::5,2000,64:ff9b::cb00:7118,4000,192.0.2.2,2000,203.0.113.24,4000,0:03:02.128

Try, for example, looking for an upper session in the lower instances:

$ jool -i nat64-wkp-lower se d --tcp --numeric --csv --no-headers | grep -F "XXX.YYY.ZZZ.8/29"

Note, this only prints the TCP table. --udp and --icmp will yield more sessions.
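
For completeness, a quick way to dump all three tables in one go with the same flags (reusing the obfuscated prefix from above):

$ for proto in --tcp --udp --icmp; do jool -i nat64-wkp-lower se d $proto --numeric --csv --no-headers; done | grep -F "XXX.YYY.ZZZ"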


You can also use diff to find sessions that exist in one instance but not the other:

$ ssh NodeA jool -i nat64-wkp-lower se d --tcp --numeric --csv --no-headers | # Get the output from Node A
    cut --complement -d , -f 10 | # Remove the expiration column, because it's noise for the diff
    sort > nodeA-lower.csv # Sort alphabetically, dump to file
$ cat nodeA-lower.csv
TCP,2001:db8:1::5,1001,64:ff9b::c000:202,1002,192.0.2.2,1000,192.0.2.2,1002
TCP,2001:db8:1::5,1001,64:ff9b::c000:205,4000,192.0.2.2,1000,192.0.2.5,4000
TCP,2001:db8::5,1003,64:ff9b::c000:202,1000,192.0.2.2,1002,192.0.2.2,1000
TCP,2001:db8::5,2000,64:ff9b::c000:205,2000,192.0.2.2,2000,192.0.2.5,2000
TCP,2001:db8::5,2000,64:ff9b::c000:205,4000,192.0.2.2,2000,192.0.2.5,4000
TCP,2001:db8::5,2000,64:ff9b::cb00:7118,4000,192.0.2.2,2000,203.0.113.24,4000
$ 
$ ssh NodeB jool -i nat64-wkp-lower se d --tcp --numeric --csv --no-headers |
    cut --complement -d , -f 10 |
    sort > nodeB-lower.csv
$ cat nodeB-lower.csv
TCP,2001:db8:1::5,1001,64:ff9b::c000:202,1002,192.0.2.2,1000,192.0.2.2,1002
TCP,2001:db8:1::5,1001,64:ff9b::c000:205,4000,192.0.2.2,1000,192.0.2.5,4000
TCP,2001:db8::5,1003,64:ff9b::c000:202,1000,192.0.2.2,1002,192.0.2.2,1000
TCP,2001:db8::5,2000,64:ff9b::c000:205,4000,192.0.2.2,2000,192.0.2.5,4000
TCP,2001:db8::5,2000,64:ff9b::cb00:7118,4000,192.0.2.2,2000,203.0.113.24,4000
TCP,2002:db8::5,2000,64:ff9b::c000:205,2000,192.0.2.2,2000,192.0.2.5,2000
$ 
$ diff nodeA-lower.csv nodeB-lower.csv
4d3
< TCP,2001:db8::5,2000,64:ff9b::c000:205,2000,192.0.2.2,2000,192.0.2.5,2000
6a6
> TCP,2002:db8::5,2000,64:ff9b::c000:205,2000,192.0.2.2,2000,192.0.2.5,2000

I imagine you'd get lots of differences because of the ss-flush-asap dropped sessions, but feel free to disable it and restart if you're no longer getting memory leaks.


Also, Node A, upper instance (secondary) seems to be sending a lot of sessions despite being secondary.

Try enabling debug on this instance. Is it translating packets? Why? Which ones?
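
(If you go this route: debug logging is a per-instance global flag and its output lands in the kernel log; a sketch:)

# ip netns exec sa-jool-ns jool -i nat64-wkp-upper global update logging-debug true
# dmesg -wT | grep -i jool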

suffieldacademy commented 11 months ago

OK, I think I figured out the packet disparity. I had a loop in my topology on the backend network that the session sync runs over. It was a redundant link, but I didn't realize that multicast packets were arriving twice (unicast packets were correctly being filtered). So that's why the send/receive counters were so far off from each other (roughly double).

I've taken out one of the links and now it looks like the send/receive rates are more closely matched. (~1 million sent == ~1 million received).

Other good news: when I unloaded the module to reset some counters, I saw that there were no leaks:

[Wed Sep 20 14:08:26 2023] NAT64 Jool v4.1.10.4 module removed.
[Wed Sep 20 14:08:26 2023] SIIT Jool v4.1.10.4 module removed.
[Wed Sep 20 14:08:26 2023] Jool: Memory leaks:
[Wed Sep 20 14:08:26 2023] Jool: None.
[Wed Sep 20 14:08:26 2023] Jool: Core Jool v4.1.10.4 module removed.

I think the inflated values from the secondary are due to the failover; every time I stop the daemon on one host the other immediately gets the traffic, so the counters can get a boost during the time when I'm restarting the other box.

Is there a way to clear the counters from the command-line? If so, I could zero them out once everything is stable which might make trend reading a little easier.

ydahhrk commented 11 months ago

Is there a way to clear the counters from the command-line? If so, I could zero them out once everything is stable which might make trend reading a little easier.

Hmmm... no. It seems the API doesn't want me to do that.

Try printing the stats with --csv (and --all to align them) before and after, load them into a spreadsheet, and subtract the corresponding values when you need to analyze them.
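
If a spreadsheet feels heavy, an awk pass over two snapshots also works; a sketch, assuming the --csv output is one NAME,VALUE row per counter:

$ sudo ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display --all --csv > before.csv
$ sudo ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display --all --csv > after.csv
$ awk -F, 'NR==FNR { before[$1] = $2; next } { print $1": " $2 - before[$1] }' before.csv after.csv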

suffieldacademy commented 11 months ago

No worries, sorry for the delay, been very busy here. However, Jool has been stable (memory usage is holding steady). I took a peek at the stats and things looked somewhat reasonable (send/receive lined up within an order of magnitude). Thus, I'm restarting with ASAP turned off so it's more of an active/passive setup the way we intended. I'm going to watch the counters on those.

I have already seen a few "too many sessions" errors just while writing this message, so I'll let it run for a bit and we can see how prevalent they are under normal load.

suffieldacademy commented 10 months ago

After running for a few days with ASAP turned off, memory still looks good. I am getting a lot of the "Too many sessions deferred!" messages, often every 1-5 minutes. I'm even seeing them at "slow" times (middle of the night), which is a little surprising.

Attached are stats from the machines so we can get a feel for the drops. I can dig into the sessions if you want but I wanted to start with the counters. If the simple fact is that we're dropping too many sessions then going to the tables probably won't help.

Just as a reminder, there are two nodes (A/B) and each node is running two instances (lower/upper). A is the primary for lower, B is the primary for upper, and they are secondaries for the opposite instances.

# node a (primary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 280413350
JSTAT_RECEIVED4: 784028290
JSTAT_SUCCESS: 1061718873
JSTAT_BIB_ENTRIES: 32144
JSTAT_SESSIONS: 32231
JSTAT_SKB_TRUNCATED: 126
JSTAT_UNKNOWN_L4_PROTO: 239
JSTAT_UNKNOWN_ICMP4_TYPE: 17
JSTAT_UNKNOWN_PROTO_INNER: 2
JSTAT_POOL4_MISMATCH: 50714
JSTAT_BIB4_NOT_FOUND: 683468
JSTAT_SESSION_NOT_FOUND: 3304
JSTAT_V4_SYN: 1436978
JSTAT_SYN6_EXPECTED: 514116
JSTAT_SYN4_EXPECTED: 14593
JSTAT64_TTL: 115
JSTAT46_ICMP_CSUM: 1
JSTAT46_TTL: 995
JSTAT_FAILED_ROUTES: 18099
JSTAT_ICMP6ERR_SUCCESS: 115
JSTAT_ICMP4ERR_SUCCESS: 995
JSTAT_JOOLD_EMPTY: 6035325
JSTAT_JOOLD_TIMEOUT: 4
JSTAT_JOOLD_MISSING_ACK: 370972135
JSTAT_JOOLD_AD_ONGOING: 628
JSTAT_JOOLD_PKT_FULL: 40749942
JSTAT_JOOLD_QUEUING: 684619032
JSTAT_JOOLD_SSS_QUEUED: 1061518827
JSTAT_JOOLD_SSS_SENT: 1059514818
JSTAT_JOOLD_SSS_RCVD: 1134508
JSTAT_JOOLD_SSS_ENOSPC: 2020195
JSTAT_JOOLD_PKT_SENT: 40750574
JSTAT_JOOLD_PKT_RCVD: 283357
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 40750574
# node a (secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 59115
JSTAT_RECEIVED4: 135169
JSTAT_SUCCESS: 193921
JSTAT_BIB_ENTRIES: 8827
JSTAT_SESSIONS: 8878
JSTAT_POOL4_MISMATCH: 16
JSTAT_BIB4_NOT_FOUND: 37
JSTAT_V4_SYN: 130
JSTAT_SYN6_EXPECTED: 129
JSTAT_SYN4_EXPECTED: 50
JSTAT_FAILED_ROUTES: 1
JSTAT_JOOLD_EMPTY: 116442
JSTAT_JOOLD_TIMEOUT: 7313
JSTAT_JOOLD_MISSING_ACK: 51379
JSTAT_JOOLD_AD_ONGOING: 201
JSTAT_JOOLD_PKT_FULL: 7160
JSTAT_JOOLD_QUEUING: 133754
JSTAT_JOOLD_SSS_QUEUED: 193910
JSTAT_JOOLD_SSS_SENT: 198708
JSTAT_JOOLD_SSS_RCVD: 364605831
JSTAT_JOOLD_SSS_ENOSPC: 394
JSTAT_JOOLD_PKT_SENT: 14674
JSTAT_JOOLD_PKT_RCVD: 84090768
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 14674
# node b (secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 7311
JSTAT_SUCCESS: 7311
JSTAT_BIB_ENTRIES: 32644
JSTAT_SESSIONS: 32738
JSTAT_JOOLD_EMPTY: 114888
JSTAT_JOOLD_TIMEOUT: 7311
JSTAT_JOOLD_SSS_QUEUED: 7311
JSTAT_JOOLD_SSS_SENT: 7311
JSTAT_JOOLD_SSS_RCVD: 1055034500
JSTAT_JOOLD_PKT_SENT: 7311
JSTAT_JOOLD_PKT_RCVD: 40578250
JSTAT_JOOLD_ACKS: 7311
# node b (primary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 55184324
JSTAT_RECEIVED4: 311648811
JSTAT_SUCCESS: 366416915
JSTAT_BIB_ENTRIES: 8766
JSTAT_SESSIONS: 8815
JSTAT_SKB_TRUNCATED: 24
JSTAT_UNKNOWN_L4_PROTO: 202
JSTAT_POOL4_MISMATCH: 47936
JSTAT_BIB4_NOT_FOUND: 45382
JSTAT_SESSION_NOT_FOUND: 5035
JSTAT_V4_SYN: 275800
JSTAT_SYN6_EXPECTED: 38545
JSTAT_SYN4_EXPECTED: 2464
JSTAT_FAILED_ROUTES: 831
JSTAT_ICMPEXT_BIG: 1
JSTAT_JOOLD_EMPTY: 40365530
JSTAT_JOOLD_TIMEOUT: 422
JSTAT_JOOLD_MISSING_ACK: 326111867
JSTAT_JOOLD_FLUSH_ASAP: 84257643
JSTAT_JOOLD_SSS_QUEUED: 366369820
JSTAT_JOOLD_SSS_SENT: 365423209
JSTAT_JOOLD_SSS_RCVD: 114673
JSTAT_JOOLD_SSS_ENOSPC: 946611
JSTAT_JOOLD_PKT_SENT: 84258065
JSTAT_JOOLD_PKT_RCVD: 11414
JSTAT_JOOLD_ACKS: 84258065
gehoern commented 10 months ago

On our side the memory leak is still present. We have the latest 4.1.10.4 online and are still losing slab pages (slabtop):

1052128 1052128 100% 0.25K 32879 32 263032K kmalloc-256

Those pages are at 0 when jool is not started. As soon as we start jool and use it heavily, kmalloc-256 increases but never drops (it also does not drop if we remove jool.ko / jool_common.ko or reconfigure it).

1110464 1110464 100% 0.25K 34702 32 277616K kmalloc-256

(Both slabtop outputs were taken 3 minutes apart, so our box loses about 14 MB of memory every 3 minutes when jool is in use!)

ydahhrk commented 10 months ago

I can't debug this right now; I'm going to have to postpone this to November.

(both outputs from slabtop are taken with 3min distance -> ergo our box looses every 3 mins 14MB of memory if we use jool!)

Does it still leak if you disable joold?

gehoern commented 10 months ago

There is no joold running at all. We just have jool instances running in the kernel.

jool instance display
+--------------------+-----------------+-----------+
|          Namespace |            Name | Framework |
+--------------------+-----------------+-----------+
|           a388ea40 |         default | netfilter |
+--------------------+-----------------+-----------+

We tested this back and forth. If we have no jool instance in the kernel, nothing is leaked. The amount of leaked memory depends on the number of address translations (more NAT64, more memory leaked), and the memory is NOT freed if we kill the session or the kernel module.

ydahhrk commented 10 months ago

There is no joold running at all.

Then this is a different problem. Please recompile with -DJKMEMLEAK and post the report.

You might want to create a new issue, because this has become a joold thread.

ydahhrk commented 9 months ago

@suffieldacademy Sorry; I'm still crunching. At this rate, I don't think I'll be able to look into this until mid December.

I take it it's still dropping sessions?

suffieldacademy commented 9 months ago

Yes, essentially unchanged since Oct 14. Still getting "Too many sessions deferred!" messages, including over this holiday weekend with very low traffic amounts. Difficult to say if it's causing issues; no complaints from users but the connections might be retried?

ydahhrk commented 8 months ago

Observations by stat:

Oddities by stat:

(I'd expect all the flush-asaps to be consistent, and at this point I'd prefer them all to be disabled.)

I'd expect both primaries to be queuing. The secondaries should not be queuing, unless one of the primaries has gone offline.

This might just be a consequence of the previous oddity; B is flushing quickly, which means queuing is discouraged.


Observations by instance:

B lower (secondary)

This instance received 7311 packets from IPv6, they were all translated successfully, and they were all joold'd queuelessly because they all happened more than ss-flush-deadline seconds away from each other. The counters are consistent.

It didn't receive any IPv4 packets; not even as responses for the IPv6 packets. It seems the Internet is correctly routing all lower packets to A, while the LAN isn't. (Though stranded lower packets are few.)

This is the only instance that isn't printing Too many sessions deferred!.

A upper (secondary)

The counters suggest this instance is receiving traffic that belongs to its primary, over both IP versions. It also had to drop some sessions.

However, the advertisement counter suggests there was a period in which the primary was inactive, and I suppose that explains both quirks.

B upper (primary) / A lower (primary)

Other than ENOSPC, I don't see anything suspicious going on.

But ENOSPC is consistently small in the grand scheme of things. Maybe there's not much in the way of bugs left, and all that's left to do is adjust each field for its intended purpose.

ydahhrk commented 8 months ago

Still getting "Too many sessions deferred!" messages, including over this holiday weekend with very low traffic amounts.

Maybe someone left a cron job that suddenly spikes the traffic at midnight, or something?

I suppose I could add stat counters to keep track of the queue length and ACK RTT. You could decide your ss-capacity and ss-flush-deadline based on these numbers.

Difficult to say if it's causing issues; no complaints from users but the connections might be retried?

Sure, they might be recovering on their own. Also, many sessions joold sends are redundant because they need to be updated very shortly afterwards.

suffieldacademy commented 7 months ago

Sorry, I've been very busy these last couple of weeks. I'm going to dump the current state of the counters so we have recent numbers and can see whether anything has changed. After this I'll double-check all the settings you mentioned (ss-capacity and ss-flush-deadline) and make some tweaks.

# (node a primary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 3610658425
JSTAT_RECEIVED4: 9966468183
JSTAT_SUCCESS: 13556576935
JSTAT_BIB_ENTRIES: 29270
JSTAT_SESSIONS: 29294
JSTAT_SKB_TRUNCATED: 2762
JSTAT_HDR4: 1
JSTAT_UNKNOWN_L4_PROTO: 528
JSTAT_UNKNOWN_ICMP4_TYPE: 96
JSTAT_UNKNOWN_PROTO_INNER: 2
JSTAT_POOL4_MISMATCH: 613345
JSTAT_BIB4_NOT_FOUND: 4894290
JSTAT_SESSION_NOT_FOUND: 21558
JSTAT_V4_SYN: 8859089
JSTAT_SYN6_EXPECTED: 5657127
JSTAT_SYN4_EXPECTED: 61559
JSTAT64_TTL: 11491
JSTAT46_TTL: 11983
JSTAT_FAILED_ROUTES: 415318
JSTAT_ICMP6ERR_SUCCESS: 11491
JSTAT_ICMP4ERR_SUCCESS: 11983
JSTAT_ICMPEXT_BIG: 524
JSTAT_JOOLD_EMPTY: 76781732
JSTAT_JOOLD_TIMEOUT: 1
JSTAT_JOOLD_MISSING_ACK: 4581630543
JSTAT_JOOLD_AD_ONGOING: 228
JSTAT_JOOLD_PKT_FULL: 520211402
JSTAT_JOOLD_QUEUING: 8899259103
JSTAT_JOOLD_SSS_QUEUED: 13556123855
JSTAT_JOOLD_SSS_SENT: 13525502358
JSTAT_JOOLD_SSS_RCVD: 104772
JSTAT_JOOLD_SSS_ENOSPC: 30627365
JSTAT_JOOLD_PKT_SENT: 520211631
JSTAT_JOOLD_PKT_RCVD: 104772
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 520211631
# (node a secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 392104
JSTAT_RECEIVED4: 1917566
JSTAT_SUCCESS: 2305977
JSTAT_BIB_ENTRIES: 5576
JSTAT_SESSIONS: 5583
JSTAT_UNKNOWN_L4_PROTO: 1
JSTAT_POOL4_MISMATCH: 610
JSTAT_BIB4_NOT_FOUND: 122
JSTAT_V4_SYN: 2925
JSTAT_SYN6_EXPECTED: 22
JSTAT_SYN4_EXPECTED: 13
JSTAT_JOOLD_EMPTY: 1660972
JSTAT_JOOLD_TIMEOUT: 109670
JSTAT_JOOLD_MISSING_ACK: 750747
JSTAT_JOOLD_AD_ONGOING: 38
JSTAT_JOOLD_PKT_FULL: 84037
JSTAT_JOOLD_QUEUING: 1441682
JSTAT_JOOLD_SSS_QUEUED: 2305880
JSTAT_JOOLD_SSS_SENT: 2304131
JSTAT_JOOLD_SSS_RCVD: 5992950272
JSTAT_JOOLD_SSS_ENOSPC: 2709
JSTAT_JOOLD_PKT_SENT: 193745
JSTAT_JOOLD_PKT_RCVD: 1471006661
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 193743
# (node b secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 105129
JSTAT_SUCCESS: 105129
JSTAT_BIB_ENTRIES: 29411
JSTAT_SESSIONS: 29441
JSTAT_JOOLD_EMPTY: 1652625
JSTAT_JOOLD_TIMEOUT: 105129
JSTAT_JOOLD_SSS_QUEUED: 105129
JSTAT_JOOLD_SSS_SENT: 105129
JSTAT_JOOLD_SSS_RCVD: 13491263006
JSTAT_JOOLD_PKT_SENT: 105129
JSTAT_JOOLD_PKT_RCVD: 518894731
JSTAT_JOOLD_ACKS: 105128
# (node b primary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 999272889
JSTAT_RECEIVED4: 5024326643
JSTAT_SUCCESS: 6019325034
JSTAT_BIB_ENTRIES: 5500
JSTAT_SESSIONS: 5505
JSTAT_SKB_TRUNCATED: 527
JSTAT_HDR4: 1
JSTAT_UNKNOWN_L4_PROTO: 595
JSTAT_UNKNOWN_ICMP4_TYPE: 18
JSTAT_UNKNOWN_PROTO_INNER: 1
JSTAT_POOL4_MISMATCH: 514425
JSTAT_BIB4_NOT_FOUND: 430017
JSTAT_SESSION_NOT_FOUND: 38569
JSTAT_V4_SYN: 2814391
JSTAT_SYN6_EXPECTED: 439203
JSTAT_SYN4_EXPECTED: 23852
JSTAT64_TTL: 16
JSTAT_FAILED_ROUTES: 12865
JSTAT_ICMP6ERR_SUCCESS: 16
JSTAT_ICMPEXT_BIG: 18
JSTAT_JOOLD_EMPTY: 667072784
JSTAT_JOOLD_TIMEOUT: 9749
JSTAT_JOOLD_MISSING_ACK: 5353347106
JSTAT_JOOLD_FLUSH_ASAP: 1475364608
JSTAT_JOOLD_SSS_QUEUED: 6018872393
JSTAT_JOOLD_SSS_SENT: 6010419681
JSTAT_JOOLD_SSS_RCVD: 2287867
JSTAT_JOOLD_SSS_ENOSPC: 8452712
JSTAT_JOOLD_PKT_SENT: 1475374357
JSTAT_JOOLD_PKT_RCVD: 192796
JSTAT_JOOLD_ACKS: 1475374357
suffieldacademy commented 7 months ago

OK, I have reconfigured the instances with the following

It's not possible to restart both instances simultaneously without causing an outage, so I restarted node b followed by node a. Below are the stats for both shortly after both instances were brought back online so we have a basis for comparison:

# (node a primary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 14720
JSTAT_RECEIVED4: 34378
JSTAT_SUCCESS: 48210
JSTAT_BIB_ENTRIES: 5829
JSTAT_SESSIONS: 5857
JSTAT_POOL4_MISMATCH: 1
JSTAT_BIB4_NOT_FOUND: 34
JSTAT_V4_SYN: 9
JSTAT_SYN6_EXPECTED: 602
JSTAT_SYN4_EXPECTED: 229
JSTAT_FAILED_ROUTES: 13
JSTAT_JOOLD_EMPTY: 460
JSTAT_JOOLD_TIMEOUT: 2
JSTAT_JOOLD_MISSING_ACK: 10646
JSTAT_JOOLD_PKT_FULL: 1854
JSTAT_JOOLD_QUEUING: 37145
JSTAT_JOOLD_SSS_QUEUED: 48222
JSTAT_JOOLD_SSS_SENT: 48206
JSTAT_JOOLD_SSS_RCVD: 133564
JSTAT_JOOLD_PKT_SENT: 1856
JSTAT_JOOLD_PKT_RCVD: 5138
JSTAT_JOOLD_ACKS: 1856
# (node a secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 2
JSTAT_SUCCESS: 2
JSTAT_BIB_ENTRIES: 2818
JSTAT_SESSIONS: 2831
JSTAT_JOOLD_EMPTY: 31
JSTAT_JOOLD_TIMEOUT: 2
JSTAT_JOOLD_SSS_QUEUED: 2
JSTAT_JOOLD_SSS_SENT: 2
JSTAT_JOOLD_SSS_RCVD: 223184
JSTAT_JOOLD_PKT_SENT: 2
JSTAT_JOOLD_PKT_RCVD: 8584
JSTAT_JOOLD_ACKS: 2
# (node b secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 64739
JSTAT_RECEIVED4: 115575
JSTAT_SUCCESS: 178336
JSTAT_BIB_ENTRIES: 9661
JSTAT_SESSIONS: 9703
JSTAT_POOL4_MISMATCH: 12
JSTAT_BIB4_NOT_FOUND: 41
JSTAT_V4_SYN: 23
JSTAT_SYN6_EXPECTED: 1293
JSTAT_SYN4_EXPECTED: 576
JSTAT64_TTL: 1
JSTAT_FAILED_ROUTES: 32
JSTAT_ICMP6ERR_SUCCESS: 1
JSTAT_JOOLD_EMPTY: 1913
JSTAT_JOOLD_TIMEOUT: 5
JSTAT_JOOLD_MISSING_ACK: 33678
JSTAT_JOOLD_AD_ONGOING: 332
JSTAT_JOOLD_PKT_FULL: 6857
JSTAT_JOOLD_QUEUING: 142827
JSTAT_JOOLD_SSS_QUEUED: 178331
JSTAT_JOOLD_SSS_SENT: 186896
JSTAT_JOOLD_SSS_RCVD: 530090
JSTAT_JOOLD_PKT_SENT: 7194
JSTAT_JOOLD_PKT_RCVD: 20390
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 7194
# (node b primary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 209411
JSTAT_RECEIVED4: 230710
JSTAT_SUCCESS: 439664
JSTAT_BIB_ENTRIES: 3005
JSTAT_SESSIONS: 3017
JSTAT_POOL4_MISMATCH: 16
JSTAT_BIB4_NOT_FOUND: 2
JSTAT_V4_SYN: 66
JSTAT_SYN6_EXPECTED: 223
JSTAT_SYN4_EXPECTED: 150
JSTAT_JOOLD_EMPTY: 1946
JSTAT_JOOLD_TIMEOUT: 3
JSTAT_JOOLD_MISSING_ACK: 140810
JSTAT_JOOLD_AD_ONGOING: 114
JSTAT_JOOLD_PKT_FULL: 16909
JSTAT_JOOLD_QUEUING: 296995
JSTAT_JOOLD_SSS_QUEUED: 439664
JSTAT_JOOLD_SSS_SENT: 442580
JSTAT_JOOLD_SSS_RCVD: 238860
JSTAT_JOOLD_PKT_SENT: 17026
JSTAT_JOOLD_PKT_RCVD: 9190
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 17026
suffieldacademy commented 7 months ago

A quick check-in after running this for about a week. Anecdotally, the number of "Too many sessions deferred!" messages has diminished; we're now seeing hours between some of the messages, whereas before it was several times per hour. We still have some bursts that are closer together, but overall the rate is dropping.

Assuming this improvement is from tweaking the parameters, should we continue to change them? I'm guessing we would need to continue increasing ss-capacity as lowering the deadline probably won't continue to help. If so, I can make it much bigger (10x) just to see if it helps. Since there is a flush deadline, having a large capacity shouldn't affect anything other than RAM usage, right (even though the buffer is big, it will still be flushed on a schedule)?

Below I'll post our current counters just for reference and so we can get an absolute sense of how things are going.

# (node a primary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 1069308022
JSTAT_RECEIVED4: 2520677922
JSTAT_SUCCESS: 3583041960
JSTAT_BIB_ENTRIES: 42870
JSTAT_SESSIONS: 42945
JSTAT_SKB_TRUNCATED: 43715
JSTAT_UNKNOWN_L4_PROTO: 105
JSTAT_UNKNOWN_ICMP4_TYPE: 8
JSTAT_POOL4_MISMATCH: 107970
JSTAT_BIB4_NOT_FOUND: 1751874
JSTAT_SESSION_NOT_FOUND: 10338
JSTAT_V4_SYN: 3347143
JSTAT_SYN6_EXPECTED: 1484615
JSTAT_SYN4_EXPECTED: 23260
JSTAT64_TTL: 1376
JSTAT46_TTL: 3796
JSTAT_FAILED_ROUTES: 169680
JSTAT_ICMP6ERR_SUCCESS: 1376
JSTAT_ICMP4ERR_SUCCESS: 3796
JSTAT_ICMPEXT_BIG: 104
JSTAT_JOOLD_EMPTY: 19626025
JSTAT_JOOLD_TIMEOUT: 2
JSTAT_JOOLD_MISSING_ACK: 1215434001
JSTAT_JOOLD_PKT_FULL: 137771498
JSTAT_JOOLD_QUEUING: 2347821764
JSTAT_JOOLD_SSS_QUEUED: 3582591136
JSTAT_JOOLD_SSS_SENT: 3582058950
JSTAT_JOOLD_SSS_RCVD: 153259
JSTAT_JOOLD_SSS_ENOSPC: 532170
JSTAT_JOOLD_PKT_SENT: 137771500
JSTAT_JOOLD_PKT_RCVD: 24833
JSTAT_JOOLD_ACKS: 137771500
# (node a secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 25717
JSTAT_SUCCESS: 25717
JSTAT_BIB_ENTRIES: 6563
JSTAT_SESSIONS: 6576
JSTAT_JOOLD_EMPTY: 312554
JSTAT_JOOLD_TIMEOUT: 23184
JSTAT_JOOLD_MISSING_ACK: 61
JSTAT_JOOLD_QUEUING: 3756
JSTAT_JOOLD_SSS_QUEUED: 25717
JSTAT_JOOLD_SSS_SENT: 25717
JSTAT_JOOLD_SSS_RCVD: 1778029696
JSTAT_JOOLD_PKT_SENT: 23184
JSTAT_JOOLD_PKT_RCVD: 68426186
JSTAT_JOOLD_ACKS: 23184
# (node b secondary) ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display
JSTAT_RECEIVED6: 84498
JSTAT_RECEIVED4: 115575
JSTAT_SUCCESS: 198095
JSTAT_BIB_ENTRIES: 43534
JSTAT_SESSIONS: 43616
JSTAT_POOL4_MISMATCH: 12
JSTAT_BIB4_NOT_FOUND: 41
JSTAT_V4_SYN: 23
JSTAT_SYN6_EXPECTED: 1293
JSTAT_SYN4_EXPECTED: 576
JSTAT64_TTL: 1
JSTAT_FAILED_ROUTES: 32
JSTAT_ICMP6ERR_SUCCESS: 1
JSTAT_JOOLD_EMPTY: 312296
JSTAT_JOOLD_TIMEOUT: 19764
JSTAT_JOOLD_MISSING_ACK: 33678
JSTAT_JOOLD_AD_ONGOING: 332
JSTAT_JOOLD_PKT_FULL: 6857
JSTAT_JOOLD_QUEUING: 142827
JSTAT_JOOLD_SSS_QUEUED: 198090
JSTAT_JOOLD_SSS_SENT: 206655
JSTAT_JOOLD_SSS_RCVD: 3568568564
JSTAT_JOOLD_PKT_SENT: 26953
JSTAT_JOOLD_PKT_RCVD: 137252639
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 26953
# (node b primary) ip netns exec sa-jool-ns jool -i nat64-wkp-upper stats display
JSTAT_RECEIVED6: 311602953
JSTAT_RECEIVED4: 1474760933
JSTAT_SUCCESS: 1785639579
JSTAT_BIB_ENTRIES: 6460
JSTAT_SESSIONS: 6472
JSTAT_SKB_TRUNCATED: 5022
JSTAT_UNKNOWN_L4_PROTO: 104
JSTAT_POOL4_MISMATCH: 89860
JSTAT_BIB4_NOT_FOUND: 80313
JSTAT_SESSION_NOT_FOUND: 13837
JSTAT_V4_SYN: 435972
JSTAT_SYN6_EXPECTED: 88170
JSTAT_SYN4_EXPECTED: 6203
JSTAT64_TTL: 9
JSTAT_FAILED_ROUTES: 4784
JSTAT_ICMP6ERR_SUCCESS: 9
JSTAT_ICMPEXT_BIG: 33
JSTAT_JOOLD_EMPTY: 7269038
JSTAT_JOOLD_TIMEOUT: 85005
JSTAT_JOOLD_MISSING_ACK: 653906446
JSTAT_JOOLD_AD_ONGOING: 114
JSTAT_JOOLD_PKT_FULL: 68619571
JSTAT_JOOLD_QUEUING: 1124598375
JSTAT_JOOLD_SSS_QUEUED: 1785483148
JSTAT_JOOLD_SSS_SENT: 1785267361
JSTAT_JOOLD_SSS_RCVD: 264485
JSTAT_JOOLD_SSS_ENOSPC: 218723
JSTAT_JOOLD_PKT_SENT: 68704690
JSTAT_JOOLD_PKT_RCVD: 32293
JSTAT_JOOLD_ADS: 1
JSTAT_JOOLD_ACKS: 68704690
ydahhrk commented 7 months ago

Analysis attached. (Tip: Hide columns B, C, G and H.)

I don't see anything alarming anymore, though I still don't quite like the numbers. Excluding the sessions it dropped because of ss-capacity, a-primary sent 3,582,010,744 sessions, of which 3,568,038,474 made it to b-secondary. That's 13,972,270 (0.39%) lost.
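
(To double-check the arithmetic quoted above with plain shell, using the two totals from this comment:)

sent=3582010744; rcvd=3568038474
echo "lost: $((sent - rcvd))"                                                          # 13972270
awk -v s="$sent" -v r="$rcvd" 'BEGIN { printf "loss: %.2f%%\n", (s - r) * 100 / s }'   # 0.39%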

But no Netlink packets are getting dropped, so it seems they're getting lost in the network.

Assuming this improvement is from tweaking the parameters, should we continue to change them?

Well... ideally yes, though the ss-capacity session drop count (532,170) seems somewhat small in comparison to the (presumed) network drop count (13,972,270).

I'm guessing we would need to continue increasing ss-capacity as lowering the deadline probably won't continue to help.

Hmm... HMMMM. 1 second is still a lot in computer time. I'm trying to think how we might graph the average RTT with those ACKs, but since there are a lot every second, the data set of a single day would probably be gargantuan.

ss-flush-deadline mainly exists to prevent Jool from saturating the Netlink channel. The default is comical; it aims for one Netlink packet at a time. I don't know what the limit is, but you can probably easily fit more in there without breaking a sweat.

The more data you can send at a time, the less you have to queue. You can dramatically reduce your needed capacity by sending more greedily.

I mean... your current capacity is 2048. Since you can fit 26 sessions per packet... that's a 79-packet queue. You have 79 packets waiting for their turn, and you can only send one at a time, each having to wait for the previous one's ACK. And while that happens, more sessions are arriving.

And sure, since those traffic spikes seem to be rare, a mega queue will probably be able to retain the sessions long enough for the queue to gradually clear naturally. But it's one of those things where increasing capacity will net you linear improvements, while decreasing deadline could be an order of magnitude better.

The way to know if you're saturating the Netlink channel is by checking if the ACKs are getting dropped. That would be JSTAT_JOOLD_PKT_SENT - JSTAT_JOOLD_ACKS. I'd say keep reducing ss-flush-deadline until the subtraction stops being zero.
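
A quick way to compute that subtraction for one instance (a sketch based on the same stats command used elsewhere in this thread; adjust the namespace and instance name to your setup):

ip netns exec sa-jool-ns jool -i nat64-wkp-lower stats display | \
  awk -F': ' '/^JSTAT_JOOLD_PKT_SENT|^JSTAT_JOOLD_ACKS/ { v[$1] = $2 }
    END { print "unacked packets:", v["JSTAT_JOOLD_PKT_SENT"] - v["JSTAT_JOOLD_ACKS"] }'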

I can make it much bigger (10x) just to see if it helps.

I'm trying to think of a reason why you would not want to do this, but I'm coming up empty-handed.

Outside of the unit tests and cleanup, Jool never traverses more than ss-max-sessions-per-packet nodes of that list at a time, so performance shouldn't tank. It seems RAM is your limit, but each node costs 112 bytes (on my PC)... a big fat nothing.

Since there is a flush deadline, having a large capacity shouldn't affect anything other than RAM usage, right (even though the buffer is big, it will still be flushed on a schedule)?

Right... I think.

I just realized the word "flush" is misleading. I don't see anything in your text that suggests you've been misled, but just to be clear:

It's not a strict, full "flush." It only "flushes" the number of sessions it can send in one packet (in your case, 26). And it doesn't drop them; it forces a single packet out.

Sessions that arrive while the queue is full are dropped. So in times of stress, the queue fills up, and new sessions are dropped. The deadline "flushes" 26 sessions. So 26 new sessions make it in, and the rest are dropped again.

~If you can think of a better way to implement this, let me know.~

ydahhrk commented 7 months ago

Well... ideally yes (...) I'd say keep reducing ss-flush-deadline until the subtraction stops being zero.

(Also, remember that ss-flush-deadline and ss-capacity are runtime switches, and don't have to be synchronized across instances. You can experiment without having to restart nor re-advertise.)
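
(As a concrete sketch, assuming the standard "jool global update" syntax and the namespace/instance names from this thread, the runtime tweaks would look something like:)

ip netns exec sa-jool-ns jool -i nat64-wkp-lower global update ss-flush-deadline 100
ip netns exec sa-jool-ns jool -i nat64-wkp-lower global update ss-capacity 16384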

suffieldacademy commented 7 months ago

Thank you so much for that spreadsheet; it helped make a lot of things clearer. Additionally, your description of ss-flush-deadline was very helpful; I was imagining it more as a "if a full session packet hasn't been sent within this timeframe, send one anyway." I see now that it's more of an unacked packet timer, so making it lower does make sense.

I've lowered ss-flush-deadline to 100ms to see what that does. The machines are directly connected, so network RTT should be very low (essentially just processing time on the nodes); I think that's a reasonable starting point, and we could even go lower. I'll watch the numbers for a bit and see how they look. My students are away for the weekend, so I'll wait until they're back so we have normal traffic levels.

ydahhrk commented 7 months ago

Additionally, your description of ss-flush-deadline was very helpful; I was imagining it more as a "if a full session packet hasn't been sent within this timeframe, send one anyway." I see now that it's more of an unacked packet timer, so making it lower does make sense.

Wait... what?

... ...

[Scratches head] Ok, ss-flush-deadline is both of those things. It's just that the latter is more important right now, yeah.

Look, full disclosure. Let's let the code talk. That function returns true if the packet should be sent, false if the buffering should continue.

As you can see, the flush_deadline validation is pretty high up and short-circuits, skipping the rest of the function. This can be read as "if more than ss-flush-deadline milliseconds have elapsed since the last time we sent a packet, then I don't care if we haven't received an ACK, I don't care about ADs (advertisements), don't care about flush-asap, don't care if the queue is small. Send it out NOW."

So yes; it can be used to skip slow ACKs (assuming they're really being slow, IDK), but in essence it's more of a general stagnation-avoidance policy.

Sorry for the confusion.

I'll watch the numbers for a bit and see how they look. My students are away for the weekend, so I'll wait until they're back so we have normal traffic levels.

ACK.

suffieldacademy commented 6 months ago

OK, we've been back for a bit so I gathered some more stats and pushed them into a spreadsheet. The loss rates are less than 1%, but don't look a whole lot better than the last time (this is with a 100ms flush deadline).

Everything is still being ACKed, so I'll lower the flush deadline again to see if there are any changes.

sa-joold-stats.ods

ydahhrk commented 6 months ago

Did you also increase ss-capacity?

suffieldacademy commented 6 months ago

Yes; I simultaneously decreased ss-flush-deadline to 20ms and increased ss-capacity to 16384. The latest numbers are attached for review. At first glance the overall loss looks about the same, though "packets queued but not sent" has dropped to almost negligible levels, which is good. Also, it looks like ENOSPC is zero for both, so that's good, right?

I've lost track a little of what the different numbers mean. Should I continue to lower the deadline and increase the buffer, or, now that ENOSPC is zero, am I basically done? Are all the other lost packets just due to network losses of some sort?

sa-joold-stats-20240225.ods