Closed suffieldacademy closed 2 months ago
Also, it looks like ENOSPC is zero for both, so that's good, right?
Yes
Should I continue to lower the deadline and increase buffer, or now that ENOSPC is zero am I basically done?
The capacity and deadline only target ENOSPC. You're good.
Are all the other lost packets just due to network losses of some sort?
Probably; I don't see any other explanation.
If you want to analyze it, I just uploaded one more test commit: Stats on the userspace daemons.
This one is simpler: fcc5ccc4be2fbca697b2a4a2e447dc9206b83f44.
If you want. At this point I feel like the bug is fixed already.
Sorry, was on vacation and then busy. Trying out the latest commit so we can put this issue to bed. Unfortunately, joold doesn't seem to want to start with this latest version. I get a log message that it's trying to open a "port" that is the netsocket file?
2024-03-30T22:46:55.184489-04:00 mario joold: Setting up statsocket (port /run/sa-joold-nat64-wkp-lower/modsocket.json)...
Maybe an off-by-one error in the arg processing?
	if (argc < 3) {
		syslog(LOG_INFO, "statsocket port unavailable; skipping statsocket.");
		return 0;
	}
	error = create_socket(argv[2], &sk);
	if (error)
		return error;
I'm exiting with a -8 code. My guess is I need to pass a third arg to joold but I couldn't find any documentation on that. Any hints on what to pass (and if the code is correct)? I'm not much of a C hacker so these are my only guesses...
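The log line above suggests that argv[2] is the modsocket file, so the statsocket port would have to be the argument after it. A minimal sketch of a consistent count check and index, assuming that positional layout; the enum names and the statsocket_arg() helper are hypothetical and are not joold's actual code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed positional layout (inferred from the log line, not confirmed):
 *   argv[0] = program name
 *   argv[1] = netsocket file
 *   argv[2] = modsocket file
 *   argv[3] = statsocket port (optional)
 */
enum { ARG_NETSOCKET = 1, ARG_MODSOCKET = 2, ARG_STATSOCKET = 3 };

/* Hypothetical helper: returns the statsocket argument, or NULL if absent. */
static const char *statsocket_arg(int argc, char **argv)
{
	assert(argv != NULL);

	/* argv[ARG_STATSOCKET] only exists when argc > ARG_STATSOCKET. */
	if (argc <= ARG_STATSOCKET)
		return NULL;

	return argv[ARG_STATSOCKET];
}
```

Deriving both the count check and the index from the same constant keeps the two from drifting apart, which is the kind of mismatch suspected here.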
Any hints on what to pass?
See the commit message of fcc5ccc4be2fbca697b2a4a2e447dc9206b83f44.
Maybe an off-by-one error in the arg processing?
You're right.
Patch uploaded: b1e502102965fbd84653b68400efde8b28de7077
See the commit message of fcc5ccc.
Ah, sorry. I did see that go by but got distracted and only looked at the man page. I've got that figured out now.
There is still an off-by-one in statsocket.c (wrong argv[] index):
diff --git a/src/usr/joold/statsocket.c b/src/usr/joold/statsocket.c
index 234ca199..b2e220d2 100644
--- a/src/usr/joold/statsocket.c
+++ b/src/usr/joold/statsocket.c
@@ -117,7 +117,7 @@ int statsocket_start(int argc, char **argv)
 		return 0;
 	}
-	error = create_socket(argv[2], &sk);
+	error = create_socket(argv[3], &sk);
 	if (error)
 		return error;
With that change I'm up and running and I am able to query the stats socket, so I think we're all set.
Thank you again for all of your help and patience with this issue! I'm fine to close this out now, and look forward to having it all rolled up in a future release.
There is still an off-by-one in statsocket.c (wrong argv[] index):
Huhhhhhhhhhhhh?
🤦🤦🤦🤦🤦🤦🤦
How did I miss this during the release? How did I miss it during the TESTING?
WHAT
You know what, I'm going to upgrade to argp. This is clearly not working out.
Hello again. Think I'm finally happy with it.
joold's interface has been bugging me for a while, because of its strange reliance on files:
joold <netsocket file> <modsocket file> <statsocket port>
I decided to deprecate that, and move userspace joold to jool session proxy. As determined before, it's an argp spread like the rest of jool's commands:
$ jool session proxy --help
Usage: proxy [OPTION...] <net.mcast.address>
  -i, --net.dev.in=STR       IPv4: IP_ADD_MEMBERSHIP; IPv6: IPV6_ADD_MEMBERSHIP
                             (see ip(7))
  -o, --net.dev.out=STR      IPv4: IP_MULTICAST_IF, IPv6: IPV6_MULTICAST_IF
                             (see ip(7))
  -p, --net.mcast.port=STR   UDP port where the sessions will be advertised
      --stats.address=STR    Address to bind the stats socket to
      --stats.port=STR       Port to bind the stats socket to
  -t, --net.ttl=INT          Multicast datagram Time To Live
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version
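For readers unfamiliar with argp, here is a rough sketch of how a small subset of that interface could be declared with glibc's argp. It is not the code in the main branch: struct proxy_args, parse_proxy_args() and the OPT_STATS_PORT key are hypothetical names, and only --net.ttl, --stats.port and the positional address are modeled.

```c
#include <argp.h>
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for the parsed values. */
struct proxy_args {
	const char *mcast_addr; /* <net.mcast.address> positional */
	const char *stats_port; /* --stats.port */
	int ttl;                /* --net.ttl */
};

/* A non-printable key gives a long-only option (no short flag). */
enum { OPT_STATS_PORT = 1000 };

static const struct argp_option options[] = {
	{ "net.ttl", 't', "INT", 0, "Multicast datagram Time To Live", 0 },
	{ "stats.port", OPT_STATS_PORT, "STR", 0,
	  "Port to bind the stats socket to", 0 },
	{ 0 }
};

static error_t parse_opt(int key, char *arg, struct argp_state *state)
{
	struct proxy_args *args = state->input;

	switch (key) {
	case 't':
		args->ttl = atoi(arg);
		return 0;
	case OPT_STATS_PORT:
		args->stats_port = arg;
		return 0;
	case ARGP_KEY_ARG:
		/* Exactly one positional argument is accepted. */
		if (state->arg_num > 0)
			argp_usage(state);
		args->mcast_addr = arg;
		return 0;
	case ARGP_KEY_END:
		if (state->arg_num < 1)
			argp_usage(state);
		return 0;
	default:
		return ARGP_ERR_UNKNOWN;
	}
}

/* Returns 0 on success; argp prints usage and exits on malformed input. */
static int parse_proxy_args(int argc, char **argv, struct proxy_args *out)
{
	struct argp argp = { options, parse_opt, "<net.mcast.address>", 0 };

	assert(out != NULL);
	return argp_parse(&argp, argc, argv, 0, NULL, out);
}
```

Because every value is looked up by name, an argv[] index mistake like the one above has nowhere to hide; argp also generates --help, --usage and the error messages for missing arguments.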
Similarly, jool joold advertise is now jool session advertise.
You don't need to update your scripts because, even though the old commands are deprecated, they still work, and I'm not really in a rush to delete them. But it'd be great if you could confirm I didn't break something again.
The code is in the main branch.
Version 4.1.13 released; closing.
We have two jool boxes in an active/active load-sharing setup that we're testing for our campus. Things have been fine for months in limited testing. This week we added more test clients to the boxes and have been getting several machine lockups requiring a hard reboot. The message on the screen is typically out of memory.
This is jool 4.1.8.0 on Debian Bullseye.
I'm not a kernel expert, but looking at some of the other memory issue reports people mentioned /proc/slabinfo. I sampled that every 2 seconds and the "jool_joold_nodes" line is increasing constantly (this is on a machine that's been up less than 2 hours and 'jool session display' lists approximately 12,000 sessions):
I sampled active sessions in jool and even when those decreased, the slabs continued to increase. Meanwhile, "available" memory (as reported by top/free) has been steadily decreasing (several MiB per minute). Since we've increased the number of users, the machines have needed a reboot in as few as 20 hours.
I don't know enough about the kernel structures to know what jool_joold_nodes represents, but I'm guessing it shouldn't be monotonically increasing. I'm happy to gather any additional data that may be helpful.
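For anyone who wants to reproduce the sampling described above, this is roughly how a single cache's counters can be pulled out of /proc/slabinfo. It is a sketch that assumes the usual slabinfo 2.x column order (name, active_objs, num_objs, ...); parse_slab_line() and read_slab() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parses one slabinfo line. Returns 1 and fills the outputs if the line
 * describes `cache`; returns 0 (outputs untouched) otherwise.
 */
static int parse_slab_line(const char *line, const char *cache,
		unsigned long *active_objs, unsigned long *num_objs)
{
	char name[64];
	unsigned long active, total;

	assert(line != NULL && cache != NULL);

	/* Header lines fail here because their 2nd field is not numeric. */
	if (sscanf(line, "%63s %lu %lu", name, &active, &total) != 3)
		return 0;
	if (strcmp(name, cache) != 0)
		return 0;

	*active_objs = active;
	*num_objs = total;
	return 1;
}

/* Scans `path` for `cache`. Returns 0 if found, -1 otherwise. */
static int read_slab(const char *path, const char *cache,
		unsigned long *active_objs, unsigned long *num_objs)
{
	char line[512];
	int found = 0;
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	while (!found && fgets(line, sizeof(line), f) != NULL)
		found = parse_slab_line(line, cache, active_objs, num_objs);
	fclose(f);

	return found ? 0 : -1;
}
```

Calling read_slab("/proc/slabinfo", "jool_joold_nodes", &active, &total) in a loop with sleep(2) between iterations would give the two-second samples; reading /proc/slabinfo usually requires root.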
One of my two boxes is locked up at the moment, but once I'm back to fully redundant I can try things like restarting jool or unloading the kernel module to see if we can recover memory without rebooting.