OOM on Jool boxes, possible leak?

suffieldacademy commented 1 year ago

We have two jool boxes in an active/active load-sharing setup that we're testing for our campus. Things have been fine for months in limited testing. This week we added more test clients to the boxes and have been getting several machine lockups requiring a hard reboot. The message on the screen is typically out of memory.

This is jool 4.1.8.0 on Debian Bullseye.

I'm not a kernel expert, but looking at some of the other memory issue reports people mentioned /proc/slabinfo. I sampled that every 2 seconds and the "jool_joold_nodes" line is increasing constantly (this is on a machine that's been up less than 2 hours and 'jool session display' lists approximately 12,000 sessions):

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3804464 3804464    120   34    1 : tunables    0    0    0 : slabdata 111896 111896      0
session_nodes      14106  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14532  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3804668 3804668    120   34    1 : tunables    0    0    0 : slabdata 111902 111902      0
session_nodes      14129  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14564  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3805620 3805620    120   34    1 : tunables    0    0    0 : slabdata 111930 111930      0
session_nodes      14150  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14557  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jool_joold_nodes  3805824 3805824    120   34    1 : tunables    0    0    0 : slabdata 111936 111936      0
session_nodes      14125  60294    104   39    1 : tunables    0    0    0 : slabdata   1546   1546      0
bib_nodes          14539  60690     96   42    1 : tunables    0    0    0 : slabdata   1445   1445      0
jool_xlations        180    180    728   45    8 : tunables    0    0    0 : slabdata      4      4      0

I sampled active sessions in jool and even when those decreased, the slabs continued to increase. Meanwhile, "available" memory (as reported by top/free) has been steadily decreasing (several MiB per minute). Since we've increased the number of users, the machines have needed a reboot in as few as 20 hours.
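To put numbers on the samples above, here is a rough sketch in C that parses one /proc/slabinfo data row and estimates the memory held by a cache and its growth rate. The names `slab_row`, `parse_slab_row`, `slab_bytes` and `growth_per_second` are made up for illustration, and the 4096-byte page size is an assumption about the machine. Under that assumption, the first jool_joold_nodes row (111896 slabs of 1 page each) works out to roughly 437 MiB, and the first and last samples, taken 6 seconds apart, differ by 1360 objects, or about 227 objects per second.

```c
#include <stdio.h>

/* One parsed data row of /proc/slabinfo (column layout as shown above). */
struct slab_row {
    char name[64];
    unsigned long active_objs;
    unsigned long num_objs;
    unsigned long objsize;
    unsigned long objperslab;
    unsigned long pagesperslab;
    unsigned long num_slabs;
};

/* Returns 0 on success, -1 if the line is not a data row (e.g. the header). */
int parse_slab_row(const char *line, struct slab_row *row)
{
    unsigned long active_slabs;
    int n = sscanf(line,
        "%63s %lu %lu %lu %lu %lu : tunables %*u %*u %*u : slabdata %lu %lu",
        row->name, &row->active_objs, &row->num_objs, &row->objsize,
        &row->objperslab, &row->pagesperslab, &active_slabs, &row->num_slabs);
    return (n == 8) ? 0 : -1;
}

/* Bytes backing the cache: slabs * pages per slab * page size (assumed). */
unsigned long long slab_bytes(const struct slab_row *row, unsigned long page_size)
{
    return (unsigned long long)row->num_slabs * row->pagesperslab * page_size;
}

/* Average change in active objects per second between two samples. */
double growth_per_second(const struct slab_row *older,
                         const struct slab_row *newer, double seconds)
{
    return ((double)newer->active_objs - (double)older->active_objs) / seconds;
}
```

Feeding it the jool_joold_nodes lines from two samples and the elapsed time between them gives a quick way to check whether the cache ever shrinks; multiplying the object rate by the 120-byte object size gives an approximate byte rate.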

I don't know enough about the kernel structures to know what jool_joold_nodes represents, but I'm guessing it shouldn't be monotonically increasing. I'm happy to gather any additional data that may be helpful.

One of my two boxes is locked up at the moment, but once I'm back to fully redundant I can try things like restarting jool or unloading the kernel module to see if we can recover memory without rebooting.

ydahhrk commented 8 months ago

> Also, it looks like ENOSPC is zero for both, so that's good, right?

Yes

> Should I continue to lower the deadline and increase buffer, or now that ENOSPC is zero am I basically done?

The capacity and deadline only target ENOSPC. You're good.

> Are all the other lost packets just due to network losses of some sort?

Probably; I don't see any other explanation.

If you want to analyze it, I just uploaded one more test commit: Stats on the userspace daemons.

This one is simpler: fcc5ccc4be2fbca697b2a4a2e447dc9206b83f44.

If you want. At this point I feel like the bug is fixed already.

suffieldacademy commented 7 months ago

Sorry, was on vacation and then busy. Trying out the latest commit so we can put this issue to bed. Unfortunately, joold doesn't seem to want to start with this latest version. I get a log message that it's trying to open a "port" that is actually the modsocket file?

2024-03-30T22:46:55.184489-04:00 mario joold: Setting up statsocket (port /run/sa-joold-nat64-wkp-lower/modsocket.json)...

Maybe an off-by-one error in the arg processing?

    if (argc < 3) {
        syslog(LOG_INFO, "statsocket port unavailable; skipping statsocket.");
        return 0;
    }

    error = create_socket(argv[2], &sk);
    if (error)
        return error;

I'm exiting with a -8 code. My guess is I need to pass a third arg to joold but I couldn't find any documentation on that. Any hints on what to pass (and if the code is correct)? I'm not much of a C hacker so these are my only guesses...

ydahhrk commented 7 months ago

> Any hints on what to pass?

See the commit message of fcc5ccc4be2fbca697b2a4a2e447dc9206b83f44.

> Maybe an off-by-one error in the arg processing?

You're right.

ydahhrk commented 7 months ago

Patch uploaded: b1e502102965fbd84653b68400efde8b28de7077

suffieldacademy commented 7 months ago

> See the commit message of fcc5ccc.

Ah, sorry. I did see that go by but got distracted and only looked at the man page. I've got that figured out now.

There is still an off-by-one in statsocket.c (wrong argv[] index):

diff --git a/src/usr/joold/statsocket.c b/src/usr/joold/statsocket.c
index 234ca199..b2e220d2 100644
--- a/src/usr/joold/statsocket.c
+++ b/src/usr/joold/statsocket.c
@@ -117,7 +117,7 @@ int statsocket_start(int argc, char **argv)
                return 0;
        }

-       error = create_socket(argv[2], &sk);
+       error = create_socket(argv[3], &sk);
        if (error)
                return error;

With that change I'm up and running and I am able to query the stats socket, so I think we're all set.
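For anyone following along, a small sketch of the indexing involved, assuming the three positional arguments from joold's usage line (`joold <netsocket file> <modsocket file> <statsocket port>`). The helper name `statsocket_port_arg` is hypothetical and this is not Jool's code; it only illustrates why both the `argc` check and the `argv` index have to move together.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from Jool.
 * Assumed invocation: joold <netsocket file> <modsocket file> <statsocket port>
 *   argv[0] = program name
 *   argv[1] = netsocket file
 *   argv[2] = modsocket file
 *   argv[3] = statsocket port
 * The port is therefore only present when argc is at least 4.
 */
const char *statsocket_port_arg(int argc, char **argv)
{
    if (argc < 4)
        return NULL; /* optional third argument was not given */
    return argv[3];
}
```

With `argc < 3` and `argv[2]`, a two-argument invocation passes the check and the modsocket file is handed to the socket setup as if it were a port, which matches the log line quoted earlier.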

Thank you again for all of your help and patience with this issue! I'm fine to close this out now, and look forward to having it all rolled up in a future release.

ydahhrk commented 3 months ago

> There is still an off-by-one in statsocket.c (wrong argv[] index):

Huhhhhhhhhhhhh?

🤦🤦🤦🤦🤦🤦🤦

How did I miss this during the release? How did I miss it during the TESTING?

WHAT

ydahhrk commented 3 months ago

You know what, I'm going to upgrade to argp. This is clearly not working out.

ydahhrk commented 3 months ago

Hello again. Think I'm finally happy with it.

joold's interface has been bugging me for a while, because of its strange reliance on files:

joold <netsocket file> <modsocket file> <statsocket port>

I decided to deprecate that, and move userspace joold to jool session proxy. As determined before, it's an argp spread like the rest of jool's commands:

$ jool session proxy --help
Usage: proxy [OPTION...] <net.mcast.address>

  -i, --net.dev.in=STR       IPv4: IP_ADD_MEMBERSHIP; IPv6: IPV6_ADD_MEMBERSHIP
                             (see ip(7))
  -o, --net.dev.out=STR      IPv4: IP_MULTICAST_IF, IPv6: IPV6_MULTICAST_IF
                             (see ip(7))
  -p, --net.mcast.port=STR   UDP port where the sessions will be advertised
      --stats.address=STR    Address to bind the stats socket to
      --stats.port=STR       Port to bind the stats socket to
  -t, --net.ttl=INT          Multicast datagram Time To Live
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Similarly, jool joold advertise is now jool session advertise.

You don't need to update your scripts because, even though the old commands are deprecated, they still work, and I'm not really in a rush to delete them. But it'd be great if you could confirm I didn't break something again.

The code is in the main branch.
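For readers who haven't used argp (the glibc argument parser), here is a minimal sketch of how an option set like the one in the help output above might be declared. It is not Jool's actual source: `proxy_args`, the `KEY_*` constants and `parse_proxy_args` are hypothetical names, and only three of the listed options are wired up.

```c
#include <argp.h>
#include <string.h>

/* Hypothetical container for parsed values; not Jool's real struct. */
struct proxy_args {
    const char *mcast_address; /* positional <net.mcast.address> */
    const char *mcast_port;    /* -p, --net.mcast.port */
    const char *stats_address; /* --stats.address */
    const char *stats_port;    /* --stats.port */
};

/* Long-only options need keys outside the printable character range. */
enum { KEY_STATS_ADDRESS = 1000, KEY_STATS_PORT };

static const struct argp_option proxy_options[] = {
    { "net.mcast.port", 'p', "STR", 0,
      "UDP port where the sessions will be advertised", 0 },
    { "stats.address", KEY_STATS_ADDRESS, "STR", 0,
      "Address to bind the stats socket to", 0 },
    { "stats.port", KEY_STATS_PORT, "STR", 0,
      "Port to bind the stats socket to", 0 },
    { 0 }
};

static error_t proxy_parse_opt(int key, char *arg, struct argp_state *state)
{
    struct proxy_args *args = state->input;

    switch (key) {
    case 'p':
        args->mcast_port = arg;
        break;
    case KEY_STATS_ADDRESS:
        args->stats_address = arg;
        break;
    case KEY_STATS_PORT:
        args->stats_port = arg;
        break;
    case ARGP_KEY_ARG:
        if (state->arg_num >= 1)
            argp_usage(state); /* more than one positional argument */
        args->mcast_address = arg;
        break;
    case ARGP_KEY_END:
        if (state->arg_num < 1)
            argp_usage(state); /* multicast address is mandatory */
        break;
    default:
        return ARGP_ERR_UNKNOWN;
    }
    return 0;
}

static const struct argp proxy_argp = {
    .options = proxy_options,
    .parser = proxy_parse_opt,
    .args_doc = "<net.mcast.address>",
};

/* Returns 0 on success; argp prints usage and exits on malformed input. */
int parse_proxy_args(int argc, char **argv, struct proxy_args *out)
{
    memset(out, 0, sizeof(*out));
    return argp_parse(&proxy_argp, argc, argv, 0, NULL, out);
}
```

The point of the design is that every value is named on the command line, so there is no positional index to get wrong, and argp generates the --help and --usage text from the same table.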

ydahhrk commented 2 months ago

Version 4.1.13 released; closing.

NICMx / Jool

OOM on Jool boxes, possible leak? #410