chaos / powerman

cluster power control
GNU General Public License v2.0
43 stars 19 forks source link

powermand: segfault #197

Closed chu11 closed 1 month ago

chu11 commented 2 months ago

A crash of powermand recently occurred w/ either of the following

  pm -0 "pclstr[201-896]-p"
  pm -0 "peclstr[201-896]-s"

admins were able to get a core dump but for the life of me I cannot get debug symbols from powermand to get a trace. Not clear if it wasn't built with debug symbols or rpmbuild messed up and didn't build debuginfo rpm.

The problem disappeared when the hostnames for this machine happened to be changed.

Anyways, perhaps best thing to try is to try and emulate that machine's powerman.conf and create a temp /etc/hosts file with the old hostnames and see if we can reproduce under development branch. The assumption is there's something unique to the hostname formatting.

chu11 commented 2 months ago

was eventually able to find all the right debuginfo rpms, they were hidden in a different place than i was expecting

(gdb) bt
#0  snprintf (__fmt=0x555555572235 "%s", __n=1024, __s=0x7fffffffe700 "") at /usr/include/bits/stdio2.h:68
#1  _process_setresult (act=act@entry=0x55557785ccb0, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
#2  0x000055555555b8cc in _process_stmt (timeout=0x7fffffffec10, e=0x555577d128a0, act=0x55557785ccb0, dev=0x555559460dd0) at device.c:945
#3  _process_action (timeout=<optimized out>, dev=<optimized out>) at device.c:871
#4  dev_post_poll (pfd=0x5555557b31e0, timeout=0x7fffffffec10) at device.c:1618
#5  0x00005555555572cb in _select_loop () at powermand.c:189
#6  main (argc=<optimized out>, argv=<optimized out>) at powermand.c:129
(gdb) up
#1  _process_setresult (act=act@entry=0x55557785ccb0, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
1183                    snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
(gdb) p arg
$2 = (Arg *) 0x0

so arg is NULL for some reason.

            if ((arg = arglist_find(act->arglist, plug->node))) {
                arg->result = result;
                xfree(arg->val);
                arg->val = xstrdup(str);
            }

            if (result != RT_SUCCESS) {
                char strbuf[1024];
                snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
                /* remove trailing carriage return or newline */
                strbuf[strcspn(strbuf, "\r\n")] = '\0';
                act->dpf_fun(act->client_id, "%s: %s", arg->node, strbuf);
            }

So presumably arglist_find() didn't find what was intended.

Initial assumption, I'm wondering if some part of powerman parses nodes of the form pclstrXXX-p incorrectly, like the hyphen messes things up. And thus messes up this parsing.

Edit:

scratch that

(gdb) p *plug
$3 = {name = 0x5555594978f0 "Enclosure", node = 0x5555594b2c60 "clustr-rak569"}
chu11 commented 2 months ago

grabbing the original powerman.conf and running it under simulation mode (remove --resolve-hosts and add --test-mode to each redifshpower line) did not reproduce the issue.

So the assumption is something "real networking" (--resolve-hosts + libcurl, etc.) contributed to this issue.

chu11 commented 2 months ago

running powermand under valgrind with a small-ish config I did see this

==2241718== Invalid read of size 8
==2241718==    at 0x40FCC4: _iterator_advance (hostlist.c:2281)
==2241718==    by 0x40FCC4: hostlist_next (hostlist.c:2315)
==2241718==    by 0x40240D: arglist_create (arglist.c:87)
==2241718==    by 0x402AEB: _create_command (client.c:533)
==2241718==    by 0x4032AF: _parse_input (client.c:617)
==2241718==    by 0x4032AF: _handle_input (client.c:1085)
==2241718==    by 0x403FD8: cli_post_poll (client.c:1176)
==2241718==    by 0x4020BF: _select_loop (powermand.c:188)
==2241718==    by 0x4020BF: main (powermand.c:129)
==2241718==  Address 0xbb5c430 is 0 bytes after a block of size 3,200 alloc'd
==2241718==    at 0x4C40B77: realloc (vg_replace_malloc.c:1690)
==2241718==    by 0x40CC51: hostlist_resize (hostlist.c:1105)
==2241718==    by 0x40DA2F: hostlist_copy (hostlist.c:1573)
==2241718==    by 0x402B1C: _create_command (client.c:519)
==2241718==    by 0x4032AF: _parse_input (client.c:617)
==2241718==    by 0x4032AF: _handle_input (client.c:1085)
==2241718==    by 0x403FD8: cli_post_poll (client.c:1176)
==2241718==    by 0x4020BF: _select_loop (powermand.c:188)
==2241718==    by 0x4020BF: main (powermand.c:129)
==2241718== 
==2241718== Invalid read of size 8
==2241718==    at 0x40FCC4: _iterator_advance (hostlist.c:2281)
==2241718==    by 0x40FCC4: hostlist_next (hostlist.c:2315)
==2241718==    by 0x402530: arglist_next (arglist.c:146)
==2241718==    by 0x40397F: _client_query_status_reply (client.c:424)
==2241718==    by 0x40397F: _act_finish (client.c:757)
==2241718==    by 0x406627: _process_action (device.c:895)
==2241718==    by 0x406627: dev_post_poll (device.c:1618)
==2241718==    by 0x4020CA: _select_loop (powermand.c:189)
==2241718==    by 0x4020CA: main (powermand.c:129)
==2241718==  Address 0xbc5f060 is 0 bytes after a block of size 3,200 alloc'd
==2241718==    at 0x4C40B77: realloc (vg_replace_malloc.c:1690)
==2241718==    by 0x40CC51: hostlist_resize (hostlist.c:1105)
==2241718==    by 0x40DA2F: hostlist_copy (hostlist.c:1573)
==2241718==    by 0x402425: arglist_create (arglist.c:94)
==2241718==    by 0x402AEB: _create_command (client.c:533)
==2241718==    by 0x4032AF: _parse_input (client.c:617)
==2241718==    by 0x4032AF: _handle_input (client.c:1085)
==2241718==    by 0x403FD8: cli_post_poll (client.c:1176)
==2241718==    by 0x4020BF: _select_loop (powermand.c:188)
==2241718==    by 0x4020BF: main (powermand.c:129)
==2241718== 

so perhaps there is a unhappy thing going on somewhere deep inside powermand or perhaps the hostlist library

chu11 commented 2 months ago

hmmmm

static void _iterator_advance(hostlist_iterator_t i)
{
    assert(i != NULL);
    assert(i->magic == HOSTLIST_MAGIC);
    if (i->idx > i->hl->nranges - 1)
        return;
    if (++(i->depth) > (i->hr->hi - i->hr->lo)) {
        i->depth = 0;
        i->hr = i->hl->hr[++i->idx];  // this line Invalid read of size 8
    }
}

My initial feeling was the if statement above should be if (i->idx >= (i->hl->nranges - 1)) to ensure ++i->idx does not go off the end of the array, but that lead to a bunch of test failures, likely b/c I don't understand the depth/hi/lo logic.

I instead tried assert (i->idx < (i->hl->size - 1)); right before the offending line. Presumably that should ensure we never go off the end of the array. But I also get test errors, suggesting the assert is hit. Might need to get some help here.

garlick commented 2 months ago

FWIW I added a unit test that used that hostname format with an iterator and didn't get a hit from valgrind. (Test is on my issue#197 branch, commit a1438efe2733d00d54fea4bbe288402453978379)

chu11 commented 2 months ago

The valgrind warnings go away if I do ...

diff --git a/src/liblsd/hostlist.c b/src/liblsd/hostlist.c
index 2ab4c045..f5e163a5 100644
--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -2278,6 +2278,11 @@ static void _iterator_advance(hostlist_iterator_t i)
         return;
     if (++(i->depth) > (i->hr->hi - i->hr->lo)) {
         i->depth = 0;
+        if (i->idx >= (i->hl->size - 1)) {
+            printf ("i think i'll array out of bounds %d %d\n", i->idx, i->hl->size);
+            ++i->idx;
+            return;
+        }
         i->hr = i->hl->hr[++i->idx];
     }
 }

I'm wondering if the setting of i->hr to some out of bounds value has some side effect later on in the code.

Edit:

also added value of nranges and got

i think i'll array out of bounds 399 400 400 outputting. So original code appears to have tried to go out of range. but some powerman tests are failing, so there is some side effect.

grondo commented 2 months ago

That "fix" seems reasonable to me, do the tests all pass with that?

I was also unable to reproduce this locally strangely (I think because in most cases hl->size (the number of entries allocated in the hl->hr array) is much larger than hl->nranges (the number of active hostranges)) so hl->hr == NULL in the common case and no out of bounds is used.

I wonder if in your workaround you set i->hr = NULL if that will allow your tests to pass?

Or alternately, ensure hl->size > hl->nranges when reallocing hl->hr.

grondo commented 2 months ago

Do you have a standalone hostlist test that reproduces this? I could add it to the testsuite and run it under valgrind (also would like to test against libflux-hostlist just in case).

grondo commented 2 months ago

also added value of nranges and got

i think i'll array out of bounds 399 400 400 outputting. So original code appears to have tried to go out of range. but some powerman tests are failing, so there is some side effect.

I don't understand this, can you explain what i think i'll array out of bounds ... means?

chu11 commented 2 months ago

I don't understand this, can you explain what i think i'll array out of bounds ... means?

Sorry, that was just my debugging printf otuput. 399 was i->idx, 400 was i->hl->size and 400 was i->hl->nranges. The output was right before the ++i->idx.

chu11 commented 2 months ago

That "fix" seems reasonable to me, do the tests all pass with that?

The tests now pass. I may have had some other debug stuff lingering that made it fail earlier.

what's interesting is that the warning from valgrind only appears once after the very first power query. after that it disappears. One would think it should happen a lot, making think there is some side effect cause elsewhere. Or perhaps it only happens on the "copied" hostlist once but there is no issue on the original (or vice versa).

Do you have a standalone hostlist test that reproduces this? I could add it to the testsuite and run it under valgrind (also would like to test against libflux-hostlist just in case).

Still trying to get to bottom of this, then I'll try and create one.

grondo commented 2 months ago

what's interesting is that the warning from valgrind only appears once after the very first power query. after that it disappears. One would think it should happen a lot, making think there is some side effect cause elsewhere. Or perhaps it only happens on the "copied" hostlist once but there is no issue on the original (or vice versa).

Did you check hl->size vs hl->nranges as discussed above? I think it is rare for the number of ranges to exactly match the size such that the invalid read occurs.

grondo commented 2 months ago

Sorry, that was just my debugging printf otuput. 399 was i->idx, 400 was i->hl->size and 400 was i->hl->nranges. The output was right before the ++i->idx.

That confirms the suspicion that this occurs when hl->size == hl->nranges. Another solution would be always ensure size > nranges, then i->hr will be set to NULL when going beyond nranges. This was probably the (incorrect) assumption in the code.

chu11 commented 2 months ago

That confirms the suspicion that this occurs when hl->size == hl->nranges. Another solution would be always ensure size > nranges, then i->hr will be set to NULL when going beyond nranges. This was probably the (incorrect) assumption in the code.

Ahhh the warning also goes away with this.

--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -1570,7 +1570,7 @@ hostlist_t hostlist_copy(const hostlist_t hl)
     new->nranges = hl->nranges;
     new->nhosts = hl->nhosts;
     if (new->nranges > new->size)
-        hostlist_resize(new, new->nranges);
+        hostlist_resize(new, new->nranges + HOSTLIST_CHUNK);

     for (i = 0; i < hl->nranges; i++)
         new->hr[i] = hostrange_copy(hl->hr[i]);

I'm guessing there is a copied hostlist somewhere in powerman that is used just once (or a few times) leading to the warning.

If we go with the rule that size should always be greater than nranges, then some fixes probably have to be done elsewhere. I notice

    if (hl->size == hl->nranges && !hostlist_expand(hl))                                                                                    
        goto error; 

in some places.

chu11 commented 2 months ago

@grondo this appears to hit the corner case in valgrind

#include <stdio.h>
#include <stdlib.h>

#include "hostlist.h"

int
main()
{
  hostlist_t hl = hostlist_create(NULL);
  hostlist_iterator_t itr;
  char *h;
  hostlist_push(hl, "fooA[0-80]");
  hostlist_push(hl, "fooB[0-80]");
  hostlist_push(hl, "fooC[0-80]");
  hostlist_push(hl, "fooD[0-80]");
  hostlist_push(hl, "fooE[0-80]");
  hostlist_push(hl, "fooF[0-80]");
  hostlist_push(hl, "fooG[0-80]");
  hostlist_push(hl, "fooH[0-80]");
  hostlist_push(hl, "fooI[0-80]");
  hostlist_push(hl, "fooJ[0-80]");
  hostlist_push(hl, "fooK[0-80]");
  hostlist_push(hl, "fooL[0-80]");
  hostlist_push(hl, "fooM[0-80]");
  hostlist_push(hl, "fooN[0-80]");
  hostlist_push(hl, "fooO[0-80]");
  hostlist_push(hl, "fooP[0-80]");
  itr = hostlist_iterator_create (hl);
  while ((h = hostlist_next (itr)))
      ;
  hostlist_iterator_destroy (itr);
  hostlist_destroy (hl);
}

basically pushing 16 chunks on the hostlist, equaling HOSTLIST_CHUNK internally.

This one hits along w/ hostlist_copy.

#include <stdio.h>
#include <stdlib.h>

#include "hostlist.h"

int
main()
{
  hostlist_t hl = hostlist_create(NULL);
  hostlist_t cpy;
  hostlist_iterator_t itr;
  char *h;
  hostlist_push(hl, "fooA[0-80]");
  hostlist_push(hl, "fooB[0-80]");
  hostlist_push(hl, "fooC[0-80]");
  hostlist_push(hl, "fooD[0-80]");
  hostlist_push(hl, "fooE[0-80]");
  hostlist_push(hl, "fooF[0-80]");
  hostlist_push(hl, "fooG[0-80]");
  hostlist_push(hl, "fooH[0-80]");
  hostlist_push(hl, "fooI[0-80]");
  hostlist_push(hl, "fooJ[0-80]");
  hostlist_push(hl, "fooK[0-80]");
  hostlist_push(hl, "fooL[0-80]");
  hostlist_push(hl, "fooM[0-80]");
  hostlist_push(hl, "fooN[0-80]");
  hostlist_push(hl, "fooO[0-80]");
  hostlist_push(hl, "fooP[0-80]");
  hostlist_push(hl, "fooQ[0-80]");
  cpy = hostlist_copy (hl);
  itr = hostlist_iterator_create (cpy);
  while ((h = hostlist_next (itr)))
      ;
  hostlist_iterator_destroy (itr);
  hostlist_destroy (hl);
}

key w/ the second test is to have > 16 hostranges, so that the copy makes the internal array exactly 17.

chu11 commented 2 months ago

@grondo

this appears to fix things using your hl->size > hl->nranges requirement.

diff --git a/src/liblsd/hostlist.c b/src/liblsd/hostlist.c
index 4324831c..aa4fbd00 100644
--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -1137,7 +1137,7 @@ static int hostlist_push_range(hostlist_t hl, hostrange_t hr)

     tail = (hl->nranges > 0) ? hl->hr[hl->nranges-1] : hl->hr[0];

-    if (hl->size == hl->nranges && !hostlist_expand(hl))
+    if ((hl->size - 1) == hl->nranges && !hostlist_expand(hl))
         goto error;

     if (hl->nranges > 0
@@ -1192,7 +1192,7 @@ static int hostlist_insert_range(hostlist_t hl, hostrange_t hr, int n)
     if (n > hl->nranges)
         return 0;

-    if (hl->size == hl->nranges && !hostlist_expand(hl))
+    if ((hl->size - 1) == hl->nranges && !hostlist_expand(hl))
         return 0;

     /* copy new hostrange into slot "n" in array */
@@ -1570,7 +1570,7 @@ hostlist_t hostlist_copy(const hostlist_t hl)
     new->nranges = hl->nranges;
     new->nhosts = hl->nhosts;
     if (new->nranges > new->size)
-        hostlist_resize(new, new->nranges);
+        hostlist_resize(new, new->nranges + 1);

     for (i = 0; i < hl->nranges; i++)
         new->hr[i] = hostrange_copy(hl->hr[i]);
@@ -2442,7 +2442,7 @@ static int hostset_insert_range(hostset_t set, hostrange_t hr)

     hl = set->hl;

-    if (hl->size == hl->nranges && !hostlist_expand(hl))
+    if ((hl->size - 1) == hl->nranges && !hostlist_expand(hl))
         return 0;

     nhosts = hostrange_count(hr);

So I think this or the prior

diff --git a/src/liblsd/hostlist.c b/src/liblsd/hostlist.c
index 2ab4c045..f5e163a5 100644
--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -2278,6 +2278,11 @@ static void _iterator_advance(hostlist_iterator_t i)
         return;
     if (++(i->depth) > (i->hr->hi - i->hr->lo)) {
         i->depth = 0;
+        if (i->idx >= (i->hl->size - 1)) {
+            ++i->idx;
+            i->hr = NULL;
+            return;
+        }
         i->hr = i->hl->hr[++i->idx];
     }
 }

both work.

do you have any preference to a fix?

grondo commented 2 months ago

Second one seems simpler :sweat_smile:

garlick commented 2 months ago

Suggestion: pull in the unit tests from lsd-tools.

Alternatively, convert to the hostlist in flux and bring in its unit tests.

Powerman doesn't have any hostlist unit tests that I can see.

chu11 commented 2 months ago

As an aside, there is a chance that this fix doesn't fix the segfault that happened with this original issue. They are in the same region, which is promsing (the hostlist within an arglist).

The only thing I can't figure out is that the valgrind warnings are basically on array out of bounds read errors occurring, but AFAICT the read is not used. Nothing indicates to me that an out of array bounds write occurs, leading to some corruption. But there's always the chance there's a side effect of the out of bounds access causes that I can't see b/c the i->hr is left non-NULL and pointing to something bad.

chu11 commented 2 months ago

Suggestion: pull in the unit tests from lsd-tools.

Good idea, although we have to dig it up. I don't know if it's in any repo we currently have online. The belief is that pdsh's version is the newest / best.

Alternatively, convert to the hostlist in flux and bring in its unit tests.

Was talking to @grondo about this, perhaps this is a longer term todo for not just powerman but many chaos projects. The API is different, so it would require changes everywhere.

chu11 commented 1 month ago

So the segfault was hit again, but the newer hostlist wasn't installed. I was able to get more debug information this time:

(gdb) bt
#0  snprintf (__fmt=0x555555572235 "%s", __n=1024, __s=0x7fffffffe700 "") at /usr/include/bits/stdio2.h:68
#1  _process_setresult (act=act@entry=0x555575a31710, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
#2  0x000055555555b8cc in _process_stmt (timeout=0x7fffffffec10, e=0x5555781d55d0, act=0x555575a31710, dev=0x555558123a20) at device.c:945
#3  _process_action (timeout=<optimized out>, dev=<optimized out>) at device.c:871
#4  dev_post_poll (pfd=0x5555557b31e0, timeout=0x7fffffffec10) at device.c:1618
#5  0x00005555555572cb in _select_loop () at powermand.c:189
#6  main (argc=<optimized out>, argv=<optimized out>) at powermand.c:129
(gdb) up
#1  _process_setresult (act=act@entry=0x555575a31710, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
1183                    snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
(gdb) p arg
$1 = (Arg *) 0x0
(gdb) p *plug
$3 = {name = 0x5555581236b0 "Enclosure", node = 0x555558175740 "elcap-cmm390"}
(gdb) p *act->arglist
$6 = {args = 0x555577a937d0, hl = 0x5555557ab8d0, refcount = 9}
(gdb) p *act->arglist->hl
$7 = {magic = 57005, size = 16, nranges = 5, nhosts = 559, hr = 0x5555557b25e0, ilist = 0x0}
(gdb) p *(act->arglist->hl->hr[0])
$12 = {prefix = 0x5555765b99e0 "pelcap", lo = 201, hi = 416, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[1])
$13 = {prefix = 0x555577130020 "pelcap", lo = 425, hi = 764, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[2])
$14 = {prefix = 0x5555778c3c40 "pelcap", lo = 767, hi = 767, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[3])
$15 = {prefix = 0x5555774f3520 "pelcap", lo = 781, hi = 781, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[4])
$16 = {prefix = 0x5555769d5dc0 "pelcap", lo = 837, hi = 837, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[5])
Cannot access memory at address 0x0

(just memory refresher, here's offending code area)

            if ((arg = arglist_find(act->arglist, plug->node))) {
                arg->result = result;
                xfree(arg->val);
                arg->val = xstrdup(str);
            }

            if (result != RT_SUCCESS) {
                char strbuf[1024];
                snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
                /* remove trailing carriage return or newline */
                strbuf[strcspn(strbuf, "\r\n")] = '\0';
                act->dpf_fun(act->client_id, "%s: %s", arg->node, strbuf);
            }

it appears that the search for elcap-cmm390 is happening in the wrong argument list (all the hosts are pelcap hots). Thus arg is NULL, leading to segfault in the snprintf() line.

To solve the segfault, I could easily put all of the second branch code inside the first one when we know arg != NULL. Or just check for arg != NULL && arg->val != NULL. However, I don't understand the consequences at the moment.

perhaps there is a subtlety in this area of code I don't understand. I notice in _process_setplugstate() is similarly does

            if ((arg = arglist_find(act->arglist, plug->node))) {
                arg->state = state;
                xfree(arg->val);
                arg->val = xstrdup(str);
            }

but it does not use arg afterwards. Under what scenarios would arglist_find() fail and it be "normal / ok".

chu11 commented 1 month ago

ok, i think I have an idea what's happening. I now understand that act->arglist is the list of inputted nodes that the user passed in on the command line, which is a mountain of pelcap nodes in this case.

elcap-cmm390 wasn't input on the command line, it is a "parent" node.

I'm betting there is a severe error of some sort on the parent of the pelcap node. That error is being output, thus powerman parses it, wants to "populate" that result. The user didn't input that host, thus didn't get found, thus segfault. Sure enough ...

(gdb) p str
$1 = 0x555579a8d970 "Couldn't connect to server\n"

So I think just checking for arg != NULL is more than enough to avoid the segfault. Should investigate other situations in redfishpower where these "severe error" outputs occur and should consider removing them.