gearman / gearmand

http://gearman.org/
Other
741 stars 138 forks source link

test t/protocol sometimes gets stuck #277

Open cheese opened 4 years ago

cheese commented 4 years ago

I am packaging 1.1.19.1 for Fedora and runs 'make test' during in %check session. I found sometimes test t/protocol gets stuck.

$ pstree -aTp 12260
make,12260 test
  └─make,12262 -j5 check-recursive
      └─sh,12264 -c...
          └─make,12269 -j5 check-am
              └─make,12275 -j5 check-TESTS check-local
                  └─sh,12282 -c...
                      └─make,12291 -j5 test-suite.log...
                          └─sh,13757 -c...
                              └─sh,13764 ./build-aux/test-driver --test-name t/protocol --log-file t/protocol.log --trs-file t/protocol.trs --color-tests no --enable-hard-errors yes ...
                                  └─lt-protocol,13765
                                      └─gearmand,13801 --verbose=INFO --log-file=var/log/gearmand.logXPGE8g --pid-file=var/run/gearmand.pidZDhXLg --port=22944 --listen=localhost

gearmand backtrace:

0x00007f1c0d0d1dfb in __lll_lock_wait_private () from target:/lib64/libc.so.6
(gdb) bt
#0  0x00007f1c0d0d1dfb in __lll_lock_wait_private () from target:/lib64/libc.so.6
#1  0x00007f1c0d0d739b in malloc () from target:/lib64/libc.so.6
#2  0x00007f1c0d159e56 in backtrace_symbols () from target:/lib64/libc.so.6
#3  0x000055c3f3ba1e1b in custom_backtrace () at libgearman/backtrace.cc:73
#4  0x000055c3f3b8df6d in _crash_handler (signal_=<optimized out>) at gearmand/gearmand.cc:558
#5  <signal handler called>
#6  0x00007f1c0d087625 in raise () from target:/lib64/libc.so.6
#7  0x00007f1c0d0708d9 in abort () from target:/lib64/libc.so.6
#8  0x00007f1c0d0cb4af in __libc_message () from target:/lib64/libc.so.6
#9  0x00007f1c0d0d2a9c in malloc_printerr () from target:/lib64/libc.so.6
#10 0x00007f1c0d0d49ce in _int_free () from target:/lib64/libc.so.6
#11 0x000055c3f3ba71ad in gearmand_thread_free (thread=0x55c3f3effc30) at libgearman-server/gearmand_thread.cc:359
#12 0x000055c3f3ba741e in gearmand_thread_free (thread=<optimized out>) at libgearman-server/gearmand_thread.cc:289
#13 0x000055c3f3ba319d in gearmand_free (gearmand=0x55c3f3effe20) at libgearman-server/gearmand.cc:292
#14 0x000055c3f3b8d630 in main (argc=<optimized out>, argv=<optimized out>) at gearmand/gearmand.cc:464

t/protocol.log:

protocol.GEARMAN_COMMAND_ECHO_REQ.GEARMAN_COMMAND_ECHO_REQ check                                        0:000289726 [ ok ]
protocol.GEARMAN_COMMAND_ECHO_REQ.GEARMAN_COMMAND_ECHO_REQ overrun                                      0:003474339 [ ok ]

var/log/gearmand.logXPGE8g:

   INFO 2020-02-18 17:25:34.109596 [  main ] Initializing Gear on port 22944 with SSL: false
   INFO 2020-02-18 17:25:34.000000 [  main ] Starting up with pid 24106, verbose is set to INFO
   INFO 2020-02-18 17:25:34.000000 [  main ] Listening on ::1:22944 (18)
   INFO 2020-02-18 17:25:34.000000 [  main ] Listening on 127.0.0.1:22944 (19)
   INFO 2020-02-18 17:25:34.000000 [  main ] Adding event for listening socket (18)
   INFO 2020-02-18 17:25:34.000000 [  main ] Adding event for listening socket (19)
   INFO 2020-02-18 17:25:34.000000 [  main ] Accepted connection from :::34454
   INFO 2020-02-18 17:25:34.000000 [     4 ] Peer connection has called close()
   INFO 2020-02-18 17:25:34.000000 [     4 ] Disconnected :::34454
   INFO 2020-02-18 17:25:34.000000 [     4 ] Gear connection disconnected: -:-
   INFO 2020-02-18 17:25:34.000000 [  main ] Accepted connection from ::3334:3435:3400:0:34456
   INFO 2020-02-18 17:25:34.000000 [     3 ] Peer connection has called close()
   INFO 2020-02-18 17:25:34.000000 [     3 ] Disconnected ::3334:3435:3400:0:34456
   INFO 2020-02-18 17:25:34.000000 [     3 ] Gear connection disconnected: -:-
   INFO 2020-02-18 17:25:34.000000 [  main ] Accepted connection from ::3334:3435:3600:0:34458
   INFO 2020-02-18 17:25:34.000000 [  main ] Clearing event for listening socket (18)
   INFO 2020-02-18 17:25:34.000000 [  main ] Clearing event for listening socket (19)
   INFO 2020-02-18 17:25:34.000000 [  main ] Closing listening socket (18)
   INFO 2020-02-18 17:25:34.000000 [  main ] Closing listening socket (19)
   INFO 2020-02-18 17:25:34.000000 [     2 ] Peer connection has called close()
   INFO 2020-02-18 17:25:34.000000 [     2 ] Disconnected ::3334:3435:3600:0:34458
   INFO 2020-02-18 17:25:34.000000 [     2 ] Disconnected ::3334:3435:3600:0:34458
esabol commented 4 years ago

Found this: https://stackoverflow.com/questions/2334352/why-do-i-get-a-sigabrt-here

cheese commented 4 years ago

How to change the log level for gearmand run by tests? So I can try to produce a more verbose log.

Or how to run gearmand under valgrind to detect memory issue?

esabol commented 4 years ago

How to change the log level for gearmand run by tests? So I can try to produce a more verbose log.

Or how to run gearmand under valgrind to detect memory issue?

I think I saw a target for running valgrind in the Makefile? Not sure....

esabol commented 4 years ago

Or how to run gearmand under valgrind to detect memory issue?

I think I saw a target for running valgrind in the Makefile? Not sure....

Yeah, type make valgrind. That echoes another command that starts with make check. If you copy and paste that command, it runs valgrind. There's also a make valgrind-supressions and a make helgrind. I'm not sure what the differences are with these.

Unfortunately, valgrind spews a bunch of errors like this:

Syscall param write(buf) points to uninitialised byte(s)
    at 0x57D5281: write (27/nptl/../sysdeps/unix/sysv/linux/write.c:27)
    by 0x57D4268: sem_open (27/nptl/sem_open.c:269)
    by 0x1270B0: libtest::SignalThread::setup() (/tmp/gearmand/libtest/signal.cc:248)

which, based on my reading of https://stackoverflow.com/questions/35578095/, are completely spurious. But nearly every test fails because of it. I'm still going through the weeds on this....

esabol commented 4 years ago

The valgrind-supressions [sic] command seems to suppress the Syscall param write(buf) points to uninitialised byte(s) bogus errors. That's probably what you want. All tests pass. That's good news!

helgrind has some FAILures though, and it seems to always hang in t/client. I see a lot of Possible data race during read of size X in the helgrind logs. I can't tell if those are spurious or not.