dCache / s2

s2 tests - SRM

s2 crashes when making concurrent SRM requests #1

Open paulmillar opened 8 years ago

paulmillar commented 8 years ago

There is a race condition in a code path that is not thread-safe. The SRM 2.2 use-case test PutOverwriteTransfParallel.s2 used to issue SRM requests on separate threads. The test has since been updated, as doing this was an error in the test.

The effect of the race condition is that the s2 program crashes with a "double free or corruption" error message. How often the problem is triggered appears to depend on the number of cores the client machine has. On (virtual) test machines with two cores, the crash appears with PutOverwriteTransfParallel.s2 only occasionally; on my desktop machine (8 cores) the problem seems to appear with every invocation of PutOverwriteTransfParallel.s2.

Running the test under gdb yields the following stack trace:

-*** Error in `/home/paul/git/s2/src/s2': double free or corruption (!prev): 0x0000000000857050 ***

Program received signal SIGABRT, Aborted.
0x00007ffff634a067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff634a067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff634b448 in __GI_abort () at abort.c:89
#2  0x00007ffff63881b4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff647d530 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff638d98e in malloc_printerr (action=1, str=0x7ffff647d638 "double free or corruption (!prev)", ptr=<optimized out>) at malloc.c:4996
#4  0x00007ffff638e696 in _int_free (av=av@entry=0x7ffff66ba620 <main_arena>, p=<optimized out>, p@entry=0x857040, have_lock=have_lock@entry=1) at malloc.c:3840
#5  0x00007ffff63906c0 in _int_realloc (av=av@entry=0x7ffff66ba620 <main_arena>, oldp=oldp@entry=0x857040, oldsize=oldsize@entry=8208, nb=nb@entry=16400)
    at malloc.c:4340
#6  0x00007ffff6391769 in __GI___libc_realloc (oldmem=0x857050, bytes=16384) at malloc.c:3029
#7  0x00007ffff5d2c3e0 in CRYPTO_realloc () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x00007ffff5db637c in lh_insert () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#9  0x00007ffff5db8a5a in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#10 0x00007ffff5db842b in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#11 0x00007ffff5db9eb1 in ERR_load_crypto_strings () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#12 0x00007ffff60f18c9 in SSL_load_error_strings () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#13 0x00007ffff7bb5582 in soap_ssl_init () from /usr/lib/x86_64-linux-gnu/libgsoapssl++.so.5
#14 0x00007ffff7bb8cb7 in soap_init_LIBRARY_VERSION_REQUIRED_20817 () from /usr/lib/x86_64-linux-gnu/libgsoapssl++.so.5
#15 0x00007ffff7bb8e88 in soap_new_LIBRARY_VERSION_REQUIRED_20817 () from /usr/lib/x86_64-linux-gnu/libgsoapssl++.so.5
#16 0x00000000004b303c in srmPrepareToPut::exec (this=0x8414e0, proc=0x7fffffffae10) at n_srmPrepareToPut.cpp:136
#17 0x000000000040bbcd in Process::exec_with_timeout (this=0x7fffffffae10) at process.cpp:870
#18 0x000000000040c04d in Process::eval_with_timeout (this=0x7fffffffae10) at process.cpp:979
#19 0x000000000040b6aa in Process::eval_sequential_repeats (this=0x7fffffffae10) at process.cpp:705
#20 0x000000000040ac2a in Process::eval_repeats (this=0x7fffffffae10) at process.cpp:500
#21 0x000000000040a843 in Process::eval (this=0x7fffffffae10) at process.cpp:447
#22 0x000000000040b43b in Process::eval_par (this=0x7fffffffb310) at process.cpp:641
#23 0x000000000040b8c5 in Process::eval_sequential_repeats (this=0x7fffffffb310) at process.cpp:778
#24 0x000000000040ac2a in Process::eval_repeats (this=0x7fffffffb310) at process.cpp:500
#25 0x000000000040a843 in Process::eval (this=0x7fffffffb310) at process.cpp:447
#26 0x000000000040c507 in Process::eval_subtree (this=0x7fffffffb920, root_exec=0) at process.cpp:1046
#27 0x000000000040c3d1 in Process::eval_with_timeout (this=0x7fffffffb920) at process.cpp:1022
#28 0x000000000040b6aa in Process::eval_sequential_repeats (this=0x7fffffffb920) at process.cpp:705
#29 0x000000000040ac2a in Process::eval_repeats (this=0x7fffffffb920) at process.cpp:500
#30 0x000000000040a843 in Process::eval (this=0x7fffffffb920) at process.cpp:447
#31 0x000000000040b43b in Process::eval_par (this=0x7fffffffbe20) at process.cpp:641
#32 0x000000000040b8c5 in Process::eval_sequential_repeats (this=0x7fffffffbe20) at process.cpp:778
#33 0x000000000040ac2a in Process::eval_repeats (this=0x7fffffffbe20) at process.cpp:500
#34 0x000000000040a843 in Process::eval (this=0x7fffffffbe20) at process.cpp:447
#35 0x000000000040b43b in Process::eval_par (this=0x7fffffffc320) at process.cpp:641
#36 0x000000000040b8c5 in Process::eval_sequential_repeats (this=0x7fffffffc320) at process.cpp:778
#37 0x000000000040ac2a in Process::eval_repeats (this=0x7fffffffc320) at process.cpp:500
#38 0x000000000040a843 in Process::eval (this=0x7fffffffc320) at process.cpp:447
#39 0x000000000040b43b in Process::eval_par (this=0x83e120) at process.cpp:641
#40 0x000000000040b8c5 in Process::eval_sequential_repeats (this=0x83e120) at process.cpp:778
#41 0x000000000040ac2a in Process::eval_repeats (this=0x83e120) at process.cpp:500
#42 0x000000000040a843 in Process::eval (this=0x83e120) at process.cpp:447
#43 0x000000000040719e in s2_run (argc=10, argv=0x7fffffffc988, i=10) at s2.cpp:977
#44 0x00000000004072db in main (argc=10, argv=0x7fffffffc988) at s2.cpp:1022

paulmillar commented 8 years ago

Minimal testcase to demonstrate this problem has been added as testing/scripts/protos/srm/2.2/testcase/issue-1.s2

paulmillar commented 8 years ago

The crash was caused by at least three separate problems:

  1. gSOAP and the Globus SSL support both initialise OpenSSL, and that initialisation is racy,
  2. Globus initialisation is racy,
  3. Calls to gss_init_sec_context appear to be racy with GLOBUS_OPENSSL_MODULE.

The first problem can be fixed in s2 code.

Problems 2 and 3 can be worked around within CGSI_PLUGIN, but should be fixed in Globus.
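
For the first problem, one possible approach (a minimal sketch, not necessarily the fix applied in s2) is to funnel the SSL library initialisation through a single one-time guard, so that concurrent requests cannot run it in parallel. The names init_ssl_libraries, ensure_ssl_initialised and run_concurrent_requests are hypothetical; the initialiser body is a stand-in for the real setup call, such as the soap_ssl_init() seen in the backtrace.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Counts how often the (simulated) library initialisation actually ran.
static std::atomic<int> init_runs{0};

// Hypothetical stand-in for the non-thread-safe setup. In s2 this is where
// a call such as soap_ssl_init() would go (an assumption, not the real fix).
static void init_ssl_libraries()
{
    ++init_runs;
}

// Every request thread calls this before creating its soap context.
// std::call_once runs the initialiser exactly once; other callers block
// until that single run has completed.
void ensure_ssl_initialised()
{
    static std::once_flag once;
    std::call_once(once, init_ssl_libraries);
}

// Starts n threads that all race to initialise, then reports how many
// times the initialiser ran in total.
int run_concurrent_requests(int n)
{
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back(ensure_ssl_initialised);
    }
    for (auto &t : workers) {
        t.join();
    }
    // With at least one worker, the guard must have run exactly once.
    assert(n <= 0 || init_runs.load() == 1);
    return init_runs.load();
}
```

Calling run_concurrent_requests(8) starts eight threads that all request initialisation, yet the guarded function runs only once. This only serialises initialisation performed by s2 itself; it would not help with problems 2 and 3, where the racy code is inside Globus.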