MetPX / sarrac

C implementation of (a subset of) Sarracenia (large scale file transfer utility)
GNU General Public License v2.0
4 stars 1 forks source link

test_shim_post coredump on later ubuntu versions. #111

Closed petersilva closed 1 week ago

petersilva commented 1 year ago

blacklab% make test_shim_post
Makefile:50: warning: ignoring prerequisites on suffix rule definition
./shim_post.sh >shim_post.log 2>&1
#python3 ./check_shim_post.py shim_post.log exit_on_bad
python3 ./check_shim_test.py shim_post.log
+.
 .
 .
2023-05-09 16:47:42,070 [INFO] published: { "pubTime":"20230509204742.0688027", "baseUrl":"file:/home/peter", "relPath":"test/hoho_my_darling.txt", "topic":"v03.post.test", "integrity":{  "method" : "sha512", "value" : "vuIxy9N84gi1FNtJ++ahgMQN+gAoyf8evK6GIAAgsva4IFlqL+37NVNrjImaMa/jtrKN6Qi+W/Ahcs\nqKMRvz0w=="  } , "mtime":"20230509204742.0619578", "atime":"20230509204742.0619578", "mode":"0664", "size":"40238", "toto":"pig"}

RESULT: BAD! binary crashed. 
RESULT: Good! 1 sha512 posts from 070 cp command. as expected
+ touch hihi
 .
 .
 .
RESULT: Good! 2 remove posts from 210 removing two files. as expected
RESULT: comment 220 shim test posting end
posts: {'sha512': 7, 'directory': 3, 'rename': 5, 'rmdir': 1, 'link': 1, 'remove': 6}
RESULT: summary: good: 21, bad 1, total 21
make: *** [Makefile:127: test_shim_post] Error 1
blacklab% 

so one error in the tests, with a binary crashing. Don't know if it is the laptop, or the OS, or something in the environment... did a git bisection, and the issues is not new, it crashed in all versions going back three months.

looking at the log of the script itself:

  .
  .
  .
023-05-09 16:47:42,071 [DEBUG] realpath_adjust -2, /home/peter/Sarracenia -> /home/peter/Sarracenia
2023-05-09 16:47:42,071 [DEBUG] sr_isMatchingPattern, testing string: /home/peter/Sarracenia/metpx-sr3c/shim_post.log vs mask: accept /home/peter/Sarracenia/metpx-sr3c/.* next=0x562c7a0cbba0
2023-05-09 16:47:42,071 [DEBUG] sr_isMatchingPattern: /home/peter/Sarracenia/metpx-sr3c/shim_post.log matched mask: accept /home/peter/Sarracenia/metpx-sr3c/.*
SR_SHIMDEBUG 1 1142830 0.0713236 srshim_realpost accepted... /home/peter/Sarracenia/metpx-sr3c/shim_post.log now
SR_SHIMDEBUG 1 1142830 0.0713277 srshim_realpost rejecting should_not_post... /home/peter/Sarracenia/metpx-sr3c/shim_post.log
SR_SHIMDEBUG 3 1142830 0.071331 shim re-enabled after post of /home/peter/Sarracenia/metpx-sr3c/shim_post.log
SR_SHIMDEBUG 5 1142830 0.0713353  fclose 0x7f8e68df76a0 fd=2 fdstat=1758506608, starting
Segmentation fault (core dumped)
+ echo #test 1 sha512 080 touch command
#test 1 sha512 080 touch command
+ touch hihi
 startup SR_
  .
 .
 .

no ideas what is special about this laptop... noting for later investigation.

petersilva commented 1 year ago

does not happen on my normal (22.04) PC.
occasionally happens on github CI/CD. EDIT: not the binary crash... just one test fails, might not be the same thing.

petersilva commented 1 month ago

my desktop PC has been updated to 24.04, and now it happens there as well. using strace, found that it crashes in:

        case 's':
                ctx = EVP_MD_CTX_create();
                md = EVP_sha512();
                EVP_DigestInit_ex(ctx, md, NULL);

Somewhere in the bowels of EVP_DigestInit_ex() it seg faults. I added code to check the return values of the previous callse (ctx & md) and they aren't NULL, which is the only test we have available for validity.

strange.

petersilva commented 1 month ago

I put a valgrind around it, to watch it crash, and it gives me kind of stack trace... fwiw:

==511070== 1 errors in context 1 of 3:
==511070== Invalid read of size 4
==511070==    at 0x49616D5: __pthread_rwlock_rdlock_full64 (pthread_rwlock_common.c:298)
==511070==    by 0x49616D5: pthread_rwlock_rdlock@@GLIBC_2.34 (pthread_rwlock_rdlock.c:26)
==511070==    by 0x530F5FC: CRYPTO_THREAD_read_lock (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070==    by 0x52FDCD3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070==    by 0x52D95C1: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070==    by 0x52D9AE3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070==    by 0x52C27E1: EVP_MD_fetch (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070==    by 0x52C3090: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070==    by 0x4AF265E: sr_set_sumstr (sr_util.c:865)
==511070==    by 0x4ADB124: sr_file2message_seq (sr_post.c:895)
==511070==    by 0x4ADB33D: sr_post (sr_post.c:933)
==511070==    by 0x4860276: exit_cleanup_posts (libsr3shim.c:1338)
==511070==    by 0x4906371: __cxa_finalize (cxa_finalize.c:82)
==511070==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==5

no clue...

petersilva commented 1 month ago

so... I've noticed that there are these routines

       ctx = EVP_MD_CTX_create();
           md = EVP_sha512();

They are supposed to return pointers to these fixed data structures in memory. so the call should return the same point every time, but it doesn't. In the normal calls, the values are like this:

ctx=0x60be06c08950
md=0x79494cec5ba0

It looks like it crashed once with the same values... I see another case they are...


ctx=0x559159a1986
md=0x7024fe6c5ba0

before a crash. then when it crashes a second time, the values are like this:

tx=0x5800c20
md=0x54dcba0

A lot shorter... like the upper bits have been masked out or something.

I wrote a small program to call these routines 10 times... the pointer values are the same every time I call. so some kind of memory corruption?

petersilva commented 3 weeks ago

OK, I have isolated it further... if I remove the copy_file_range entry point, the segfault goes away... so looking in there now.

petersilva commented 3 weeks ago

the routine runs fine when called... does not seem to be doing anything wrong. the branch is here: https://github.com/MetPX/sarrac/tree/issue111_2_nevermerge_minimal_crasher

petersilva commented 3 weeks ago

so in more detail:

petersilva commented 3 weeks ago

OK... boiled stuff away until I got it to work... copy_file_range was failing... it was assuming that zero status was it worked, but this routine actually returns the number of bytes transferred, so 0 is failure and higher number is good. So had to re-arrange things to return the correct values.

Once I did that, ported that back to the initial fix branch, and.... it still crashes... working backwards, as soon as close and fclose entry points return, it starts crashing again. so the problem is likely at least two problems, one with copy_file_range seems to be fixed, but there are some more in close/fclose that need to be figured out.

petersilva commented 3 weeks ago

ok the reason why it started working was because when copy_file_range returned an error properly, it would never try to post a second time in the same pid.

The problem seems to be that within a single pid, it goes throught the motions of trying to post a file twice, and on the second pass, something is corrupt.

The second call can be triggerred by presence of copy_file_range(), or close() entry points... being present... either one will call double calls.

There's nothing wrong with calling it twice. it should not crash anyways. So the fix above just prevents the condition that triggers the core dump from arising, does not fix anything.

fractal% grep -a file2message shim_post.log 
2024-11-09 12:13:40,012 [DEBUG] shim file2message start with: /home/peter/test/hoho_my_darling.txt sb=0x7ffe947737d0 islnk=0, isdir=0, isreg=1
fractal% 

when it crashes, there is a second call to file2message_start.

petersilva commented 3 weeks ago

OK.. that seems to agree with older versions also... older versions are only trying to post the file once... so they don't crash. Repeated test on ubuntu 18 and it does not crash and only calls file2message start once.

petersilva commented 3 weeks ago

as expected, if I remove atexit processing, it does not dump core either... (but we miss some posts in other circumstances... so it's a call to post something during atexit processing that triggers it.)

petersilva commented 3 weeks ago

so this is the prototype branch: https://github.com/MetPX/sarrac/tree/issue111_3

that would be a merge candidate... but the underlying problem is not fixed. It just has fixes that might have been the cause, but aren't but are good anyways:

petersilva commented 3 weeks ago

have this commit f17b47753e49820d36195fa8b345a110c5b7d372 which replaces the entire OPENSSL API in use to match the example from 22.04, but when I try it on ubuntu 18.04... it doesn't work, so had to revert. Didn't fix it anyways.

Making a similar replacement in the minimal case DID fix it.

petersilva commented 3 weeks ago

OK, found a way to demonstrate there is a memory leak in the openssl stuff... using the old code (in issue111_3 branch)


  vi Makefile   # ... add --leak-check=full to valgrind line.
   make 
  make test_shim_post
  make trust_but_verify  # might have to run two or three times... timing issue.
  make trust_but_verify

there was a local_post.conf file missing that is needed by trust_but_verify. by invoking test_shim_post first, it finds that file and runs properly... leading us to reports of bytes being lost in the ssl library:


==2712403== 1 bytes in 1 blocks are indirectly lost in loss record 1 of 14
==2712403==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2712403==    by 0x5227270: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x521754F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x52171E6: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x5226F96: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x5217337: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x52176D8: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x51F2804: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x51F2AE3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x51DB7E1: EVP_MD_fetch (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x51DC090: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403==    by 0x487B294: sr_set_sumstr (sr_util.c:854)
==2712403== 

note that sr_util.c:854 is just: EVP_DigestInit_ex(ctx, md, NULL); This is repeated a number of times.

So... when this is called a couple of times, sometimes there is a core dump. What can I do about that?

petersilva commented 3 weeks ago

oh... that commit replacing the usage of the openSSL stack? there is no leak with that f17b47753e49820d36195fa8b345a110c5b7d372

but it doesn't work on 18.04 (libssl-1.1 instead of 3.0.x), and it doesn't fix the crash. Hmm...