does not happen on my normal (22.04) PC.
occasionally happens on github CI/CD. EDIT: not the binary crash... just one test fails, might not be the same thing.
my desktop PC has been updated to 24.04, and now it happens there as well. using strace, I found that it crashes in:
case 's':
ctx = EVP_MD_CTX_create();
md = EVP_sha512();
EVP_DigestInit_ex(ctx, md, NULL);
Somewhere in the bowels of EVP_DigestInit_ex() it seg faults. I added code to check the return values of the previous calls (ctx & md) and they aren't NULL, which is the only validity test we have available.
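The check amounts to something like this (a sketch; the actual code added around sr_util.c may differ slightly):

#include <stdio.h>
#include <openssl/evp.h>

/* sketch: bail out if either call fails outright, since a NULL ctx or md
   would be a legitimate reason for EVP_DigestInit_ex() to crash. */
static int init_sha512(EVP_MD_CTX **ctx_out)
{
    EVP_MD_CTX *ctx = EVP_MD_CTX_create();
    const EVP_MD *md = EVP_sha512();

    if (!ctx || !md) {
        fprintf(stderr, "sha512 setup failed: ctx=%p md=%p\n",
                (void *)ctx, (void *)md);
        EVP_MD_CTX_destroy(ctx);   /* no-op on NULL */
        return 0;
    }
    EVP_DigestInit_ex(ctx, md, NULL);
    *ctx_out = ctx;
    return 1;
}

Both pointers come back non-NULL here, and it still dies inside EVP_DigestInit_ex().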
strange.
I put a valgrind around it, to watch it crash, and it gives me a kind of stack trace... fwiw:
==511070== 1 errors in context 1 of 3:
==511070== Invalid read of size 4
==511070== at 0x49616D5: __pthread_rwlock_rdlock_full64 (pthread_rwlock_common.c:298)
==511070== by 0x49616D5: pthread_rwlock_rdlock@@GLIBC_2.34 (pthread_rwlock_rdlock.c:26)
==511070== by 0x530F5FC: CRYPTO_THREAD_read_lock (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070== by 0x52FDCD3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070== by 0x52D95C1: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070== by 0x52D9AE3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070== by 0x52C27E1: EVP_MD_fetch (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070== by 0x52C3090: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==511070== by 0x4AF265E: sr_set_sumstr (sr_util.c:865)
==511070== by 0x4ADB124: sr_file2message_seq (sr_post.c:895)
==511070== by 0x4ADB33D: sr_post (sr_post.c:933)
==511070== by 0x4860276: exit_cleanup_posts (libsr3shim.c:1338)
==511070== by 0x4906371: __cxa_finalize (cxa_finalize.c:82)
==511070== Address 0x18 is not stack'd, malloc'd or (recently) free'd
no clue...
so... I've noticed that there are these routines
ctx = EVP_MD_CTX_create();
md = EVP_sha512();
They are supposed to return pointers to fixed data structures in memory, so the calls should return the same pointers every time, but they don't. In the normal calls, the values are like this:
ctx=0x60be06c08950
md=0x79494cec5ba0
It looks like it crashed once with the same values... in another case, just before a crash, they are:
ctx=0x559159a1986
md=0x7024fe6c5ba0
then when it crashes a second time, the values are like this:
ctx=0x5800c20
md=0x54dcba0
A lot shorter... like the upper bits have been masked out or something.
I wrote a small program to call these routines 10 times... the pointer values are the same every time I call them. So, some kind of memory corruption?
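The small program was essentially this (a reconstruction/sketch, not the exact one used):

#include <stdio.h>
#include <openssl/evp.h>

/* print the pointers EVP_MD_CTX_create() and EVP_sha512() hand back,
   ten times in a row, to see whether they move within one process. */
int main(void)
{
    for (int i = 0; i < 10; i++) {
        EVP_MD_CTX *ctx = EVP_MD_CTX_create();
        const EVP_MD *md = EVP_sha512();

        printf("pass %d: ctx=%p md=%p\n", i, (void *)ctx, (void *)md);

        /* freeing each pass lets the allocator hand back the same block,
           so ctx tends to repeat; md is a library-owned table and should
           never change within one process. */
        EVP_MD_CTX_destroy(ctx);
    }
    return 0;
}

Built with something like cc test.c -lcrypto. In a healthy process both pointers stay stable, which is why the truncated-looking values before the crash smell like corruption.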
OK, I have isolated it further... if I remove the copy_file_range entry point, the segfault goes away... so looking in there now.
the routine runs fine when called... does not seem to be doing anything wrong. the branch is here: https://github.com/MetPX/sarrac/tree/issue111_2_nevermerge_minimal_crasher
so in more detail:
OK... boiled stuff away until I got it to work... copy_file_range was failing... the code was assuming that a zero status meant it worked, but this routine actually returns the number of bytes transferred (or -1 on error), so 0 is not success and a positive number is good. I had to re-arrange things to return the correct values.
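In other words, the wrapper has to treat the return value as a byte count and only consider a positive result a success. A minimal sketch of that interposition pattern (not the actual libsr3shim.c code; the posting hook here is just a placeholder):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>

typedef ssize_t (*cfr_fn)(int, off64_t *, int, off64_t *, size_t, unsigned int);

/* forward to the real libc copy_file_range and pass its byte count through
   unchanged: -1 is an error, 0 means nothing was copied, >0 is success. */
ssize_t copy_file_range(int fd_in, off64_t *off_in, int fd_out,
                        off64_t *off_out, size_t len, unsigned int flags)
{
    static cfr_fn real_cfr = NULL;
    if (!real_cfr)
        real_cfr = (cfr_fn) dlsym(RTLD_NEXT, "copy_file_range");

    ssize_t n = real_cfr(fd_in, off_in, fd_out, off_out, len, flags);

    if (n > 0) {
        /* placeholder: here the shim would remember that fd_out was
           written, so the file can be posted later. */
    }
    return n;
}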
Once I did that, I ported it back to the initial fix branch, and... it still crashes... working backwards, as soon as the close and fclose entry points are put back, it starts crashing again. So there are likely at least two problems: the one with copy_file_range seems to be fixed, but there are more in close/fclose that need to be figured out.
OK, the reason it started working is that when copy_file_range returned an error properly, it would never try to post a second time in the same pid.
The problem seems to be that within a single pid, it goes through the motions of trying to post a file twice, and on the second pass, something is corrupt.
The second call can be triggered by the presence of either the copy_file_range() or the close() entry point... either one will cause the double call.
There's nothing wrong with calling it twice; it should not crash anyway. So the fix above just prevents the condition that triggers the core dump from arising; it doesn't actually fix anything.
fractal% grep -a file2message shim_post.log
2024-11-09 12:13:40,012 [DEBUG] shim file2message start with: /home/peter/test/hoho_my_darling.txt sb=0x7ffe947737d0 islnk=0, isdir=0, isreg=1
fractal%
when it crashes, there is a second call to file2message_start.
OK.. that seems to agree with older versions also... older versions only try to post the file once... so they don't crash. Repeated the test on Ubuntu 18 and it does not crash, and only calls file2message_start once.
as expected, if I remove atexit processing, it does not dump core either... (but we miss some posts in other circumstances... so it's a call to post something during atexit processing that triggers it.)
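That matches the valgrind stack above, where exit_cleanup_posts is reached from __cxa_finalize. Schematically the shim does something like this (a sketch; the actual registration in libsr3shim.c may use a destructor rather than atexit):

#include <stdlib.h>

/* the shim queues files it saw being written; an exit handler posts whatever
   is still pending when the process shuts down.  It is that late call, made
   from __cxa_finalize, that ends up inside the crashing sr_post(). */
static void exit_cleanup_posts(void)
{
    /* walk the list of unposted files and call sr_post() on each one */
}

__attribute__((constructor))
static void shim_init(void)
{
    atexit(exit_cleanup_posts);   /* runs after main(), via __cxa_finalize */
}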
so this is the prototype branch: https://github.com/MetPX/sarrac/tree/issue111_3
that would be a merge candidate... but the underlying problem is not fixed. It just has fixes for things that might have been the cause but weren't, though they are good to have anyway:
I have commit f17b47753e49820d36195fa8b345a110c5b7d372, which replaces the entire OpenSSL API in use to match the example from 22.04, but when I tried it on Ubuntu 18.04 it didn't work, so I had to revert. It didn't fix the crash anyway.
Making a similar replacement in the minimal case DID fix it.
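I'm not pasting the whole commit here, but the 22.04-style code is the OpenSSL 3.0 explicit-fetch pattern, roughly like this (a sketch); EVP_MD_fetch()/EVP_MD_free() only exist in OpenSSL 3.x, which is exactly why it can't build against 18.04's libssl 1.1:

#include <openssl/evp.h>

/* OpenSSL 3.0 style: fetch the digest explicitly and release it explicitly,
   instead of relying on the static EVP_sha512() table. */
int sha512_fetched(const unsigned char *buf, size_t len,
                   unsigned char *out, unsigned int *outlen)
{
    int ok = 0;
    EVP_MD *md = EVP_MD_fetch(NULL, "SHA2-512", NULL);
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();

    if (md && ctx &&
        EVP_DigestInit_ex(ctx, md, NULL) &&
        EVP_DigestUpdate(ctx, buf, len) &&
        EVP_DigestFinal_ex(ctx, out, outlen))
        ok = 1;

    EVP_MD_CTX_free(ctx);
    EVP_MD_free(md);    /* balances the explicit fetch */
    return ok;
}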
OK, found a way to demonstrate there is a memory leak in the OpenSSL stuff... using the old code (in the issue111_3 branch):
vi Makefile # ... add --leak-check=full to valgrind line.
make
make test_shim_post
make trust_but_verify # might have to run two or three times... timing issue.
make trust_but_verify
There was a local_post.conf file missing that trust_but_verify needs. By invoking test_shim_post first, that file is found and the test runs properly... leading us to reports of bytes being lost in the SSL library:
==2712403== 1 bytes in 1 blocks are indirectly lost in loss record 1 of 14
==2712403== at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2712403== by 0x5227270: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x521754F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x52171E6: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x5226F96: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x5217337: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x52176D8: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x51F2804: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x51F2AE3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x51DB7E1: EVP_MD_fetch (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x51DC090: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==2712403== by 0x487B294: sr_set_sumstr (sr_util.c:854)
==2712403==
note that sr_util.c:854 is just: EVP_DigestInit_ex(ctx, md, NULL); This report is repeated a number of times.
So... when this is called a couple of times, sometimes there is a core dump. What can I do about that?
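For reference, the full legacy sequence with its cleanup looks roughly like this (a sketch, not the actual sr_util.c code). On OpenSSL 3.x, EVP_DigestInit_ex() does an implicit EVP_MD_fetch() under the hood, which is the allocation the leak report points at; if the ctx is not destroyed on every pass through sr_set_sumstr, blocks like these pile up:

#include <openssl/evp.h>

/* 1.1-era sequence, with the cleanup that balances EVP_MD_CTX_create(). */
int sha512_legacy(const unsigned char *buf, size_t len,
                  unsigned char *out, unsigned int *outlen)
{
    EVP_MD_CTX *ctx = EVP_MD_CTX_create();
    const EVP_MD *md = EVP_sha512();
    int ok = 0;

    if (ctx && md &&
        EVP_DigestInit_ex(ctx, md, NULL) &&
        EVP_DigestUpdate(ctx, buf, len) &&
        EVP_DigestFinal_ex(ctx, out, outlen))
        ok = 1;

    EVP_MD_CTX_destroy(ctx);   /* also drops the implicitly fetched digest */
    return ok;
}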
oh... that commit replacing the usage of the OpenSSL stack? There is no leak with that one (f17b47753e49820d36195fa8b345a110c5b7d372),
but it doesn't work on 18.04 (libssl-1.1 instead of 3.0.x), and it doesn't fix the crash. Hmm...
so one error in the tests, with a binary crashing. Don't know if it is the laptop, or the OS, or something in the environment... did a git bisection, and the issue is not new; it crashed in all versions going back three months.
looking at the log of the script itself:
no idea what is special about this laptop... noting for later investigation.