Open andk opened 1 year ago
@sjaeckel do you have any idea what might have gone wrong in int fortuna_start(prng_state *prng)?
The line of the segfault is https://github.com/DCIT/perl-CryptX/blob/master/src/ltc/prngs/fortuna.c#L234
The first thing that comes to my mind is that the allocated struct isn't big enough.
Could be because LTC_FORTUNA_POOLS is different in the two compile units ... but otherwise ...
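To make the "struct isn't big enough" idea concrete, here is a minimal sketch of what a pool-count mismatch between two compile units would look like. Everything in it is a hypothetical stand-in (the struct names, the pool size of 64 bytes, the counts 4 and 32); it is not the real libtomcrypt layout, only an illustration of the mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the allocating side was built with 4 pools,
 * the initializing side with 32. These numbers are assumptions. */
#define CALLER_POOLS  4
#define LIBRARY_POOLS 32

struct pool { unsigned char state[64]; };

/* Layout as the side that allocates the memory sees it. */
struct prng_as_caller {
    struct pool   pools[CALLER_POOLS];
    unsigned long pool_idx;
};

/* Layout as the side that writes pool_idx sees it. */
struct prng_as_library {
    struct pool   pools[LIBRARY_POOLS];
    unsigned long pool_idx;
};

/* Number of bytes by which a write to pool_idx (library view) would land
 * past the end of the block that the caller allocated. */
size_t layout_overrun(void)
{
    size_t end_of_field = offsetof(struct prng_as_library, pool_idx)
                        + sizeof(unsigned long);
    size_t allocated    = sizeof(struct prng_as_caller);

    assert(end_of_field >= allocated);
    return end_of_field - allocated;
}
```

If the macro really differed, the write to pool_idx would land well outside the allocation, which could crash or silently corrupt memory depending on what lives there. Comparing sizeof(prng_state) as seen from the XS code and from inside ltc would confirm or rule this out.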
How can this be reproduced?
A fresh report with a more recent perl (5.38.0) that exposes the problem: http://www.cpantesters.org/cpan/report/d4da173a-1f77-11ee-a370-d61eba172296
Not every perl with a similar configuration exposes the problem. But it seems that once you have a compilation that exhibits it, it is reproducible. I just let the perl from the report above run the t/prng_fortuna.t test ~1000 times and the SEGV happened every time.
The stack trace for this perl looks practically the same as above:
Reading symbols from /home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.38.0/29fb/bin/perl...
[New LWP 2944018]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.38.0/29fb/bin/per'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f778374b0f3 in fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:234
234 prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
(gdb) bt
#0 0x00007f778374b0f3 in fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:234
#1 fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:217
#2 0x00007f7783717bed in XS_Crypt__PRNG_new (my_perl=0x556d8d1ba2a0, cv=<optimized out>) at ./inc/CryptX_PRNG.xs.inc:36
#3 0x0000556d8bdba514 in Perl_pp_entersub (my_perl=0x556d8d1ba2a0) at pp_hot.c:5555
#4 0x0000556d8bd6937a in Perl_runops_debug (my_perl=0x556d8d1ba2a0) at dump.c:2861
#5 0x0000556d8bc7d8b8 in S_run_body (oldscope=1, my_perl=0x556d8d1ba2a0) at perl.c:2812
#6 perl_run (my_perl=0x556d8d1ba2a0) at perl.c:2727
#7 0x0000556d8bc43475 in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:127
How can I reproduce this locally? Can I somehow get access to this exact version that fails?
I tried it locally with the latest version and
$ perl --version
This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-linux-thread-multi
[...]
$ make test
[...]
All tests successful.
Files=137, Tests=39024, 18 wallclock secs ( 1.20 usr 0.26 sys + 16.01 cusr 1.31 csys = 18.78 CPU)
Result: PASS
It is not easy to reproduce. I tried building a perl-5.36.1 binary on Ubuntu 22.04 with the same options as in the original failing report:
./Configure \
-Dprefix=/home/miko/myperl-out \
-Dmyhostname=myhost \
-Dinstallusrbinperl=n \
-Uversiononly \
-Dusedevel \
-Ui_db \
-Dlibswanted='cl pthread socket inet nsl gdbm dbm malloc dl ld sun m crypt sec util c cposix posix ucb BSD gdbm_compat' \
-Duseithreads \
-Uuselongdouble \
-DEBUGGING=both \
-des
But I was unable to reproduce the failure in the t/prng_fortuna.t test.
I have been able to reproduce it. The problem is in these innocent looking lines.
prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
prng->u.fortuna.reset_cnt = prng->u.fortuna.wd = 0;
Somehow those can result in a null-pointer dereference. I don't understand what's going on here either: it only happens with -O2, with -O0 it runs fine. Is this a compiler bug, or are we missing something obvious that's undefined in C?
I worked around it by removing those two lines and using this instead (before initializing the pools):
memset(&prng->u.fortuna, '\0', sizeof(struct fortuna_prng));
Obviously, this is not a very satisfying fix.
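For anyone comparing the two variants, a minimal sketch of both reset styles on a simplified struct. The struct below is a hypothetical stand-in and not the real struct fortuna_prng; only the four counter names are taken from the lines quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the Fortuna state. */
struct demo_fortuna {
    unsigned char pool[4][32];
    unsigned long pool_idx, pool0_len, reset_cnt, wd;
};

/* Field-by-field reset, mirroring the two chained assignments. */
void reset_fields(struct demo_fortuna *f)
{
    assert(f != NULL);
    f->pool_idx  = f->pool0_len = 0;
    f->reset_cnt = f->wd        = 0;
}

/* Whole-struct reset, mirroring the memset workaround. */
void reset_all(struct demo_fortuna *f)
{
    assert(f != NULL);
    memset(f, 0, sizeof *f);
}

/* Returns 1 if all four counters are zero, 0 otherwise. */
int counters_are_zero(const struct demo_fortuna *f)
{
    return f->pool_idx == 0 && f->pool0_len == 0 &&
           f->reset_cnt == 0 && f->wd == 0;
}
```

Both leave the four counters at zero; the difference is that memset also clears everything else in the struct and gives the compiler a different store pattern to generate. That may be why it sidesteps the crash, but it does not explain it.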
@sjaeckel ^^^
@karel-m I'm already watching this issue :)
I have been able to reproduce it.
@Leont How?
I worked around it by [...]
memset(&prng->u.fortuna, '\0', sizeof(struct fortuna_prng));
TBH I would prefer to leave the fortuna code as it is and wait for the moment when someone solves the underlying problem, since that can't be the real solution. Or am I mistaken here?
Just for completeness here is a code fragment from my perl xs/c module, something may be wrong here:
typedef struct prng_struct { /* used by Crypt::PRNG */
prng_state state;
struct ltc_prng_descriptor *desc;
IV last_pid;
} *Crypt__PRNG;
/* ================================ */
Newz(0, RETVAL, 1, struct prng_struct); // memory allocation of prng_struct
if (!RETVAL) croak("FATAL: Newz failed");
id = cryptx_internal_find_prng(prng_name);
if (id == -1) {
Safefree(RETVAL);
croak("FATAL: find_prng failed for '%s'", prng_name);
}
RETVAL->last_pid = curpid;
RETVAL->desc = &prng_descriptor[id];
rv = RETVAL->desc->start(&RETVAL->state); // the crash
if (rv != CRYPT_OK) {
Safefree(RETVAL);
croak("FATAL: PRNG_start failed: %s", error_to_string(rv));
}
It is also worth mentioning that the same code works without a crash for Crypt::PRNG::ChaCha20 / Crypt::PRNG::RC4 / Crypt::PRNG::Sober128 / Crypt::PRNG::Yarrow; the only difference is the id returned by cryptx_internal_find_prng(). This supports the idea that it is fortuna specific.
IMO that code looks fine.
As pointed out by @Leont the crash also doesn't happen on the call of start() but inside the function, which looks even stranger. I guess there's no way to find out what really goes wrong without having a reproducer of the crash and investigating in depth.
@Leont How?
I suspect the issue only occurs on debugging perls. I don't fully understand that, because AFAICT that shouldn't affect the crypto code at all.
only occurs on debugging perls
That doesn't matter, it shouldn't happen. Please write down how it can be reproduced :)
While looking through the Perl internals regarding memory management ... Could this issue be related to mixing native and Perl-specific malloc/free calls? Using native malloc to allocate memory but Perl-free to free it or vice versa?
Have you ever thought of using the Perl-specific malloc/free calls inside ltc/ltm instead of the native ones? Since the macro magic involved is quite extensive before you arrive at the Perl MM function that is actually called, I guess the easiest would be to trampoline those inside cryptx ...
void* cryptx_malloc(size_t sz)
{
char *mem;
Newxz(mem, sz, char); /* zero-filled; Newxz takes an element count and a type, not a byte size */
return mem;
}
void cryptx_free(void *mem)
{
Safefree(mem);
}
/* etc. */
Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.
Or do you already do that and I missed it while searching through the sources? :)
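A minimal sketch of the trampoline idea that compiles without the Perl headers. The demo_* names and the allocation counter are hypothetical and exist only for illustration; in CryptX the wrappers would forward to Perl's allocator instead of calloc/free. The #define lines stand in for what passing -DXMALLOC=... and -DXFREE=... on the compiler command line would do.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical trampolines; a counter makes the redirection observable. */
static long live_allocations = 0;

void *demo_malloc(size_t sz)
{
    void *p = calloc(1, sz);          /* zero-filled, like Newz/Newxz */
    if (p != NULL) live_allocations++;
    return p;
}

void demo_free(void *mem)
{
    if (mem != NULL) live_allocations--;
    free(mem);
}

/* Equivalent of -DXMALLOC=demo_malloc -DXFREE=demo_free. */
#ifndef XMALLOC
#define XMALLOC demo_malloc
#endif
#ifndef XFREE
#define XFREE demo_free
#endif

/* Library-style code only ever names XMALLOC/XFREE, so every allocation
 * and release goes through the same pair of functions. */
unsigned char *make_buffer(size_t len)
{
    return XMALLOC(len);
}

void drop_buffer(unsigned char *buf)
{
    XFREE(buf);
}

long demo_live_allocations(void)
{
    return live_allocations;
}
```

The point of routing both macros through one pair of functions is that memory can never be allocated by one allocator and released by another.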
Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.
That would be -DXMALLOC=PerlMem_malloc -DXFREE=PerlMem_free etc…
But I don't think that's what's going on here.
Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.
That would be -DXMALLOC=PerlMem_malloc -DXFREE=PerlMem_free etc…
https://github.com/Perl/perl5/blob/dd4eb78c55aab441aec1639b1dd49f88bd960831/perl.h#L1697-L1739
You're sure?
But I don't think that's what's going on here.
If nobody reveals how it can be reproduced I'm pretty sure we will never find out.
Please write down how it can be reproduced :)
If using perlbrew, compile a perl with «perlbrew install perl-5.38.0 --debug --thread», and install the distribution on that perl.
If using perlbrew, compile a perl with «perlbrew install perl-5.38.0 --debug --thread», and install the distribution on that perl.
It still doesn't fail on any of my machines... and with those two (slightly) different build configurations.
After looking through some of the failed builds on https://www.cpantesters.org/distro/C/CryptX.html I saw that all of the segfaults were on a machine called k93msid ... maybe there's something wrong on that box? Would it be possible to get SSH access to that machine?
@Leont you've been able to reproduce the issue on a machine that you have access to?
you've been able to reproduce the issue on a machine that you have access to?
Yes, I can reliably reproduce it on my computer.
Can you maybe tell me all the details of the tools you're using in the process? Which Distro and Compiler versions are you using? Can you please write down the exact command how you run all the tools? perlbrew etc.? Or could you maybe even create a docker image to reproduce this, based on your distro?
Or do you see another way how we can debug this?
karel-m added the should be fixed in libtomcrypt label
I can confirm I can not reproduce the issue with CryptX 0.080_006
@karel-m what does that label exactly mean? First I thought that an issue tagged with this label "is fixed in ltc". After having a second thought is it instead "depends on ltc to be fixed"? IIUC @Leont understood the former!? I'd now say it's the latter, because we didn't change anything relevant in ltc :)
@sjaeckel the label indicates that the issue requires a fix in the libtomcrypt sources (at least, that's my opinion, which you might not share :). Maybe I should rename it to "needs a fix in libtomcrypt."
FYI CryptX 0.080_006 = libtomcrypt current develop branch 12bf723b which includes many changes since CryptX 0.080. Interestingly, there were basically no changes to the Fortuna code, so I have no idea why the above reported issue seems to have disappeared.
at least, that's my opinion, which you might not share :).
I'm sharing your opinion and I doubt that the underlying issue is fixed.
@Leont which CPU model does the computer you were seeing this on have?
Maybe I should rename it to "needs a fix in libtomcrypt."
:+1:
@Leont which CPU model does the computer you were seeing this on have?
AMD Ryzen 5 3600 6-Core Processor. gcc version 14.1.1
OK, that CPU has AES-NI support.
... and is AES-NI even enabled? Never mind. I was just thinking aloud and I still don't get where the problem could originate...
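One cheap thing that could be checked while debugging, offered as a sketch and not as a diagnosis: whether the pointer handed to fortuna_start() meets the alignment the struct type asks for. In general, when a type is declared with a stricter alignment than the allocator provides, an optimizing build may emit aligned vector stores that fault, while an unoptimized build gets away with it. Whether prng_state has such a requirement in the failing build is an assumption that this thread does not establish. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of alignment, 0 otherwise.
 * alignment must be a non-zero power of two. */
int is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (uintptr_t)(alignment - 1)) == 0;
}

/* Convenience wrapper for real pointers. */
int ptr_is_aligned(const void *p, size_t alignment)
{
    return is_aligned((uintptr_t)p, alignment);
}
```

Applied to the address from the backtrace above (prng=0x556d8db50358), this reports 8-byte alignment but not 16-byte alignment. Printing the alignment requirement of prng_state in the failing build and comparing it against that would show whether this line of thought is worth pursuing.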
Sample fail report: http://www.cpantesters.org/cpan/report/9dadce3e-e8fc-11ed-a654-b70f1145618a
With that same perl I produced a core file and then got this stack trace: