SEGV with debugging perls with multiplicity on

andk commented 1 year ago

Sample fail report: http://www.cpantesters.org/cpan/report/9dadce3e-e8fc-11ed-a654-b70f1145618a

With that same perl I produced a core file and then got this stack trace:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.36.1/29fb/bin/per'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fae1a749e03 in fortuna_start (prng=0x55a6156dd838) at ltc/prngs/fortuna.c:234
234        prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
(gdb) bt
#0  0x00007fae1a749e03 in fortuna_start (prng=0x55a6156dd838) at ltc/prngs/fortuna.c:234
#1  fortuna_start (prng=0x55a6156dd838) at ltc/prngs/fortuna.c:217
#2  0x00007fae1a6f3552 in XS_Crypt__PRNG_new (my_perl=0x55a614cb02a0, cv=<optimized out>) at ./inc/CryptX_PRNG.xs.inc:36
#3  0x000055a61402a834 in Perl_pp_entersub (my_perl=0x55a614cb02a0) at pp_hot.c:5353
#4  0x000055a613fdf0ea in Perl_runops_debug (my_perl=0x55a614cb02a0) at dump.c:2677
#5  0x000055a613f2d999 in S_run_body (oldscope=1, my_perl=0x55a614cb02a0) at perl.c:2721
#6  perl_run (my_perl=0x55a614cb02a0) at perl.c:2644
#7  0x000055a613eee46e in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:110

karel-m commented 1 year ago

@sjaeckel do you have any idea what might have gone wrong in int fortuna_start(prng_state *prng)?

The line of the segfault is https://github.com/DCIT/perl-CryptX/blob/master/src/ltc/prngs/fortuna.c#L234

sjaeckel commented 1 year ago

The first thing that comes to my mind is that the allocated struct isn't big enough.

Could be because LTC_FORTUNA_POOLS is different in the two compile units ... but otherwise ...

How can this be reproduced?

andk commented 1 year ago

A fresh report with a more recent perl (5.38.0) that exposes the problem: http://www.cpantesters.org/cpan/report/d4da173a-1f77-11ee-a370-d61eba172296

Not every perl with a similar configuration exposes the problem. But it seems that once you have a build that exhibits it, it is reproducible. I just let the perl from the report above run the t/prng_fortuna.t test ~1000 times, and the SEGV happened every time.

The stack trace for this perl looks practically the same as above:

Reading symbols from /home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.38.0/29fb/bin/perl...
[New LWP 2944018]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.38.0/29fb/bin/per'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f778374b0f3 in fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:234
234        prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
(gdb) bt
#0  0x00007f778374b0f3 in fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:234
#1  fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:217
#2  0x00007f7783717bed in XS_Crypt__PRNG_new (my_perl=0x556d8d1ba2a0, cv=<optimized out>) at ./inc/CryptX_PRNG.xs.inc:36
#3  0x0000556d8bdba514 in Perl_pp_entersub (my_perl=0x556d8d1ba2a0) at pp_hot.c:5555
#4  0x0000556d8bd6937a in Perl_runops_debug (my_perl=0x556d8d1ba2a0) at dump.c:2861
#5  0x0000556d8bc7d8b8 in S_run_body (oldscope=1, my_perl=0x556d8d1ba2a0) at perl.c:2812
#6  perl_run (my_perl=0x556d8d1ba2a0) at perl.c:2727
#7  0x0000556d8bc43475 in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:127

sjaeckel commented 1 year ago

How can I reproduce this locally? Can I somehow get access to this exact version that fails?

I tried it locally with the latest version and all tests pass:

$ perl --version

This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-linux-thread-multi
[...]
$ make test
[...]
All tests successful.
Files=137, Tests=39024, 18 wallclock secs ( 1.20 usr  0.26 sys + 16.01 cusr  1.31 csys = 18.78 CPU)
Result: PASS

karel-m commented 1 year ago

It is not easy to reproduce. I have tried to build a perl-5.36.1 binary on Ubuntu 22.04 with the same options as in the original failing report:

./Configure \
    -Dprefix=/home/miko/myperl-out \
    -Dmyhostname=myhost \
    -Dinstallusrbinperl=n \
    -Uversiononly \
    -Dusedevel \
    -Ui_db \
    -Dlibswanted='cl pthread socket inet nsl gdbm dbm malloc dl ld sun m crypt sec util c cposix posix ucb BSD gdbm_compat' \
    -Duseithreads \
    -Uuselongdouble \
    -DEBUGGING=both \
    -des

But I was unable to reproduce the failure in the t/prng_fortuna.t test.

Leont commented 1 year ago

I have been able to reproduce it. The problem is in these innocent-looking lines.

   prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
   prng->u.fortuna.reset_cnt = prng->u.fortuna.wd = 0;

Somehow those can result in a null-pointer dereference. I don't understand what's going on here either: it only happens with -O2; with -O0 it runs fine. Is this a compiler bug, or are we missing something obvious that's undefined in C?

I worked around it by removing those two lines and using this instead (before initializing the pools):

memset(&prng->u.fortuna, '\0', sizeof(struct fortuna_prng));

Obviously, this is not a very satisfying fix.

karel-m commented 1 year ago

@sjaeckel ^^^

sjaeckel commented 1 year ago

@karel-m I'm already watching this issue :)

I have been able to reproduce it.

@Leont How?

I worked around it by [...]
memset(&prng->u.fortuna, '\0', sizeof(struct fortuna_prng));

TBH I would prefer to leave the fortuna code as it is and wait for the moment when someone solves the underlying problem, since that can't be the real solution. Or am I mistaken here?

karel-m commented 1 year ago

Just for completeness here is a code fragment from my perl xs/c module, something may be wrong here:

typedef struct prng_struct {            /* used by Crypt::PRNG */
  prng_state state;
  struct ltc_prng_descriptor *desc;
  IV last_pid;
} *Crypt__PRNG;

/* ================================ */

        Newz(0, RETVAL, 1, struct prng_struct);                        // memory allocation of prng_struct
        if (!RETVAL) croak("FATAL: Newz failed");

        id = cryptx_internal_find_prng(prng_name);
        if (id == -1) {
          Safefree(RETVAL);
          croak("FATAL: find_prng failed for '%s'", prng_name);
        }
        RETVAL->last_pid = curpid;
        RETVAL->desc = &prng_descriptor[id];

        rv = RETVAL->desc->start(&RETVAL->state);                     // the crash
        if (rv != CRYPT_OK) {
          Safefree(RETVAL);
          croak("FATAL: PRNG_start failed: %s", error_to_string(rv));
        }

karel-m commented 1 year ago

It is also worth mentioning that the same code works without a crash for Crypt::PRNG::ChaCha20 / Crypt::PRNG::RC4 / Crypt::PRNG::Sober128 / Crypt::PRNG::Yarrow; the only difference is the id returned by cryptx_internal_find_prng(). That supports the idea that the problem is Fortuna-specific.

sjaeckel commented 1 year ago

https://github.com/DCIT/perl-CryptX/blob/fc61205e5fd4a464c0d69ff1440436a63693fbf4/CryptX.xs#L121-L125

https://github.com/DCIT/perl-CryptX/blob/fc61205e5fd4a464c0d69ff1440436a63693fbf4/inc/CryptX_PRNG.xs.inc#L25-L40

IMO that code looks fine.

As pointed out by @Leont the crash also doesn't happen on the call of start() but inside the function, which looks even stranger. I guess there's no way to find out what really goes wrong without having a reproducer of the crash and investigating in depth.

Leont commented 1 year ago

@Leont How?

I suspect the issue only occurs on debugging perls. I don't fully understand that, because AFAICT that shouldn't affect the crypto code at all.

sjaeckel commented 1 year ago

only occurs on debugging perls

That doesn't matter, it shouldn't happen. Please write down how it can be reproduced :)

sjaeckel commented 1 year ago

While looking through the Perl internals regarding memory management ... Could this issue be related to mixing native and Perl-specific malloc/free calls? Using native malloc to allocate memory but Perl-free to free it or vice versa?

Have you ever thought of using the Perl-specific malloc/free calls inside ltc/ltm instead of the native ones? Since the macro magic involved is quite extensive before you arrive at the Perl memory-management function that is actually called, I guess the easiest would be to trampoline those inside cryptx ...

void *cryptx_malloc(size_t sz)
{
   void *mem;
   Newz(0, mem, sz, char);   /* zero-filled allocation of sz bytes via Perl's allocator */
   return mem;
}
void cryptx_free(void *mem)
{
   Safefree(mem);
}
/* etc. */

Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.

Or do you already do that and I missed it while searching through the sources? :)

Leont commented 1 year ago

Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.

That would be -DXMALLOC=PerlMem_malloc -DXFREE=PerlMem_free etc…

But I don't think that's what's going on here.

sjaeckel commented 1 year ago

Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.

That would be -DXMALLOC=PerlMem_malloc -DXFREE=PerlMem_free etc…

https://github.com/Perl/perl5/blob/dd4eb78c55aab441aec1639b1dd49f88bd960831/perl.h#L1697-L1739

You're sure?

But I don't think that's what's going on here.

If nobody reveals how it can be reproduced I'm pretty sure we will never find out.

Leont commented 1 year ago

Please write down how it can be reproduced :)

If using perlbrew, compile a perl with «perlbrew install perl-5.38.0 --debug --thread», and install the distribution on that perl.

sjaeckel commented 1 year ago

If using perlbrew, compile a perl with «perlbrew install perl-5.38.0 --debug --thread», and install the distribution on that perl.

perl_V.txt

perl_V2.txt

It still doesn't fail on any of my machines, even with those two (slightly) different build configurations.

After looking through some of the failed builds on https://www.cpantesters.org/distro/C/CryptX.html I saw that all of the segfaults were on a machine called k93msid ... maybe there's something wrong on that box? Would it be possible to get SSH access to that machine?

sjaeckel commented 1 year ago

@Leont you've been able to reproduce the issue on a machine that you have access to?

Leont commented 1 year ago

you've been able to reproduce the issue on a machine that you have access to?

Yes, I can reliably reproduce it on my computer.

sjaeckel commented 1 year ago

Can you maybe tell me all the details of the tools you're using in the process? Which distro and compiler versions are you using? Can you please write down the exact commands you use to run all the tools (perlbrew etc.)? Or could you maybe even create a docker image, based on your distro, to reproduce this?

Or do you see another way we can debug this?

Leont commented 2 months ago

karel-m added the "should be fixed in libtomcrypt" label

I can confirm I cannot reproduce the issue with CryptX 0.080_006.

sjaeckel commented 2 months ago

@karel-m what exactly does that label mean? At first I thought that an issue tagged with this label "is fixed in ltc". On second thought, is it instead "depends on ltc to be fixed"? IIUC @Leont understood the former!? I'd now say it's the latter, because we didn't change anything relevant in ltc :)

karel-m commented 2 months ago

@sjaeckel the label indicates that the issue requires a fix in the libtomcrypt sources (at least, that's my opinion, which you might not share :). Maybe I should rename it to "needs a fix in libtomcrypt."

FYI CryptX 0.080_006 = libtomcrypt current develop branch 12bf723b which includes many changes since CryptX 0.080. Interestingly, there were basically no changes to the Fortuna code, so I have no idea why the above reported issue seems to have disappeared.

sjaeckel commented 2 months ago

at least, that's my opinion, which you might not share :).

I'm sharing your opinion and I doubt that the underlying issue is fixed.

@Leont which CPU model does the computer have you were seeing this on?

Maybe I should rename it to "needs a fix in libtomcrypt."

:+1:

Leont commented 2 months ago

@Leont which CPU model does the computer have you were seeing this on?

AMD Ryzen 5 3600 6-Core Processor. gcc version 14.1.1

sjaeckel commented 2 months ago

OK, that CPU has AES-NI support.

... and is AES-NI even enabled? Never mind. I was just thinking aloud, and I still don't get where the problem could originate...

DCIT/perl-CryptX issue #90: SEGV with debugging perls with multiplicity on