Possible memory corruption in pow(?)

Quickcheck runs will very occasionally fail with an error like

thread 'safe' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', ../src/libcore/result.rs:736

which appears to be caused by the &str being parsed having been corrupted (there's no way the generator will generate an incorrect digit). I managed to catch this with rr with an opt+debuginfo build, which seems to point the finger at https://github.com/Aatch/ramp/blob/8c6676202ab416a636d5f9079e439fca442e6982/src/ll/mul.rs#L420-L422 but the optimisations mean a lot of stuff is optimised out.

The following is the exact pow call in the trace I've found. The corruption appears to occur when exp is 1 in https://github.com/Aatch/ramp/blob/8c6676202ab416a636d5f9079e439fca442e6982/src/ll/pow.rs#L70-L94 .

extern crate ramp;
use ramp::Int;

fn parse(s: &str) -> Int {
    Int::from_str_radix(s, 16).unwrap()
}
fn main() {
    let ar = parse("3283bc9fffffb00000000000000000000000000000000005fffffffffffffffffffffffffffffffffffffffffffe700000000000000000000000000000000000000000000000000714fffff930000000000000000000000000000000000000000000000000000000000000000000005a000000000000000000000000000f93bad4358ffffffff3fff350a940000000000000102fffffffffffffffffffffffffffffffffffffffffffffffffffffffffff");
    ar.pow(5);
}

However, this doesn't seem reproduce any corruption by itself.

Backtrace:

#0  0x0000556993f51bac in ramp::ll::mul::mul_unbalanced (wp=<optimised out>, xp=<optimised out>, 
    xs=<optimised out>, yp=0x7f277684c880, ys=<optimised out>, scratch=0x7f27768b5610)
    at src/ll/mul.rs:435
#1  0x0000556993f5124d in ramp::ll::mul::mul (wp=0x7f2776844b88, xp=0x7f2776844810, xs=89, yp=0x0, ys=23)
    at src/ll/mul.rs:203
#2  0x0000556993f545a2 in ramp::ll::pow::pow (wp=<optimised out>, ap=<optimised out>, an=<optimised out>, 
    exp=1) at src/ll/pow.rs:75
#3  0x0000556993f576a5 in ramp::int::Int::pow (self=<optimised out>, exp=5) at src/int.rs:422
#4  0x0000556993e4af85 in quickcheck::pow::pow (a=BigIntStr = {...}, b=5) at tests/quickcheck.rs:180
#5  0x0000556993e449a5 in fnfn ()
    at /home/huon/.multirust/toolchains/nightly/cargo/registry/src/github.com-0a35038f75765ae4/quickcheck-0.2.24/src/tester.rs:310
#6  0x0000556993e44528 in fnfn () at ../src/libstd/thread/mod.rs:271
#7  0x0000556993e444ca in quickcheck::sys_common::unwind::try::try_fn<closure> (
    opt_closure=0x7f27ba2fbb10 "") at ../src/libstd/sys/common/unwind/mod.rs:156
#8  0x0000556993fb7869 in __rust_try ()
#9  0x0000556993fb4a1c in sys_common::unwind::try::inner_try::he2b5e2c66f0c0baaMbs ()
#10 0x0000556993e44438 in quickcheck::sys_common::unwind::try<closure> (f=closure = {...})
    at ../src/libstd/sys/common/unwind/mod.rs:126
#11 0x0000556993e44217 in fnfn () at ../src/libstd/thread/mod.rs:271
#12 0x0000556993e44d09 in quickcheck::boxed::F.FnBox<A>::call_box (self=0x7f28010ed430, args=0)
    at ../src/liballoc/boxed.rs:521
#13 0x0000556993fb9424 in sys::thread::_$LT$impl$GT$::new::thread_start::h73f43dc58e7e67e9wEw ()
#14 0x00007f28029489b4 in ?? () from /usr/bin/../lib/librrpreload.so
#15 0x00007f2801bda6aa in start_thread (arg=0x7f27ba2fc700) at pthread_create.c:333
#16 0x00007f2802403eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I'm working on this, including trying to get a better capture, via:

while nice rr target/debug/quickcheck-d576313f0448fe02; do :; done

Ok, I think I have some idea about what is causing this. I managed to get an rr trace of a failure due to:

test pow ... thread 'safe' panicked at 'assertion failed: !overlap(wp, xs + ys, yp, ys)', src/ll/mul.rs:191

I think the problem here is the num_pow_limbs calculation is too precise, and doesn't account for a zero limb written as the highest significant limb, e.g. consider 1 * 1 (all literals u64).

the correct answer is 1
num_pow_limbs will deduce that 1 limb is needed and so pointers are allocated with this much space (call the result one p)
the limb multiplication is done as 64-bit * 64-bit -> 128-bit multiplication is done, giving (0, 1)
both of these are written to memory, 1 at p and 0 at p.offset(1) but p only has size 1 so the latter is out of bounds

The corruption happens rarely because it requires just the right set-up:

some power where the last multiplication needed to compute it results in a the upper 64-bits of the 128-bit limb mult being 0 (if it is non-zero then num_pow_limbs will definitely give enough space)
the allocations having just the right sizes, and happening in the right order so that the out-of-bounds 0 lands in some other allocation
that other allocation is used later without being reinitialised

The overlap assertion failure seems to happen even more rarely, because is basically requires xs + ys to be exactly (or +/-1, not sure... in any case, very close) the size of a jemalloc size class, and yp to be allocated immediately after wp. The assertion failure I'm looking at now has xs + ys == 768 (0x300).

I'll investigate this theory in more detail tomorrow.

(As extra evidence in favour of this, all the memory corruption I've seen has been strings being modified to have their first 8 bytes being 0.)

Aatch / ramp

Possible memory corruption in pow(?) #47