Aatch / ramp

RAMP - Rust Arithmetic in Multiple Precision
Apache License 2.0
261 stars 38 forks source link

Possible memory corruption in pow(?) #47

Closed huonw closed 9 years ago

huonw commented 9 years ago

Quickcheck runs will very occasionally fail with an error like

thread 'safe' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', ../src/libcore/result.rs:736

which appears to be caused by the &str being parsed having been corrupted (there's no way the generator will generate an incorrect digit). I managed to catch this with rr with an opt+debuginfo build, which seems to point the finger at https://github.com/Aatch/ramp/blob/8c6676202ab416a636d5f9079e439fca442e6982/src/ll/mul.rs#L420-L422 but the optimisations mean a lot of stuff is optimised out.

The following is the exact pow call in the trace I've found. The corruption appears to occur when exp is 1 in https://github.com/Aatch/ramp/blob/8c6676202ab416a636d5f9079e439fca442e6982/src/ll/pow.rs#L70-L94 .

extern crate ramp;
use ramp::Int;

fn parse(s: &str) -> Int {
    Int::from_str_radix(s, 16).unwrap()
}
fn main() {
    let ar = parse("3283bc9fffffb00000000000000000000000000000000005fffffffffffffffffffffffffffffffffffffffffffe700000000000000000000000000000000000000000000000000714fffff930000000000000000000000000000000000000000000000000000000000000000000005a000000000000000000000000000f93bad4358ffffffff3fff350a940000000000000102fffffffffffffffffffffffffffffffffffffffffffffffffffffffffff");
    ar.pow(5);
}

However, this doesn't seem reproduce any corruption by itself.

Backtrace:

#0  0x0000556993f51bac in ramp::ll::mul::mul_unbalanced (wp=<optimised out>, xp=<optimised out>, 
    xs=<optimised out>, yp=0x7f277684c880, ys=<optimised out>, scratch=0x7f27768b5610)
    at src/ll/mul.rs:435
#1  0x0000556993f5124d in ramp::ll::mul::mul (wp=0x7f2776844b88, xp=0x7f2776844810, xs=89, yp=0x0, ys=23)
    at src/ll/mul.rs:203
#2  0x0000556993f545a2 in ramp::ll::pow::pow (wp=<optimised out>, ap=<optimised out>, an=<optimised out>, 
    exp=1) at src/ll/pow.rs:75
#3  0x0000556993f576a5 in ramp::int::Int::pow (self=<optimised out>, exp=5) at src/int.rs:422
#4  0x0000556993e4af85 in quickcheck::pow::pow (a=BigIntStr = {...}, b=5) at tests/quickcheck.rs:180
#5  0x0000556993e449a5 in fnfn ()
    at /home/huon/.multirust/toolchains/nightly/cargo/registry/src/github.com-0a35038f75765ae4/quickcheck-0.2.24/src/tester.rs:310
#6  0x0000556993e44528 in fnfn () at ../src/libstd/thread/mod.rs:271
#7  0x0000556993e444ca in quickcheck::sys_common::unwind::try::try_fn<closure> (
    opt_closure=0x7f27ba2fbb10 "") at ../src/libstd/sys/common/unwind/mod.rs:156
#8  0x0000556993fb7869 in __rust_try ()
#9  0x0000556993fb4a1c in sys_common::unwind::try::inner_try::he2b5e2c66f0c0baaMbs ()
#10 0x0000556993e44438 in quickcheck::sys_common::unwind::try<closure> (f=closure = {...})
    at ../src/libstd/sys/common/unwind/mod.rs:126
#11 0x0000556993e44217 in fnfn () at ../src/libstd/thread/mod.rs:271
#12 0x0000556993e44d09 in quickcheck::boxed::F.FnBox<A>::call_box (self=0x7f28010ed430, args=0)
    at ../src/liballoc/boxed.rs:521
#13 0x0000556993fb9424 in sys::thread::_$LT$impl$GT$::new::thread_start::h73f43dc58e7e67e9wEw ()
#14 0x00007f28029489b4 in ?? () from /usr/bin/../lib/librrpreload.so
#15 0x00007f2801bda6aa in start_thread (arg=0x7f27ba2fc700) at pthread_create.c:333
#16 0x00007f2802403eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I'm working on this, including trying to get a better capture, via:

while nice rr target/debug/quickcheck-d576313f0448fe02; do :; done
huonw commented 9 years ago

Ok, I think I have some idea about what is causing this. I managed to get an rr trace of a failure due to:

test pow ... thread 'safe' panicked at 'assertion failed: !overlap(wp, xs + ys, yp, ys)', src/ll/mul.rs:191

I think the problem here is the num_pow_limbs calculation is too precise, and doesn't account for a zero limb written as the highest significant limb, e.g. consider 1 * 1 (all literals u64).

The corruption happens rarely because it requires just the right set-up:

The overlap assertion failure seems to happen even more rarely, because is basically requires xs + ys to be exactly (or +/-1, not sure... in any case, very close) the size of a jemalloc size class, and yp to be allocated immediately after wp. The assertion failure I'm looking at now has xs + ys == 768 (0x300).

I'll investigate this theory in more detail tomorrow.

(As extra evidence in favour of this, all the memory corruption I've seen has been strings being modified to have their first 8 bytes being 0.)