AaronKutch / awint

Arbitrary width integers

Using `awint` on AVR #12

Open xphoniex opened 1 year ago

xphoniex commented 1 year ago

here's what I have now:

use awint::{cc, inlawi, inlawi_ty, Bits, InlAwi};

let mut Gx: inlawi_ty!(512) = inlawi!(0i128, 0i128, 0111100110111110011001100111111011111001110111001011101110101100010101011010000001100010100101011100111010000111000010110000011100000010100110111111110011011011001011011100111000101000110110010101100111110010100000010101101100010110111110000001011110011000);
let mut pad: inlawi_ty!(512) = InlAwi::zero();
println!("{:x}", Gx);

let minus_1: inlawi_ty!(512) = inlawi!(0i128, 0i128, 0i128, -1i128);
println!("{:x}", minus_1);

Gx.mul_assign(&minus_1, &mut pad);
println!("{:x}", Gx);

output:

0x79be667e_f9dcbbac_55a06295_ce870b07_029bfcdb_2dce28d9_59f2815b_16f81798_u512
0xffffffff_ffffffff_ffffffff_ffffffff_u512
0x79be667e_f9dcbbac_55a06295_ce870b06_88dd965c_33f16d2d_04521ec5_48710c90_fd640324_d231d726_a60d7ea4_e907e868_u512

but here's what I'm getting from Python:

hex((0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798 * -1) & (2**512-1))
'0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff8641998106234453aa5f9d6a3178f4f8fd640324d231d726a60d7ea4e907e868'
AaronKutch commented 1 year ago

It looks like you are using an older version of awint; a newer version uses a trailing `_` shorthand for assignment operations. Like the documentation on Bits says, it does not know signedness. Instead, "If a method’s documentation does not mention signedness, it either works for both kinds or views the bits as a plain bit string with no integral properties". If you just want to negate a number, you can use awi.neg_(true). If you want to use the mul_ function like the Python example, the minus_1 variable needs to be inlawi!(-1_i512), or in other words the 512-bit signed representation of -1. Instead, you are using concatenation to attach 3*128 leading zero bits to -1i128, which is not the same as -1i512. Why are you doing this? I'm realizing that maybe I should add a tutorial on two's complement somewhere. -1 in two's complement is always all ones.

    use awint::awi::*;

    let mut gx: inlawi_ty!(512) = inlawi!(0i128, 0i128, 0111100110111110011001100111111011111001110111001011101110101100010101011010000001100010100101011100111010000111000010110000011100000010100110111111110011011011001011011100111000101000110110010101100111110010100000010101101100010110111110000001011110011000);
    let mut pad: inlawi_ty!(512) = InlAwi::zero();
    println!("{:x}", gx);

    //let minus_1: inlawi_ty!(512) = inlawi!(-1i128, -1i128, -1i128, -1i128);
    // or
    let minus_1: inlawi_ty!(512) = inlawi!(-1i512);
    // or
    //let minus_1: inlawi_ty!(512) = inlawi!(umax: ..512);
    println!("{:x}", minus_1);

    gx.mul_(&minus_1, &mut pad);
    println!("{:x}", gx);

    // or you could have called
    //gx.neg_(true);

0xffffffff_ffffffff_ffffffff_ffffffff_ffffffff_ffffffff_ffffffff_ffffffff_86419981_06234453_aa5f9d6a_3178f4f8_fd640324_d231d726_a60d7ea4_e907e868_u512

AaronKutch commented 1 year ago

Also, the default debug impls use unsigned hexadecimal interpretation; you can view the signed value with println!("{}0x{}", if gx.msb() {"-"} else {""}, ExtAwi::bits_to_string_radix(&gx, true, 16, false, 1).unwrap());

AaronKutch commented 1 year ago

I'm curious what your application of my crate is. Do you want it for fixed-width 512-bit arithmetic? Tell me if you have any usability problems.

AaronKutch commented 1 year ago

Also, I just realized the documentation on arb_imul_add_ has dummy information from a copy+paste error I made months ago; I don't know how I missed that until now. I will fix it in a future update. If you want signed multiplication on differently sized integers, you can use that function.

AaronKutch commented 1 year ago

I have published a new version of awint, which includes recent fixes and a lot of new material about two's complement and overflow in the crate-level documentation of awint_core. It is kind of rushed and I need to refine it. Tell me if you have any confusion or suggestions.

xphoniex commented 1 year ago

It looks like you are using an older version of awint; a newer version uses a trailing `_` shorthand for assignment operations.

Yes, I'm aware; 0.8 needs a newer version of Rust than the one I'm using.

Instead, you are using concatenation to attach 3*128 leading zero bits to -1i128, which is not the same as -1i512. Why are you doing this?

I tried -1i512 but wasn't able to get it to compile. I can't reproduce it now, so it probably was unrelated and the compiler message was not clear enough. The examples don't show any big numbers either, so I thought the options were either concatenating smaller ones or writing out the binary representation.

I know how two's complement works, and I was a bit confused as I didn't know how to properly concatenate a negative number.

Also, the default debug impls use unsigned hexadecimal interpretation...

This is cool, but I'm using this in a no_std env, so I can't even get unsigned hex. Also, there's no easy way to iterate over the bytes. There is const_as_ref, but it seems I have to read the buf out of the struct using to_u8_slice, which is not ideal. Isn't it possible to iterate over the bytes without creating a new buf? I haven't looked under the hood to see how the bits are represented yet.

I'm curious what your application of my crate is, do you want it for fixed width 512 bit arithmetic? Tell me if you have any usability problems.

I'm using it for elliptic curve cryptography on embedded targets, which means it has to be both no_std and no alloc. Right now, only two crates meet the criteria: fixed-bigint-rs, which only supports unsigned integers, and your awint.

In terms of usability, I think it'd be pretty nice if you implemented the operators so that I could use a + b instead of looking up the underlying functions. The ops I need are +, -, %, and pow, plus the obvious bit operations.

A more interesting utility would be a mod ring: e.g. I want to multiply a * b with the result reduced mod P, so right now I use (a % P) * (b % P) % P, which is harder to read and probably not as performant.

I have published a new version of awint, which includes recent fixes and a lot about two's complement and overflow in the crate level documentation of awint_core. It is kind of rushed, I need to refine it. Tell me if you have any confusion or suggestions.

Looks good.

I think what is lacking right now in the docs (and src) is clear examples. Most examples do the same thing, e.g. concatenating a number. However, there are not enough examples for multiplication, modulo, subtraction, etc.

AaronKutch commented 1 year ago

My crate can't implement the std ops because it would involve allocations and fallible operations under the hood. Actually, maybe I could implement them for the InlAwi types, because I could enforce equal widths then, and there is stack storage for them. However, the problem is still that it can encourage losing the perf gains I intended with inplace assignment. Also, I just remembered I couldn't do / or % because it would be ambiguous whether unsigned or signed division was meant. const_as_ref and const_as_mut are functions left over from before const traits were a thing in nightly. Actually, using const in my crate still requires full nightly (because of the &mut Bits DST), so maybe I should remove them.

The inplace assignment operations are probably the most unconventional thing for newcomers to use; I should have more extensive examples. I think an elliptic curve routine would be a nice example, show me if you can. For modulo over primes, you just need a dummy buffer for the unused quotient output of Bits::udivide. Bits::udivide needs a minimum of 2 mutable buffers anyway to do division without allocation.
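
For example, a minimal sketch of a % p over 512-bit values, using the 0.7-era names seen elsewhere in this thread; the quotient goes into a scratch buffer and is simply ignored:

use awint::{inlawi, Bits, InlAwi};

// `rem = a % p`; the quotient lands in `scratch` and is thrown away.
// `udivide(quo, rem, duo, div)` wants all four operands to have the same
// bitwidth and returns `None` on bad input such as a zero divisor.
fn mod_p(rem: &mut Bits, scratch: &mut Bits, a: &Bits, p: &Bits) {
    Bits::udivide(scratch, rem, a, p).unwrap();
}

fn main() {
    let a = inlawi!(12345u512);
    let p = inlawi!(97u512);
    let mut rem = inlawi!(0u512);
    let mut scratch = inlawi!(0u512);
    mod_p(&mut rem, &mut scratch, &a, &p);
    // 12345 % 97 == 26
    assert!(rem.const_eq(&inlawi!(26u512)).unwrap());
}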

The way to use to_u8_slice is to simply create a stack-allocated [u8; N] if you know the sizes ahead of time (or else use a Vec<u8>), and then pass &mut buffer to it. For 512 bits you need 512 / 8 = 64 bytes. It and u8_slice_ are intended to get plain byte arrays in and out of Bits as cheaply as possible. There isn't an AsBytes, for example, because of difficulties with big-endian systems. However, now that I think about it, it should be possible to make an iterator over individual bytes; the iterator struct just has to be configured to jump around properly on big-endian systems.
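
For instance, a small sketch of pulling a 512-bit value out as plain bytes with to_u8_slice; the least-significant-byte-first ordering of the buffer is an assumption to double check against the docs:

use awint::{inlawi, Bits, InlAwi};

fn main() {
    let x = inlawi!(0x1122334455667788_u512);
    // 512 bits -> 512 / 8 = 64 bytes of stack buffer
    let mut buf = [0u8; 64];
    x.to_u8_slice(&mut buf);
    // least significant byte first
    assert_eq!(buf[0], 0x88);
}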

xphoniex commented 1 year ago

I completed my lib using fixed-bigint-rs but, as I suspected, I ran into issues with RAM size, as my target chip (Arduino Uno) only has 2 KB of RAM. I've been trying hard to optimize the code, but there's not much more I can do.

I tried getting the stack size from both rustc and avr-gcc, but they gave me rubbish, so instead I turned to an ARM target. Here are some stack sizes:

[Screenshot: stack sizes for the ARM target]

Seems like my double_assign_mod fn needs 1424 bytes to run. Here's the code, which shows why:

    #[inline]
    fn double_assign_mod(&mut self, P: &BigNum, one: &BigNum) {
        //let X1 = self.x.clone();
        let X1 = &self.x; // does this cost any memory or compiler is smart enough?
        //let Y1 = self.y.clone();
        let Y1 = &self.y;

        let two = BigNum::from_u8(2).unwrap(); // 64-bytes, can be reduced to 1-byte
        let three = BigNum::from_u8(3).unwrap(); // 64-bytes, can be reduced to 1-byte

        // every multiplication creates another 64-bytes num, as they are not in-place
        let lam = ((three * ((*X1 * *X1) % P)) % P) * Self::invert(&(two * Y1), P) % P; 

        let X3 = (lam * lam - two * X1) % P;
        let Y3 = if *X1 < X3 {
            (lam * (*P + X1 - X3) - Y1) % P
        } else {
            (lam * (*X1 - X3) - Y1) % P
        };

        self.x = X3;
        self.y = Y3;
    }

I found the same issue in awint, e.g. I can't multiply a big number 0xaf321..... by two if the operands are different sizes. If it's in-place, I don't see why it shouldn't be possible: you could multiply 200u256 * 2u8 the same way you'd do other muls and keep the result in-place. As you can see in the example, this would save at least 112 bytes (64 + 64 - 8 - 8).

If there's also a way to do a fixed 32-byte mul_mod, then it would require way less memory; the whole thing could be reduced from 1424 bytes to 2 or 3 allocations, so 96 bytes at most. Also notice the % ops, which are delegated to FixedUint<T, _>::div_rem and are pretty expensive too (Montgomery modular multiplication could help there).

Thoughts?

AaronKutch commented 1 year ago

2 KB of RAM, wow that's tight. I want to point out one performance footgun in such a case: be careful when moving around InlAwis. InlAwi implements Copy, so if you assign the same InlAwi storage to two bindings, one of them may turn into a copy into a second storage. One way to preclude potential problems is to use &Bits and &mut Bits references to the backing storage, so your let X1 = &self.x; line should be ok. I made some examples with awint 0.7. Here is my first pass on your problem; note that Self::invert is unimplemented because I don't know what it does:


// assumes `x1`, `y1`, and `p` are all 512 bits
fn double_assign_mod(x1: &Bits, y1: &Bits, p: &Bits) {

    // in this first part, we are calculating
    // let lam = ((3 * ((x1 * x1) % p)) % p) * Self::invert(&(two * Y1), P) % P;

    // int0 = (x1 * x1) % p
    let mut x1_copy = inlawi!(x1; ..512).unwrap();
    let mut pad = inlawi!(0u512);
    x1_copy.mul_assign(&x1, &mut pad).unwrap();
    // just renaming, this is zero cost if we don't use x1_copy later
    let x1_squared = x1_copy;
    let mut int0 = inlawi!(0u512);
    Bits::udivide(&mut pad, &mut int0, &x1_squared, &p).unwrap();

    // x1_squared, pad, int0 are live

    // int1 = (3 * int0) % p
    int0.short_cin_mul(0, 3);
    let mut int1 = x1_squared;
    Bits::udivide(&mut pad, &mut int1, &int0, &p).unwrap();

    // int1, pad, int0 are live

    let mut y1_copy = inlawi!(y1; ..512).unwrap();
    // equivalent to multiplying by 2
    y1_copy.shl_assign(1).unwrap();
    let y1_mul2 = y1_copy;
    // I don't know what the `Self::invert` function is, suppose we calculated it from `y1_mul2`
    // and `p` and stored it in `invert_result`, here let me just use `y1_mul2`.
    let mut invert_result = y1_mul2;

    // int1, pad, int0, invert_result are live

    // int2 = (int1 * invert_result) % p;
    invert_result.mul_assign(&int1, &mut pad);
    let mut int2 = int0;
    Bits::udivide(&mut pad, &mut int2, &invert_result, &p).unwrap();
    let mut lam = int2;

    // int1, pad, lam, invert_result are live

    // in this next part we are calculating
    /*let X3 = (lam * lam - two * X1) % P;
    let Y3 = if *X1 < X3 {
        (lam * (*P + X1 - X3) - Y1) % P
    } else {
        (lam * (*X1 - X3) - Y1) % P
    };

    self.x = X3;
    self.y = Y3;*/

    // (lam * lam - two * x1) % p
    let mut lam_copy = invert_result;
    cc!(lam; lam_copy).unwrap();
    lam_copy.mul_assign(&lam, &mut pad).unwrap();
    let mut lam_squared = lam_copy;
    lam_squared.sub_assign(&x1).unwrap();
    // we would have to copy anyway, just subtract twice
    lam_squared.sub_assign(&x1).unwrap();
    let mut x3 = int1;
    Bits::udivide(&mut pad, &mut x3, &lam_squared, &p).unwrap();

    // x3, pad, lam, lam_squared are live

    // (
    //   ((x1 - x3 + (`p` if x1 < x3)) * lam)
    //   - y1
    // ) % p
    let mut tmp = lam_squared;
    cc!(x1; tmp).unwrap();
    tmp.sub_assign(&x3).unwrap();
    if x1.ult(&x3).unwrap() {
        tmp.add_assign(&p).unwrap();
    }
    lam.mul_assign(&tmp, &mut pad).unwrap();
    lam.sub_assign(&y1).unwrap();
    let mut y3 = tmp;
    Bits::udivide(&mut pad, &mut y3, &lam, &p).unwrap();

    // x3 and y3 are our results
    dbg!(x3, y3);
}

Depending on how deep the stack is and how many function calls you have, what you could do is have some organizational struct that every function is impled on (taking &mut self), and you can have all the storages preallocated there. Then you can pass things into functions like &self.pad; this means you don't have redundant temporaries across the call stack as long as you only use the storages from the organizational struct. Just remember to zero or copy-assign them.
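
For instance, a rough sketch of that kind of organizational struct, with made-up struct/field/method names and the 0.7-era function names used elsewhere in this thread:

use awint::{inlawi, inlawi_ty, Bits, InlAwi};

// one preallocated blob of scratch space; `Ctx`, `pad`, and `tmp` are just
// illustrative names, not anything from awint itself
struct Ctx {
    pad: inlawi_ty!(512),
    tmp: inlawi_ty!(512),
}

impl Ctx {
    fn new() -> Self {
        Self {
            pad: inlawi!(0u512),
            tmp: inlawi!(0u512),
        }
    }

    // reduce `x` modulo `p` in place, reusing the preallocated buffers;
    // only the single `&mut self` pointer travels down the call stack
    fn mod_assign(&mut self, x: &mut Bits, p: &Bits) {
        // quotient -> pad (ignored), remainder -> tmp
        Bits::udivide(&mut self.pad, &mut self.tmp, x, p).unwrap();
        x.copy_assign(&self.tmp).unwrap();
    }
}

fn main() {
    let mut ctx = Ctx::new();
    let mut x = inlawi!(1000u512);
    let p = inlawi!(97u512);
    ctx.mod_assign(&mut x, &p);
    // 1000 % 97 == 30
    assert!(x.const_eq(&inlawi!(30u512)).unwrap());
}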

The main limiting factor is that my crate's udivide implicitly needs 4 storages of the same size, and if the intermediate results we want to divide are 512 bits, that means a minimum of 256 bytes. If you want, I could design an arbitrary width division function similar to how I have arb_umul..., but at the moment I don't have the free time unfortunately. Montgomery modular multiplication would be more optimized for this specific task; that will take even more time.

However, I have a workaround in terms of multiplicative inverses. What you can do is calculate the reciprocal ahead of time in a function that has the many storages (do not put these in your organizational struct), but make sure this function is called as early as possible so the call stack is shallow, and make sure that just the reciprocal, prime, and a shift value are returned so that all the intermediates are deallocated. In real numbers, a mod p = a - p * floor(a / p); we can use fixed point tricks to transform this into a mod p = a - p * floor((a * reciprocal) / 2^recip_pow), where reciprocal = floor(2^recip_pow / p) is calculated ahead of time for multiple uses, and the division by 2^recip_pow can be done with a shift. This is in fact the theoretically fastest way of doing these calculations if you need a lot of modulos by the same value in a row. I presume that p is a large prime; note that my crate has inplace multiplication and division functions for small constants under 2^16.

fn main() {
    let prime = inlawi!(168361517_u256);
    let (reciprocal, reciprocal_pow) = calc_reciprocal(&prime);
    // so this method will need about 132 bytes after `calc_reciprocal` is done once,
    // but `tmp256` can serve double duty in between reciprocal calculations
    // and `tmp528` is useful for other expanding multiplications
    let mut tmp256 = inlawi!(0u256);
    let mut tmp528 = inlawi!(0u528);

    // the value we want to modulo over
    let input = inlawi!(23175132985373896607622_u256);

    let mut result = inlawi!(0u256);
    mod_with_reciprocal(&mut result, &input, &prime, &reciprocal, reciprocal_pow, &mut tmp256, &mut tmp528);
    dbg!(result);
    println!("result in decimal: {}", ExtAwi::bits_to_string_radix(&result, false, 10, false, 1).unwrap());

    // use mod_with_reciprocal many times for only one calc_reciprocal call
}

fn calc_reciprocal(prime: &Bits) -> (inlawi_ty!(264), usize) {
    // It's been a while since I've analyzed stuff like this properly,
    // I _think_ what you want is to make a "giant one" with value
    // 2^(floor(log_2(p)) + 1 + number_of_bits_of_largest_a). If `a`
    // can take up the full 256 bits then it unfortunately means you
    // will have to increase a little beyond 256 so the reciprocal
    // will be larger than 256 bits
    assert_eq!(prime.bw(), 256);
    let reciprocal_pow = prime.sig() + 1 + 256;
    // need a large enough temporary
    let mut giant_one = inlawi!(0u528);
    giant_one.set(reciprocal_pow, true).unwrap();
    let mut tmp_prime = inlawi!(0u528);
    let o = tmp_prime.zero_resize_assign(&prime);
    assert!(!o);
    let mut quo = inlawi!(0u528);
    let mut rem = inlawi!(0u528);
    Bits::udivide(&mut quo, &mut rem, &giant_one, &tmp_prime).unwrap();
    let mut reciprocal = inlawi!(0u264);
    let o = reciprocal.zero_resize_assign(&quo);
    // make sure the reciprocal can fit in 264 bits
    assert!(!o);
    (reciprocal, reciprocal_pow)
}

fn mod_with_reciprocal(result: &mut Bits, input: &Bits, prime: &Bits, reciprocal: &Bits, reciprocal_pow: usize, tmp256: &mut Bits, tmp528: &mut Bits) {
    assert_eq!(result.bw(), 256);
    assert_eq!(input.bw(), 256);
    assert_eq!(tmp256.bw(), 256);
    assert_eq!(tmp528.bw(), 528);

    // note: remember that the multiplication function has `_add` in it: it adds to the
    // already existing value of `tmp528`, so you may need to zero it out before calls;
    // I am doing that here to demonstrate
    tmp528.zero_assign();
    tmp528.arb_umul_add_assign(&input, &reciprocal);
    tmp528.lshr_assign(reciprocal_pow).unwrap();
    tmp256.zero_assign();
    tmp256.arb_umul_add_assign(&tmp528, &prime);
    tmp256.rsb_assign(&input).unwrap();
    // because of flooring errors, it can be exactly off in the mod(a, p) == 0 case,
    // TODO need numerical analysis to verify this algorithm is correct
    if tmp256.const_eq(&prime).unwrap() {
        // double check analytically, either this is perfect and we can simply zero it,
        // we need a subtract, or I think we might be in a
        // Knuth-like situation where we need 2 subtracts
        tmp256.zero_assign();
    }
    assert!(tmp256.ult(&prime).unwrap());
    result.copy_assign(&tmp256).unwrap();
}

I'm running out of time; I haven't bug-tested double_assign_mod, but I hope you can infer enough to fix any problems and use mod_with_reciprocal in place of the udivides.

xphoniex commented 1 year ago

Thanks a lot for the code sample, I wasn't expecting that!

xphoniex commented 1 year ago

Ported everything, and this is orders of magnitude faster than fixed-bigint-rs. For multiply_DA, fixed-bigint-rs takes 8.59s (in release mode) vs. 20.23ms for awint... 424x faster!

Still, I can't get memory usage down; here's cargo +nightly call-stack for the ARM target (call-stack doesn't work with AVR):

[Screenshot: cargo call-stack output for the ARM target]

I pushed the library here; I don't think there's much more optimization that can be done to this.

If you find time, please have a look and let me know what you think. I'll try the reciprocal idea later. If I can reduce BigNum from 512-bit down to 256-bit, then I think everything would work fine.

xphoniex commented 1 year ago

Also, any chance of supporting const without needing alloc? I'm not sure, but I think there's a way to keep static memory out of RAM, so it'd save 64 * 3 = 192 bytes if I can make SECP256K1 const.

AaronKutch commented 1 year ago

My crate has a const_support flag, which allows const bigint operations if the backing storage is InlAwi. I don't remember it implicitly requiring alloc anywhere; is there some error if you try it? The thing is that it requires nightly; also look at the README, because it includes some information on #![feature... that you have to add in order for the macros to work with const. There is also a way to make a &'static Bits for the constants; a short example should be in the docs for InlAwi. Also, awint 0.7 is not the latest; awint 0.9 is the latest and has a renaming that truncates most _assigns, sorry if this causes trouble.

AaronKutch commented 1 year ago

Also, is double_assign_mod using 1256 bytes? It shouldn't be using nearly that much; it should only have around 200 bytes or so over the udivide, which should be the bottleneck. Maybe my hypothesis about let assigning was wrong, or optimizations aren't being applied or something. Are you compiling with -Os (I think that's the small-optimization flag)? Great to hear that the runtime seems to be fast, though.

AaronKutch commented 1 year ago

One more thing: the macros generate their constants as a byte array passed into a const function. I don't know if the compiler puts that into static memory automatically when const_support is enabled, or if I need to configure some extra stuff.

xphoniex commented 1 year ago

My crate has a const_support flag, which allows const bigint operations if the backing storage is InlAwi. I don't remember it implicitly requiring alloc anywhere; is there some error if you try it? The thing is that it requires nightly; also look at the README, because it includes some information on #![feature... that you have to add in order for the macros to work with const. There is also a way to make a &'static Bits for the constants; a short example should be in the docs for InlAwi. Also, awint 0.7 is not the latest; awint 0.9 is the latest and has a renaming that truncates most _assigns, sorry if this causes trouble.

I tried const_support and it wouldn't compile because it'd pull in awint_ext, which requires alloc.

I tried adding the extra #![feature... here but it didn't change anything.

AaronKutch commented 1 year ago

I think I forgot to note a fix in a changelog: I needed the special ? syntax in awint_ext?/const_support in the Cargo.toml. This was fixed after 0.8, I think; are you using the latest? I could backport a fix if you need it.

xphoniex commented 1 year ago

No, I'll update to the latest soon.

Are you compiling with -Os (I think that's the small-optimization flag)?

I compile with either opt-level = "s" or opt-level = "z", depending on which one produces a smaller elf. I tried with 3 and it failed:

          Memory region         Used Size  Region Size  %age Used
                      text:       41238 B        32 KB    125.85%
                      data:         165 B         2 KB      8.06%
                    eeprom:          0 GB         1 KB      0.00%
                      fuse:          0 GB          3 B      0.00%
                      lock:          0 GB         1 KB      0.00%
                 signature:          0 GB         1 KB      0.00%
           user_signatures:          0 GB         1 KB      0.00%

as flash size is 32 KB on my chip. I can compile with -O2 or -O1 and it doesn't overflow the stack (unlike s and z), but it keeps hanging forever and I don't know how to debug it.

AaronKutch commented 1 year ago

-O2 hangs? That might be a bug. The "z" opt level is what you want if things are really tight.

xphoniex commented 1 year ago

I made a mistake: -O1 hangs, -O2 overflows (same for z and s), and -O3 doesn't compile as it doesn't fit.

AaronKutch commented 1 year ago

That would be expected; those two are lesser than -O3. You might try https://github.com/johnthagen/min-sized-rust, although it is intended for stuff with std cruft.

xphoniex commented 1 year ago

I've already applied everything there.

AaronKutch commented 1 year ago

32K is crazy small; I unfortunately don't know what to do about that. Could you see what https://github.com/RazrFalcon/cargo-bloat says? Edit: wait, that doesn't support embedded.

xphoniex commented 1 year ago

There's a C library called micro-ecc which can do this just fine in C, and I got it to work, but I was hoping to do it in pure Rust.

I think he only uses double space for multiplications; that'd solve it for us as well if we could somehow limit everything to 256-bit except for the multiplications, which would be 512-bit.

AaronKutch commented 1 year ago

The reciprocal algorithm I put above uses multiplications up to 528 bits, but the initial calculation requires udivide, which definitely takes a lot of assembly instructions. You could adapt a Bits-scale binary division algorithm (see this algorithm; use Bits::set(quo, shl, true) for quo += 1 << shl), roughly along the lines of the sketch below. I presume you can't calculate the primes ahead of time.
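
A rough shift-and-subtract, remainder-only sketch of that idea (this is not awint's own udivide, just the general loop, written with the 0.7-era method names used elsewhere in this thread); it needs one extra buffer for the shifted divisor:

use awint::{inlawi, Bits, InlAwi};

// `rem` starts as the dividend and is reduced modulo `div` in place;
// `shifted` is a scratch buffer of the same bitwidth. This trades the large
// temporaries of a full division for a simple loop over the bitwidth.
fn binary_mod(rem: &mut Bits, shifted: &mut Bits, div: &Bits) {
    assert_ne!(div.sig(), 0);
    // line the divisor's most significant set bit up with the dividend's
    let shift = rem.sig().saturating_sub(div.sig());
    shifted.copy_assign(div).unwrap();
    shifted.shl_assign(shift).unwrap();
    for _ in 0..=shift {
        // subtract whenever the shifted divisor still fits
        if !rem.ult(shifted).unwrap() {
            rem.sub_assign(shifted).unwrap();
        }
        shifted.lshr_assign(1).unwrap();
    }
    // `rem` now holds the remainder
}

fn main() {
    let mut rem = inlawi!(1000u512);
    let mut scratch = inlawi!(0u512);
    let p = inlawi!(97u512);
    binary_mod(&mut rem, &mut scratch, &p);
    // 1000 % 97 == 30
    assert!(rem.const_eq(&inlawi!(30u512)).unwrap());
}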

AaronKutch commented 1 year ago

The other likely culprit is that 8-bit microcontrollers have 8-bit registers but 16-bit usizes, and my bigints are based on usize arrays, indexing and bitwidths with usize, and other things. I can't just make a MachineInteger that is 8-bits because then we would be limited to 255 bit bigints, so it would have to be a very extensive bifurcation change in the awint internals unfortunately.

AaronKutch commented 1 year ago

What would be more appropriate is if I made a second awint with different invariants, based on the byte level rather than the bit level, but that will take some serious development time.

xphoniex commented 1 year ago

The other likely culprit is that 8-bit microcontrollers have 8-bit registers but 16-bit usizes, and my bigints are based on usize arrays, indexing and bitwidths with usize, and other things. I can't just make a MachineInteger that is 8-bits because then we would be limited to 255 bit bigints, so it would have to be a very extensive bifurcation change in the awint internals unfortunately.

Hm.. can you come up with a short example to demonstrate that?

Because before starting with awint, I experimented with multiplying (somewhat) big numbers and the results were correct, I think. I have removed/added a lot of code, so it'd be nice if you could produce some short examples with which we could test whether awint operations are valid on 8-bit AVR.

AaronKutch commented 1 year ago

I'm not talking about correctness. What I'm talking about here is that if the ALU handles data in units of u8s and the digit-by-digit code works on u16s, every subroutine called in a loop over an array of u16s will need twice the number of operations (or 4x more for operations like multiplication) compared to an equivalently correct loop over u8s.

AaronKutch commented 1 year ago

I just remembered something: I once made https://github.com/AaronKutch/u64_array_bigints, which is a minimalistic biginteger library with just unsigned representations and stack-based storage. If you want total control, you could clone it, rename it to u8_array_bigints, remove all the U256 stuff and just keep Uint, replace most u64s with u8s (if there are u128 widenings, replace them with u16), fix the tests (they test against awint, so you don't have to recreate awint's huge testing suite), and then all that remains is implementing something analogous to arb_umul_add_assign and doing something about the hex macros.

xphoniex commented 1 year ago

Would I gain any performance over awint if I use u64_array_bigints?

Good news: I got awint to work with AVR yesterday. It was a huge pain to debug, as there's no global println and the panic macro is both inaccurate and takes > 800 bytes of RAM itself. Regardless, this is the culprit and why it was hanging:

    let mut private_key = inlawi!(0x02_u512);
    let curve = Curve::secp256k1();
    let public_key = curve.multiply_simple(&mut private_key);

    let mut buf = [0u8; 64]; <------
    public_key.x.to_u8_slice(&mut buf);
    print_hex_arr("x", &mut serial, &buf);
    public_key.y.to_u8_slice(&mut buf);
    print_hex_arr("y", &mut serial, &buf);

Previously I was creating a 64-byte-wide buf and then moving x and y into it (each 32 bytes); this works on x64 and even on embedded ARM, but somehow blocks on AVR. I guess I got away with it because most of the surrounding bits are 0 on the bigger architectures, and I don't expect the AVR processor to have a security feature in place that halts out-of-bounds access.

Here are some stats:

$ avr-size target/avr-atmega328p/release/app.elf
   text    data     bss     dec     hex filename
  12228     196       1   12425    3089 target/avr-atmega328p/release/app.elf

The program compiles down to 12KB (37.5% of 32KB) and uses 196B at startup, and the analysis for ARM shows ~900 bytes of usage for multiplication, almost 53% of total RAM.

Next, I need to implement sha256, which takes at least 512 bytes, plus a bit more for a buf, so almost 80% of RAM.

If I can keep everything under 2KB of RAM usage, this would be the only pure Rust lib that can do ECC signing on any architecture. Other crates only work on >32-bit arches or need alloc, etc.

But if not, then I'd have to revisit the reciprocal or other ideas you've suggested; in the meanwhile, if you're feeling generous, you're welcome to add a 256-bit mod_mul here, if it lowers stack usage.

AaronKutch commented 1 year ago

So you jumped straight from 41KB to 12KB; what made the difference? Also, are you saying that my to_u8_slice function is hanging on AVR?

You definitely don't want to use u64_array_bigints itself; I was just suggesting that if you wanted to go the extra mile, you could clone that repo and change it to be u8-based rather than u64-based, and that should fix the digit unrolling problem and reduce memory usage further. Although, I'm assuming too much and would need to look at the assembly and other actual details to see what is happening, and whether we need the separated storage and reference system or not. This discussion made me realize that in the future I will need to do some bifurcation of the bitwidth/indexing integer size and the digit size in awint, because it's not just 8-bit microcontrollers but also special architectures like CHERI and probably some future 128-bit arch.

But before you go to that length, I would suggest the reciprocal method, as it makes it so only one large 528-bit buffer is needed at a time rather than 4 x 512 bits.

xphoniex commented 1 year ago

So you jumped straight from 41KB to 12KB; what made the difference?

Correct. I use opt-level = "s" or "z", which makes it smaller. If I use opt-level = 3, I still get the 41KB size and it fails to compile. Also, if I don't enable lto = true, the code compiles and runs, but the output is gibberish.

For example, here's the same program with opt-level = 1:

$ telnet localhost 5678
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
00000000050eac6d7dc02c6da1595563abd6bd57fec00e23e14e0806d8af1ea300000000016a309aa25b0403f1eef75702e84bb7597aabde5c51aa3ec721cb2bC

whereas with opt-level = 2, "s", or "z", I get the correct output:

$ telnet localhost 5678
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
x = c6047f9441ed7d6d3045406e95c07cd85c778e4b8cef3ca7abac09b95c709ee5
y = b7c52588d95c3b9aa25b0403f1eef75702e84bb7597aabe663b82f6f04ef2777
Connection closed by foreign host.

Also, are you saying that my to_u8_slice function is hanging on AVR?

It hangs only when I use a bigger buf. When I use a 32-byte buf, both numbers work. But depending on where in the program it is, I can use either a 62-byte or up to a 36-byte buf. I don't know how it works, and as you can see, working with the AVR compiler feels random.

I should have gone with ARM, but that would have been boring, as there are plenty of libs supporting that.

But before you go to that length, I would suggest the reciprocal method, as it makes it so only one large 528-bit buffer is needed at a time rather than 4 x 512 bits.

I'll have to revisit this in the future.

AaronKutch commented 1 year ago

One more thing: you can use as_bytes_full_width_nonportable to get a transmuted &[u8] view into a &Bits, and there is also an as_mut_bytes_full_width_nonportable (these are almost zero-cost vs. u8_slice_ and to_u8_slice). But you have to be careful, as groups of bytes can be reversed if the architecture is big-endian, and if the bitwidth is not a multiple of usize::BITS you have to deal with the unused bits. See the documentation for more.
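
For example, a tiny sketch of the borrowed view; the byte-order check is guarded by cfg!, which is exactly the caveat being described, and the bitwidth is kept at 64 so there are no unused bits to worry about:

use awint::{inlawi, Bits, InlAwi};

fn main() {
    let x = inlawi!(0x0807060504030201_u64);
    // borrow the backing digits as bytes, with no copy into a separate buffer
    let bytes: &[u8] = x.as_bytes_full_width_nonportable();
    // on a little-endian target the first byte is the least significant one
    // (0x01 here); on big-endian the bytes within each digit come out reversed
    if cfg!(target_endian = "little") {
        assert_eq!(bytes[0], 0x01);
    }
}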

xphoniex commented 1 year ago

Ok, I'll try them out.

as groups of bytes can be reversed if the architecture is big-endian

it already is on AVR.


Also, a lot of embedded projects use an older rustc version (for stability, etc.), and since your lib provides arithmetic, which is something other libs could depend on, it's probably best not to always require the latest Rust.


I did a quick estimation of memory usage:

main        : n-or-k (1x64)                                          =   64 (not counted)

Curve       : G (2x64) + P (64)   // n is optimized out by compiler  =  192
multiply_DA : p (2x64) + d (2x64)                                    =  256
add_ass_mod : 2 alloc (2x64) + invert (4x64)                         =  384
dbl_ass_mod : 3 alloc (3x64) + invert (4x64)                         =  448
                                                                     ------
                                                                     = 1280

Pretty interesting that the compiler can get it down to ~900.

xphoniex commented 1 year ago

[Screenshot: cargo call-stack output]

Actually, no. cargo call-stack is showing usage for dbl_ass_mod @ max 856, local 656, and this is with the best optimization. Any idea why this is?

It's almost twice what we estimated! Shouldn't the local one at least be close to our estimate?

For example, can you confirm the memory usage of ashr_assign @ 72/40 bytes, or mul_assign @ 72/56 bytes? The rest seems to be ARM-specific memmoves, which are unexpected, unless they are from to_u8_slice.

AaronKutch commented 1 year ago

Uhh, you shouldn't be using ashr_assign unless you intend the integers to be signed; you probably want lshr_assign. The bytes associated with all Bits functions are only the passing of Bits references and the saving of registers on the stack. I don't have the setup to look at AVR assembly, but for ashr_assign it pushes 8 registers and does stuff like memcpy. For AVR, every &Bits reference has 2 bytes for the pointer and 2 bytes for the DST length, meaning things add up pretty quickly. inc_assign makes sense, since it takes 4 bytes for &self and needs to save 4 more bytes, but ashr_assign takes more than I would expect.
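
To illustrate the signedness point, a quick sketch of the difference between the two shifts (0.7-era names):

use awint::{inlawi, Bits, InlAwi};

fn main() {
    // -4 in 16-bit two's complement is 0xfffc
    let mut a = inlawi!(-4i16);
    let mut b = inlawi!(-4i16);

    // logical shift right: zeros come in from the top, so the bit string is
    // treated as unsigned and the result is 0x3fff
    a.lshr_assign(2).unwrap();
    assert!(a.const_eq(&inlawi!(0x3fff_u16)).unwrap());

    // arithmetic shift right: the sign bit is replicated, so the result is -1
    b.ashr_assign(2).unwrap();
    assert!(b.const_eq(&inlawi!(-1i16)).unwrap());
}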

I'm experimenting with a small example on riscv32i and getting some really weird results for how much stack is taken if large storages are made in a function. ashr_assign and the others take no stack space, however, presumably because there are enough scratch registers for temporaries.

AaronKutch commented 1 year ago

Ok, the next chunk of free time I get (may be next week, may be a few months), I will do the bifurcation thing so that you don't need to do the u8_array_bigints thing. I think I can improve the way the macros generate constants, so that they are guaranteed to be in .text. I will also revive the bits macro just for generating &'static Bits without going through intermediate steps.

On your side, I'm thinking the best strategy is to only use Bits references in most of your functions, and use the organizational struct strategy I mentioned earlier. The struct is passed around as &mut self so that only one 2-byte pointer is being passed around instead of more. The organizational struct will have a constructor that has all of the inlawi! calls in it, so that the weird extra allocations should be deallocated when the struct is returned and ready. In your functions, instead of using self.tmp256_0 etc. everywhere, use let bindings to temporarily rename things for sanity; unlike the previous strategy, this is guaranteed not to accidentally incur non-zero-cost stuff.

AaronKutch commented 1 year ago

I found something: it turns out that the function inlawi! uses for initialization with constants, unstable_from_u8_slice, is being inlined into wherever it is used. This is good if the compiler is optimizing correctly. What the compiler should be doing for this example

pub fn internal(x: &mut Bits, neg: bool) {
    let y = inlawi!(-3858683298722959599393.23239873423987_i1024f512);
    x.neg_add_(neg, &y).unwrap();
}

Is subtracting 128 from the stack pointer (plus 32 or so, depending on the architecture, for other things), doing one memcpy, and then calling neg_add_. However, what it was doing before is reserving an extra 128 bytes, doing a memset, and adding a second memcpy that does a seemingly useless movement between two 128-byte regions (the first memcpy is in u8_array_). The leftover 128 bytes is probably what is inflating memory usage to twice what is expected.

        addi sp, sp, -288
        .cfi_def_cfa_offset 288
        sw ra, 284(sp)
        sw s0, 280(sp)
        sw s1, 276(sp)
        sw s2, 272(sp)
        .cfi_offset ra, -4
        .cfi_offset s0, -8
        .cfi_offset s1, -12
        .cfi_offset s2, -16
        mv s0, a2

        mv s1, a1

        mv s2, a0

        addi a0, sp, 136
        li a2, 128
        li a1, 0
        call memset@plt
        li a0, 1024
        sw a0, 264(sp)

        lui a0, %hi(.L__unnamed_1)
        addi a2, a0, %lo(.L__unnamed_1)

        addi a0, sp, 136
        li a1, 33
        li a3, 128
        call awint_core::data::bits::Bits::u8_slice_

        mv a0, sp
        addi a1, sp, 136
        li a2, 132
        call memcpy@plt

        mv a3, sp
        li a4, 33
        mv a0, s2

        mv a1, s1

        mv a2, s0

        call awint_core::logic::sum::<impl awint_core::data::bits::Bits>::neg_add_

        beqz a0, .LBB0_2

        lw ra, 284(sp)
        lw s0, 280(sp)
        lw s1, 276(sp)
        lw s2, 272(sp)
        addi sp, sp, 288
        ret

By inlining stuff manually, I got the compiler to do what it should do:

        addi sp, sp, -160
        .cfi_def_cfa_offset 160
        sw ra, 156(sp)
        sw s0, 152(sp)
        sw s1, 148(sp)
        sw s2, 144(sp)
        .cfi_offset ra, -4
        .cfi_offset s0, -8
        .cfi_offset s1, -12
        .cfi_offset s2, -16
        mv s0, a2

        mv s1, a1

        mv s2, a0

        lui a0, %hi(.L__unnamed_1)
        addi a1, a0, %lo(.L__unnamed_1)
        addi a0, sp, 8
        li a2, 128
        call memcpy@plt
        li a0, 1024
        sw a0, 136(sp)

        addi a3, sp, 8
        li a4, 33
        mv a0, s2

        mv a1, s1

        mv a2, s0

        call awint_core::logic::sum::<impl awint_core::data::bits::Bits>::neg_add_

        beqz a0, .LBB0_2

        lw ra, 156(sp)
        lw s0, 152(sp)
        lw s1, 148(sp)
        lw s2, 144(sp)
        addi sp, sp, 160
        ret

See how much cleaner it is. If I had more time, I would reproduce this with primitive arrays and file a bug report.

AaronKutch commented 1 year ago

Ok, I'm most of the way to fixing things. I just need a way to 'cfg' for AVR (and any other arches you can think of); they aren't distributed by rustup, so I would have a hard time testing it. Is there a way to activate feature flags based on arch?

xphoniex commented 1 year ago

I'm thinking the best strategy is to only use Bits references in most of your functions, and use the organizational struct strategy I mentioned earlier. The struct is passed around as &mut self so that only one 2-byte pointer is being passed around instead of more.

I'm inlining every function call, so I think that shouldn't matter? Also, I can't seem to inline double_assign_mod, as it shows garbage output. Either it's corrupting some memory or the compiler is outputting wrong code; either way, I think I'm going to file a bug report for gcc.


Re the assembly: is that x64, ARM, or AVR? You're initializing an f512 value in inlawi!. Can you comment on the allocations in the assembly code?

Btw, you're saying this is the compiler's fault, yes?


Ok, I'm most of the way to fixing things. I just need a way to 'cfg' for AVR (and any other arches you can think of); they aren't distributed by rustup, so I would have a hard time testing it. Is there a way to activate feature flags based on arch?

Actually, I added a feature "8-bit" that I select when compiling for AVR. But I think you could use target_pointer_width as a substitute for target_arch.
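
As a small illustration of the target_pointer_width idea (the Digit alias here is made up for this sketch, not anything from awint):

// select an internal digit type by pointer width instead of naming every arch;
// the AVR target has 16-bit pointers, so it is caught by the "16" case
#[cfg(target_pointer_width = "16")]
type Digit = u8;
#[cfg(not(target_pointer_width = "16"))]
type Digit = usize;

fn main() {
    // prints 64 on a 64-bit host; on AVR it would be 8
    println!("digit bits: {}", Digit::BITS);
}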

Do you need help setting up for AVR? Here's the main.rs you can use as an example, along with this Cargo.toml (you can remove the extra stuff). You also need this target file, avr-atmega328p.json.

This is my final main.rs:

#![no_std]
#![no_main]

use panic_halt as _;

use arduino_hal::prelude::*;

use core::fmt::Debug;
use ufmt::uWrite;

use noble_secp256k1::awint::{cc, inlawi, inlawi_ty, Bits, InlAwi};
use noble_secp256k1::{BigNum, Curve};

fn print_hex_arr<S>(tag: &str, serial: &mut S, arr: &[u8])
where
    S: uWrite,
    <S as uWrite>::Error: Debug,
{
    ufmt::uwrite!(serial, "{} = ", tag).unwrap();
    for e in arr.iter().rev() {
        ufmt::uwrite!(serial, "{:02x}", *e).unwrap();
    }
    ufmt::uwrite!(serial, "\r\n").unwrap();
}

#[arduino_hal::entry]
fn main() -> ! {
    let dp = arduino_hal::Peripherals::take().unwrap();
    let pins = arduino_hal::pins!(dp);
    let mut serial = arduino_hal::default_serial!(dp, pins, 57600);

    let mut private_key = inlawi!(0x02_u512);
    let curve = Curve::secp256k1();
    let public_key = curve.multiply_simple(&mut private_key);

    let mut buf = [0; 32];
    public_key.x.to_u8_slice(&mut buf);
    print_hex_arr("x", &mut serial, &buf);
    public_key.y.to_u8_slice(&mut buf);
    print_hex_arr("y", &mut serial, &buf);

    ufmt::uwriteln!(&mut serial, "Hello from Arduino!\r").void_unwrap();
    loop { }
}

and make sure to import noble-secp256k1 in your Cargo.toml:

noble-secp256k1 = { git = "https://github.com/xphoniex/noble-secp256k1-rs", default-features = false, features = ["8-bit"] }

and .cargo/config.toml:

[target.avr-atmega328p]
rustflags = ["-Z emit-stack-sizes"]
runner = "qemu-system-avr -M uno -nographic -serial tcp::5678,server=on -bios"

[build]
target = "avr-atmega328p.json"

[unstable]
build-std = ["core"]

You also need the AVR toolchain, such as avr-gcc, and qemu-system-avr installed, and then you can run your generated elf file with:

# term 1
$ cargo run -r

# or

$ qemu-system-avr -M uno -bios target/avr-atmega328p/release/<file>.elf -nographic -serial tcp::5678,server=on
# term 2
$ telnet localhost 5678
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
x = c6047f9441ed7d6d3045406e95c07cd85c778e4b8cef3ca7abac09b95c709ee5
y = b7c52588d95c3b9aa25b0403f1eef75702e84bb7597aabe663b82f6f04ef2777
Hello from Arduino!

Once you're done, terminate qemu.


I've simplified my mul_mod ops, so I think I can get memory usage down. If there are other issues with double allocation, as you mentioned in the assembly section, and you can get them fixed, that'd be even better.

I'll have to add sha256 and then hmac first though.

AaronKutch commented 1 year ago

Re the assembly: is that x64, ARM, or AVR? You're initializing an f512 value in inlawi!. Can you comment on the allocations in the assembly code?

I was using riscv32i because it has the simplest assembly to understand. Ignore the f512 part; I was using fixed point notation to fill the constant with pseudorandom information so that the compiler didn't optimize it away. At the assembly code level, stack allocations have been amalgamated together into a single chunk of bytes per function. What happens is that the stack pointer sp is subtracted by the total number of bytes of all the stuff needed beyond the registers. In the first example, I can tell that double the needed allocations are happening because I have 128 bytes (edit: I forgot about the metadata digit, it should be 130) for a 1024-bit InlAwi, but you can see addi sp, sp, -288, which is about double that. There are some sw store-word operations happening to put some smaller things on the stack, and then you can see two places with addi a0, sp, 136 where the assembly is navigating to the two different compiled stack allocations. The group of operations

        li a2, 128
        li a1, 0
        call memset@plt

is one of the 128-byte allocations being memset to 0. In the second example, it would memcpy into a single allocation and didn't need the memset, since things would be initialized properly.

xphoniex commented 1 year ago

Gotcha, thanks for the explanation. Is this the compiler's fault?

AaronKutch commented 1 year ago

Is this the compiler's fault?

I don't know where in the chain the extra alloc is introduced, but it should be gone in version 0.10.0, which I just released; here is the changelog. In a future version, https://github.com/AaronKutch/awint/issues/15 should be implemented; it unfortunately still requires two large storages, but it should significantly improve execution time and code size.

AaronKutch commented 1 year ago

When I get around to attempting #18, things should improve further.

AaronKutch commented 1 year ago

I just published v0.12.0; some things should be improved. The MSRV is 1.70 now.