NVlabs / CGBN

CGBN: CUDA Accelerated Multiple Precision Arithmetic (Big Num) using Cooperative Groups

`cgbn_mont_sqr(r, a, modulus)` returns result > modulus #15

Open sethtroisi opened 3 years ago

sethtroisi commented 3 years ago

I have found a case with N=971 bits where `cgbn_mont_sqr(r, a, n)` returns `(a^2 % n) + n`

I made a minimal repro branch here

Would you be willing to assist in debugging? Any ideas what might be wrong?

sethtroisi commented 3 years ago

I increased the reproducibility by setting

a = n - bn2mont(1)

With BITS=1024, TPI={8,16,32}, this fails for all 2^n-1 with n >= 683
With BITS=512, TPI={8,16}, this fails for all 2^n-1 with n >= 342
With BITS=256, TPI=8, this fails for all 2^n-1 with n >= 171

The threshold is probably n = (2/3) * BITS

With BITS=512, TPI=32, it seems to work on the tested inputs
With BITS=256, TPI={16,32}, it seems to work on the tested inputs
With BITS=128, TPI={8,16,32}, it seems to work on the tested inputs
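For what it's worth, the failure pattern above is reproducible in a pure-Python model of Montgomery squaring in which reduction subtracts n only on a carry out past R. This is only a sketch under that assumption -- `redc`, `to_mont`, and `mont_sqr` are illustrative names, not CGBN's internals -- but it hits the reported BITS=256, n >= 171 case exactly:

```python
# Hedged model: Montgomery squaring where REDC subtracts n only on carry out.
BITS = 256
R = 1 << BITS                       # Montgomery radix

def redc(t, n, n_prime):
    # Montgomery reduction: t * R^-1 mod n, in redundant form --
    # n is subtracted only when the sum carries out past R,
    # so the result may land in [n, 2n).
    m = (t * n_prime) % R
    u = (t + m * n) // R            # exact: t + m*n is divisible by R
    return u - n if u >= R else u

def to_mont(x, n):
    # Analogous to bn2mont: convert x into Montgomery form.
    return (x * R) % n

def mont_sqr(a, n, n_prime):
    return redc(a * a, n, n_prime)

n = (1 << 171) - 1                  # 2^171 - 1, the BITS=256 threshold case
n_prime = (-pow(n, -1, R)) % R      # n * n_prime ≡ -1 (mod R)
a = n - to_mont(1, n)               # the trigger value from above

r = mont_sqr(a, n, n_prime)
expected = (a * a * pow(R, -1, n)) % n
print(r == expected + n)            # → True: the result exceeds n by exactly n
```

The same script returns a fully reduced result for small moduli, which matches the observation that only n above roughly (2/3) * BITS misbehaves in this model.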

sethtroisi commented 3 years ago

The error is present whether I use `core_mul_xmad.cu` or `core_mul_imad.cu`

sethtroisi commented 3 years ago

I looked at testing this with unit_test, but found the experience difficult. Is it expected that `make` takes ~90 seconds for the tester? Is there a faster way to write a single unit test that compares the CPU (mpz) result against the GPU (CGBN) result?

nemmart commented 3 years ago

Hi Seth,

Thanks for reporting. You have found a real problem here, but it's more of a documentation issue than a coding issue. The `mont_mul` and `mont_sqr` routines use a redundant representation and only subtract N when there is a carry out -- this is faster because it saves a comparison on every `mont_sqr` and `mont_mul`, and the result remains correct modulo N. The `mont2bn` routine is the final step and ensures the result is normalized to the range [0, N). I think the implementation is right, but the documentation should explain what's going on under the covers.
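A pure-Python sketch of the normalization step described above may make this concrete: converting out of Montgomery form runs one more reduction plus an explicit comparison, which folds a redundant value in [n, 2n) back into [0, n). The names here (`redc`, `mont2bn`) are illustrative, not CGBN's actual implementation:

```python
# Hedged model: the mont -> normal conversion absorbs the extra n.
BITS = 256
R = 1 << BITS                       # Montgomery radix

def redc(t, n, n_prime):
    # One Montgomery reduction step; result may still be >= n.
    m = (t * n_prime) % R
    return (t + m * n) // R         # exact: t + m*n is divisible by R

def mont2bn(a_mont, n, n_prime):
    u = redc(a_mont, n, n_prime)    # convert out of Montgomery form
    return u - n if u >= n else u   # the comparison skipped in mont_mul/sqr

n = (1 << 171) - 1                  # same modulus as the repro above
n_prime = (-pow(n, -1, R)) % R      # n * n_prime ≡ -1 (mod R)
mont_one = R % n                    # Montgomery form of 1
redundant = mont_one + n            # unnormalized input in [n, 2n)
print(mont2bn(redundant, n, n_prime))   # → 1
```

In this model, feeding the redundant value `mont(1) + n` through the final conversion still yields 1, which is why intermediate results larger than N are harmless as long as `mont2bn` is the last step.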

I'll leave this issue open until I update the docs.

Thank you, Niall