cloudflare / circl

CIRCL: Cloudflare Interoperable Reusable Cryptographic Library
http://blog.cloudflare.com/introducing-circl
Other
1.23k stars 136 forks source link

Implement NTRU Prime #384

Open Keelan10 opened 1 year ago

Keelan10 commented 1 year ago

Hello,

This is a draft of NTRU Prime which implements all the parameters specified in the round 3 submission. It passes the KAT tests provided in the NIST submission package.

Thanks for reviewing the PR!

bwesterb commented 1 year ago

Thank you for this, we will have a look. That might take some time.

Where does this implementation come from? Did you implement it yourself from the spec or did you transliterate an existing implementation?

bwesterb commented 1 year ago

Early comment: I see you use quite a lot of allocations:

BenchmarkDecapsulate/ntrulpr1013-8                                    22      48401778 ns/op       57280 B/op        135 allocs/op
BenchmarkDecapsulate/ntrulpr1277-8                                    14      76694781 ns/op       72673 B/op        145 allocs/op
BenchmarkDecapsulate/ntrulpr653-8                                     57      20311517 ns/op       39968 B/op        135 allocs/op
BenchmarkDecapsulate/ntrulpr761-8                                     40      28332384 ns/op       43752 B/op        135 allocs/op
BenchmarkDecapsulate/ntrulpr857-8                                     30      34925391 ns/op       50016 B/op        135 allocs/op
BenchmarkDecapsulate/ntrulpr953-8                                     25      43443037 ns/op       56544 B/op        135 allocs/op
BenchmarkEncapsulate/ntrulpr1013-8                                    33      32511700 ns/op       48160 B/op         97 allocs/op
BenchmarkEncapsulate/ntrulpr1277-8                                    20      53727275 ns/op       60928 B/op        103 allocs/op
BenchmarkEncapsulate/ntrulpr653-8                                     91      13530212 ns/op       33472 B/op         97 allocs/op
BenchmarkEncapsulate/ntrulpr761-8                                     66      19986460 ns/op       36792 B/op         97 allocs/op
BenchmarkEncapsulate/ntrulpr857-8                                     45      25415836 ns/op       41728 B/op         97 allocs/op
BenchmarkEncapsulate/ntrulpr953-8                                     37      31389920 ns/op       47232 B/op         97 allocs/op
BenchmarkDecapsulate/HPKE_KEM_P256_HKDF_SHA256-8                   15246         78954 ns/op        3361 B/op         44 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1013-8                                67      17443585 ns/op       37824 B/op       1054 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-8                                45      26197393 ns/op       47760 B/op       1320 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-8                                172       6903091 ns/op       25104 B/op        694 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-8                                129       9488787 ns/op       28384 B/op        802 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-8                                100      11835992 ns/op       32496 B/op        898 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-8                                 82      14593648 ns/op       36336 B/op        994 allocs/op

Those should be around 1-5.

For comparison:

BenchmarkGenerateKeyPair/Kyber768-8                                36907         32178 ns/op        8256 B/op          5 allocs/op
BenchmarkEncapsulate/Kyber768-8                                    62809         18425 ns/op        1216 B/op          3 allocs/op
BenchmarkDecapsulate/Kyber768-8                                    62088         19122 ns/op          32 B/op          1 allocs/op

Also, the operations are very slow compared to the cycle counts given in the round 3 submission. That one uses AVX2 intrinsics, but the difference is bigger than can be explained by that.

Keelan10 commented 1 year ago

Thank you for this, we will have a look. That might take some time.

Where does this implementation come from? Did you implement it yourself from the spec or did you transliterate an existing implementation?

I have translated it directly from the C reference/optimized implementation.

Keelan10 commented 1 year ago

Also, the operations are very slow compared to the cycle counts given in the round 3 submission. That one uses AVX2 intrinsics, but the difference is bigger than can be explained by that.

Noted, I am having a look at it. I will revert back to you afterwards. Thanks.

Keelan10 commented 1 year ago

I have managed to cut down the number of allocations.

BenchmarkGenerateKeyPair/ntrulpr1013-4                                50          34254035 ns/op           23591 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-4                                16          67169300 ns/op           28929 B/op         36 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-4                                 69          17044200 ns/op           15605 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-4                                 50          25825864 ns/op           17712 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-4                                 39          31935840 ns/op           20178 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-4                                 33          36169091 ns/op           22640 B/op         34 allocs/op
BenchmarkEncapsulate/ntrulpr1013-4                                    26          58675716 ns/op           39601 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr1277-4                                    16         102924990 ns/op           50145 B/op         95 allocs/op
BenchmarkEncapsulate/ntrulpr653-4                                     64          22068697 ns/op           27104 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr761-4                                     37          31186720 ns/op           30256 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr857-4                                     37          39697089 ns/op           33818 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr953-4                                     24          47283570 ns/op           38069 B/op         89 allocs/op
BenchmarkDecapsulate/ntrulpr1013-4                                    18          86339886 ns/op           47548 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr1277-4                                     9         135162156 ns/op           59486 B/op        136 allocs/op
BenchmarkDecapsulate/ntrulpr653-4                                     42          35454582 ns/op           32521 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr761-4                                     28          43609793 ns/op           35601 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr857-4                                     21          55018576 ns/op           40652 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr953-4                                     19          75017090 ns/op           45854 B/op        126 allocs/op

The number of allocations for streamlined NTRU prime is given below:

BenchmarkGenerateKeyPair/sntrup653-4                  10         110994258 ns/op           14825 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup761-4                   8         141809830 ns/op           16970 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup857-4                   4         294731995 ns/op           19296 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup953-4                   4         364868864 ns/op           21984 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup1013-4                  3         446066821 ns/op           23360 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup1277-4                  2         619764119 ns/op           28904 B/op         29 allocs/op
BenchmarkEncapsulate/sntrup653-4                      78          14519160 ns/op           17904 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup761-4                      88          21360334 ns/op           19784 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup857-4                      76          29744781 ns/op           22736 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup953-4                      38          32997382 ns/op           25456 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup1013-4                     27          44015633 ns/op           26288 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup1277-4                     19          55111287 ns/op           33744 B/op         80 allocs/op
BenchmarkDecapsulate/sntrup653-4                      38          46424767 ns/op           23728 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup761-4                      28          66217428 ns/op           25752 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup857-4                      15          77064377 ns/op           30032 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup953-4                      10         100944968 ns/op           33520 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup1013-4                     13         108477276 ns/op           34160 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup1277-4                      7         167687608 ns/op           43474 B/op        124 allocs/op

I have identified that the remaining allocations come mostly from the the following functions:

Could I ask you to provide pointers on how to optimize these functions please? Thank you.

bwesterb commented 1 year ago

I have translated it directly from the C reference/optimized implementation.

Then you should appropriately attribute the original.

bwesterb commented 1 year ago

Could I ask you to provide pointers on how to optimize these functions please? Thank you.

Are you familiar with Go's heap escape mechanism? Typically allocations can be removed by avoiding interfaces, function types and having the caller allocate values.

For an example of the latter:

func Pack(something T) []byte {
  ret := make([]byte, 123)
  // write into ret
  return ret
}

will have an unavoidable allocation, whereas

func Pack(ret []byte, something T) {
  // write into ret
}

func SomeOtherFunc() {
  var something T
   // ...
   out  := make([]byte, 123)
   Pack(out, something)
}

will not.

armfazh commented 1 year ago

@Keelan10 could you please format the files so the linter checks passes.

Keelan10 commented 1 year ago

Then you should appropriately attribute the original.

I added it to the doc.go file.

Are you familiar with Go's heap escape mechanism? Typically allocations can be removed by avoiding interfaces, function types and having the caller allocate values.

Thank you for this.

Keelan10 commented 1 year ago

@Keelan10 could you please format the files so the linter checks passes.

Yes, I have formatted the files accordingly. Thanks.

Keelan10 commented 1 year ago

After optimizing the Encode function, the number of allocations is as follows:

BenchmarkGenerateKeyPair/ntrulpr1013-4                                57          19854325 ns/op           21216 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-4                                37          32050562 ns/op           26336 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-4                                130           8618956 ns/op           13920 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-4                                100          15008387 ns/op           15840 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-4                                 76          17634686 ns/op           18144 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-4                                 34          30173744 ns/op           20192 B/op         16 allocs/op
BenchmarkEncapsulate/ntrulpr1013-4                                    18          86437308 ns/op           36608 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr1277-4                                     9         114803620 ns/op           46017 B/op         75 allocs/op
BenchmarkEncapsulate/ntrulpr653-4                                     57          27666982 ns/op           25408 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr761-4                                     31          50561442 ns/op           28048 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr857-4                                     18          66679760 ns/op           31552 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr953-4                                     15          71584093 ns/op           35776 B/op         71 allocs/op
BenchmarkDecapsulate/ntrulpr1013-4                                    14          85484315 ns/op           43424 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr1277-4                                     7         150202433 ns/op           54818 B/op        115 allocs/op
BenchmarkDecapsulate/ntrulpr653-4                                     38          35952484 ns/op           30240 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr761-4                                     33          38854777 ns/op           33216 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr857-4                                     25          53645810 ns/op           37792 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr953-4                                     20          60605877 ns/op           42784 B/op        107 allocs/op
BenchmarkGenerateKeyPair/sntrup653-4                  12         123409579 ns/op           11968 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup761-4                   8         132397397 ns/op           13889 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup857-4                   7         159702746 ns/op           15680 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup953-4                   6         218574910 ns/op           17984 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup1013-4                  4         295355986 ns/op           19264 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup1277-4                  2         567121358 ns/op           23616 B/op          8 allocs/op
BenchmarkEncapsulate/sntrup653-4                      66          16461438 ns/op           15056 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup761-4                      51          19784946 ns/op           16704 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup857-4                      51          25694787 ns/op           19120 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup953-4                      39          33983371 ns/op           21456 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup1013-4                     38          35351851 ns/op           22192 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup1277-4                     18          58978942 ns/op           28464 B/op         59 allocs/op
BenchmarkDecapsulate/sntrup653-4                      39          39306603 ns/op           17360 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup761-4                      26          71803380 ns/op           18832 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup857-4                      19          70680971 ns/op           21936 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup953-4                      13         100234053 ns/op           24400 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup1013-4                     10         165934875 ns/op           24944 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup1277-4                      6         167856732 ns/op           31666 B/op         99 allocs/op

Also, I am not sure how to optimize the Decode function and any suggestion would be greatly appreciated. Thank you.