Open Keelan10 opened 1 year ago
Thank you for this, we will have a look. That might take some time.
Where does this implementation come from? Did you implement it yourself from the spec or did you transliterate an existing implementation?
Early comment: I see you use quite a lot of allocations:
BenchmarkDecapsulate/ntrulpr1013-8 22 48401778 ns/op 57280 B/op 135 allocs/op
BenchmarkDecapsulate/ntrulpr1277-8 14 76694781 ns/op 72673 B/op 145 allocs/op
BenchmarkDecapsulate/ntrulpr653-8 57 20311517 ns/op 39968 B/op 135 allocs/op
BenchmarkDecapsulate/ntrulpr761-8 40 28332384 ns/op 43752 B/op 135 allocs/op
BenchmarkDecapsulate/ntrulpr857-8 30 34925391 ns/op 50016 B/op 135 allocs/op
BenchmarkDecapsulate/ntrulpr953-8 25 43443037 ns/op 56544 B/op 135 allocs/op
BenchmarkEncapsulate/ntrulpr1013-8 33 32511700 ns/op 48160 B/op 97 allocs/op
BenchmarkEncapsulate/ntrulpr1277-8 20 53727275 ns/op 60928 B/op 103 allocs/op
BenchmarkEncapsulate/ntrulpr653-8 91 13530212 ns/op 33472 B/op 97 allocs/op
BenchmarkEncapsulate/ntrulpr761-8 66 19986460 ns/op 36792 B/op 97 allocs/op
BenchmarkEncapsulate/ntrulpr857-8 45 25415836 ns/op 41728 B/op 97 allocs/op
BenchmarkEncapsulate/ntrulpr953-8 37 31389920 ns/op 47232 B/op 97 allocs/op
BenchmarkDecapsulate/HPKE_KEM_P256_HKDF_SHA256-8 15246 78954 ns/op 3361 B/op 44 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1013-8 67 17443585 ns/op 37824 B/op 1054 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-8 45 26197393 ns/op 47760 B/op 1320 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-8 172 6903091 ns/op 25104 B/op 694 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-8 129 9488787 ns/op 28384 B/op 802 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-8 100 11835992 ns/op 32496 B/op 898 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-8 82 14593648 ns/op 36336 B/op 994 allocs/op
Those should be around 1-5.
For comparison:
BenchmarkGenerateKeyPair/Kyber768-8 36907 32178 ns/op 8256 B/op 5 allocs/op
BenchmarkEncapsulate/Kyber768-8 62809 18425 ns/op 1216 B/op 3 allocs/op
BenchmarkDecapsulate/Kyber768-8 62088 19122 ns/op 32 B/op 1 allocs/op
Also, the operations are very slow compared to the cycle counts given in the round 3 submission. That one uses AVX2 intrinsics, but the difference is bigger than can be explained by that.
Thank you for this, we will have a look. That might take some time.
Where does this implementation come from? Did you implement it yourself from the spec or did you transliterate an existing implementation?
I have translated it directly from the C reference/optimized implementation.
Also, the operations are very slow compared to the cycle counts given in the round 3 submission. That one uses AVX2 intrinsics, but the difference is bigger than can be explained by that.
Noted, I am having a look at it. I will revert back to you afterwards. Thanks.
I have managed to cut down the number of allocations.
BenchmarkGenerateKeyPair/ntrulpr1013-4 50 34254035 ns/op 23591 B/op 34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-4 16 67169300 ns/op 28929 B/op 36 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-4 69 17044200 ns/op 15605 B/op 34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-4 50 25825864 ns/op 17712 B/op 34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-4 39 31935840 ns/op 20178 B/op 34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-4 33 36169091 ns/op 22640 B/op 34 allocs/op
BenchmarkEncapsulate/ntrulpr1013-4 26 58675716 ns/op 39601 B/op 89 allocs/op
BenchmarkEncapsulate/ntrulpr1277-4 16 102924990 ns/op 50145 B/op 95 allocs/op
BenchmarkEncapsulate/ntrulpr653-4 64 22068697 ns/op 27104 B/op 89 allocs/op
BenchmarkEncapsulate/ntrulpr761-4 37 31186720 ns/op 30256 B/op 89 allocs/op
BenchmarkEncapsulate/ntrulpr857-4 37 39697089 ns/op 33818 B/op 89 allocs/op
BenchmarkEncapsulate/ntrulpr953-4 24 47283570 ns/op 38069 B/op 89 allocs/op
BenchmarkDecapsulate/ntrulpr1013-4 18 86339886 ns/op 47548 B/op 126 allocs/op
BenchmarkDecapsulate/ntrulpr1277-4 9 135162156 ns/op 59486 B/op 136 allocs/op
BenchmarkDecapsulate/ntrulpr653-4 42 35454582 ns/op 32521 B/op 126 allocs/op
BenchmarkDecapsulate/ntrulpr761-4 28 43609793 ns/op 35601 B/op 126 allocs/op
BenchmarkDecapsulate/ntrulpr857-4 21 55018576 ns/op 40652 B/op 126 allocs/op
BenchmarkDecapsulate/ntrulpr953-4 19 75017090 ns/op 45854 B/op 126 allocs/op
The number of allocations for streamlined NTRU prime is given below:
BenchmarkGenerateKeyPair/sntrup653-4 10 110994258 ns/op 14825 B/op 27 allocs/op
BenchmarkGenerateKeyPair/sntrup761-4 8 141809830 ns/op 16970 B/op 27 allocs/op
BenchmarkGenerateKeyPair/sntrup857-4 4 294731995 ns/op 19296 B/op 27 allocs/op
BenchmarkGenerateKeyPair/sntrup953-4 4 364868864 ns/op 21984 B/op 27 allocs/op
BenchmarkGenerateKeyPair/sntrup1013-4 3 446066821 ns/op 23360 B/op 27 allocs/op
BenchmarkGenerateKeyPair/sntrup1277-4 2 619764119 ns/op 28904 B/op 29 allocs/op
BenchmarkEncapsulate/sntrup653-4 78 14519160 ns/op 17904 B/op 74 allocs/op
BenchmarkEncapsulate/sntrup761-4 88 21360334 ns/op 19784 B/op 74 allocs/op
BenchmarkEncapsulate/sntrup857-4 76 29744781 ns/op 22736 B/op 74 allocs/op
BenchmarkEncapsulate/sntrup953-4 38 32997382 ns/op 25456 B/op 74 allocs/op
BenchmarkEncapsulate/sntrup1013-4 27 44015633 ns/op 26288 B/op 74 allocs/op
BenchmarkEncapsulate/sntrup1277-4 19 55111287 ns/op 33744 B/op 80 allocs/op
BenchmarkDecapsulate/sntrup653-4 38 46424767 ns/op 23728 B/op 114 allocs/op
BenchmarkDecapsulate/sntrup761-4 28 66217428 ns/op 25752 B/op 114 allocs/op
BenchmarkDecapsulate/sntrup857-4 15 77064377 ns/op 30032 B/op 114 allocs/op
BenchmarkDecapsulate/sntrup953-4 10 100944968 ns/op 33520 B/op 114 allocs/op
BenchmarkDecapsulate/sntrup1013-4 13 108477276 ns/op 34160 B/op 114 allocs/op
BenchmarkDecapsulate/sntrup1277-4 7 167687608 ns/op 43474 B/op 124 allocs/op
I have identified that the remaining allocations come mostly from the the following functions:
Could I ask you to provide pointers on how to optimize these functions please? Thank you.
I have translated it directly from the C reference/optimized implementation.
Then you should appropriately attribute the original.
Could I ask you to provide pointers on how to optimize these functions please? Thank you.
Are you familiar with Go's heap escape mechanism? Typically allocations can be removed by avoiding interfaces, function types and having the caller allocate values.
For an example of the latter:
func Pack(something T) []byte {
ret := make([]byte, 123)
// write into ret
return ret
}
will have an unavoidable allocation, whereas
func Pack(ret []byte, something T) {
// write into ret
}
func SomeOtherFunc() {
var something T
// ...
out := make([]byte, 123)
Pack(out, something)
}
will not.
@Keelan10 could you please format the files so the linter checks passes.
Then you should appropriately attribute the original.
I added it to the doc.go file.
Are you familiar with Go's heap escape mechanism? Typically allocations can be removed by avoiding interfaces, function types and having the caller allocate values.
Thank you for this.
@Keelan10 could you please format the files so the linter checks passes.
Yes, I have formatted the files accordingly. Thanks.
After optimizing the Encode function, the number of allocations is as follows:
BenchmarkGenerateKeyPair/ntrulpr1013-4 57 19854325 ns/op 21216 B/op 16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-4 37 32050562 ns/op 26336 B/op 16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-4 130 8618956 ns/op 13920 B/op 16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-4 100 15008387 ns/op 15840 B/op 16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-4 76 17634686 ns/op 18144 B/op 16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-4 34 30173744 ns/op 20192 B/op 16 allocs/op
BenchmarkEncapsulate/ntrulpr1013-4 18 86437308 ns/op 36608 B/op 71 allocs/op
BenchmarkEncapsulate/ntrulpr1277-4 9 114803620 ns/op 46017 B/op 75 allocs/op
BenchmarkEncapsulate/ntrulpr653-4 57 27666982 ns/op 25408 B/op 71 allocs/op
BenchmarkEncapsulate/ntrulpr761-4 31 50561442 ns/op 28048 B/op 71 allocs/op
BenchmarkEncapsulate/ntrulpr857-4 18 66679760 ns/op 31552 B/op 71 allocs/op
BenchmarkEncapsulate/ntrulpr953-4 15 71584093 ns/op 35776 B/op 71 allocs/op
BenchmarkDecapsulate/ntrulpr1013-4 14 85484315 ns/op 43424 B/op 107 allocs/op
BenchmarkDecapsulate/ntrulpr1277-4 7 150202433 ns/op 54818 B/op 115 allocs/op
BenchmarkDecapsulate/ntrulpr653-4 38 35952484 ns/op 30240 B/op 107 allocs/op
BenchmarkDecapsulate/ntrulpr761-4 33 38854777 ns/op 33216 B/op 107 allocs/op
BenchmarkDecapsulate/ntrulpr857-4 25 53645810 ns/op 37792 B/op 107 allocs/op
BenchmarkDecapsulate/ntrulpr953-4 20 60605877 ns/op 42784 B/op 107 allocs/op
BenchmarkGenerateKeyPair/sntrup653-4 12 123409579 ns/op 11968 B/op 8 allocs/op
BenchmarkGenerateKeyPair/sntrup761-4 8 132397397 ns/op 13889 B/op 8 allocs/op
BenchmarkGenerateKeyPair/sntrup857-4 7 159702746 ns/op 15680 B/op 8 allocs/op
BenchmarkGenerateKeyPair/sntrup953-4 6 218574910 ns/op 17984 B/op 8 allocs/op
BenchmarkGenerateKeyPair/sntrup1013-4 4 295355986 ns/op 19264 B/op 8 allocs/op
BenchmarkGenerateKeyPair/sntrup1277-4 2 567121358 ns/op 23616 B/op 8 allocs/op
BenchmarkEncapsulate/sntrup653-4 66 16461438 ns/op 15056 B/op 55 allocs/op
BenchmarkEncapsulate/sntrup761-4 51 19784946 ns/op 16704 B/op 55 allocs/op
BenchmarkEncapsulate/sntrup857-4 51 25694787 ns/op 19120 B/op 55 allocs/op
BenchmarkEncapsulate/sntrup953-4 39 33983371 ns/op 21456 B/op 55 allocs/op
BenchmarkEncapsulate/sntrup1013-4 38 35351851 ns/op 22192 B/op 55 allocs/op
BenchmarkEncapsulate/sntrup1277-4 18 58978942 ns/op 28464 B/op 59 allocs/op
BenchmarkDecapsulate/sntrup653-4 39 39306603 ns/op 17360 B/op 91 allocs/op
BenchmarkDecapsulate/sntrup761-4 26 71803380 ns/op 18832 B/op 91 allocs/op
BenchmarkDecapsulate/sntrup857-4 19 70680971 ns/op 21936 B/op 91 allocs/op
BenchmarkDecapsulate/sntrup953-4 13 100234053 ns/op 24400 B/op 91 allocs/op
BenchmarkDecapsulate/sntrup1013-4 10 165934875 ns/op 24944 B/op 91 allocs/op
BenchmarkDecapsulate/sntrup1277-4 6 167856732 ns/op 31666 B/op 99 allocs/op
Also, I am not sure how to optimize the Decode function and any suggestion would be greatly appreciated. Thank you.
Hello,
This is a draft of NTRU Prime which implements all the parameters specified in the round 3 submission. It passes the KAT tests provided in the NIST submission package.
Thanks for reviewing the PR!