Closed dyc3 closed 1 year ago
Ok, I was able to verify that I ported all the macros correctly (at least as far as I could tell), and I fixed some stuff. I'm still not entirely sure where I'm going wrong here, as the tests are still failing.
I found out there was a way to tell the compiler to generate code based on a target feature being enabled with the #[target_feature(enable = "avx2")]
attribute. With this method, it likely be safe to do the same thing with avx512, but I don't have a CPU that supports it to make sure. It also completely sidesteps the code gen problems I was having before (demonstrated in https://godbolt.org/z/TM94EbjKf).
https://godbolt.org/z/843z7eGYb demonstrates the code gen using this method, which is implemented in 70e34de.
Running benchmarks on my Ryzen 9 3900X, it shows about a 20-30% performance improvement. Would this solution be acceptable? I can clean up the commit history here if it matters.
So the compress function implementations are identical and the only difference is #[target_feature(enable = "avx2")]
, correct? I don't think we need to duplicate the function implementations. You could try to mark the compress function as #[inline(always)]
and then introduce an AVX2 wrapper around it.
I've tried my hardest to get this to work, but I think I'm in a little over my head. The most I've managed to do is port the macros from the optimized reference version.
My main problem is that I have no idea how these indexes intostate
work:https://github.com/P-H-C/phc-winner-argon2/blob/92cd2e1/src/opt.c#L94-L102If someone could give me a hand, I'd be happy to fix/finish this. This is the first time I've ever tried to do stuff with SIMD, so bear with me as I'm still learning.
related to #104