Closed by aklomp 2 years ago
I don't have access to any AArch64 hardware, so I cannot verify that this code improves performance, or by how much. Based on the results of #91 with the NEON32 codec, though, I'm confident in pushing the change.
The CI does build and test AArch64 on a virtual machine, so I was able to functionally test the code before pushing it.
As was done in #91 for NEON32, we can implement the inner encoding loop for the NEON64 encoder in inline assembly. This should guarantee that we get the assembly code that we want and expect. The inner encoding loop is quite simple, so there is no large cost to adding a second, parallel implementation.
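To make the idea of a second, parallel implementation concrete, here is a minimal sketch of the pattern. It is not the code from this change: the function names (`reshuffle`, `reshuffle_scalar`, `reshuffle_asm16`) and the `HAVE_NEON64_ASM` macro are hypothetical, and it covers only the bit-reshuffling step (three input bytes to four 6-bit indices), not the translation to Base64 characters. The inline assembly path is compiled only on AArch64 with a GCC-compatible compiler; everywhere else the plain C loop is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical feature macro: enable the asm path only where it can assemble.
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_NEON64_ASM 1
#endif

// Portable reference: split each group of 3 bytes into four 6-bit indices.
static void reshuffle_scalar(const uint8_t *in, uint8_t *out, size_t groups)
{
    for (size_t i = 0; i < groups; i++) {
        uint8_t a = in[3 * i], b = in[3 * i + 1], c = in[3 * i + 2];
        out[4 * i + 0] = a >> 2;
        out[4 * i + 1] = ((a << 4) | (b >> 4)) & 0x3F;
        out[4 * i + 2] = ((b << 2) | (c >> 6)) & 0x3F;
        out[4 * i + 3] = c & 0x3F;
    }
}

#ifdef HAVE_NEON64_ASM
// Same computation for exactly 16 groups (48 bytes in, 64 bytes out),
// written in inline assembly so the instruction sequence is fixed.
static void reshuffle_asm16(const uint8_t *in, uint8_t *out)
{
    __asm__ volatile (
        "ld3  {v0.16b, v1.16b, v2.16b}, [%[in]]          \n\t" // de-interleave a, b, c
        "movi v7.16b, #0x3f                              \n\t" // 6-bit mask
        "ushr v3.16b, v0.16b, #2                         \n\t" // a >> 2
        "ushr v4.16b, v1.16b, #4                         \n\t" // b >> 4
        "sli  v4.16b, v0.16b, #4                         \n\t" // (a << 4) | (b >> 4)
        "ushr v5.16b, v2.16b, #6                         \n\t" // c >> 6
        "sli  v5.16b, v1.16b, #2                         \n\t" // (b << 2) | (c >> 6)
        "and  v4.16b, v4.16b, v7.16b                     \n\t"
        "and  v5.16b, v5.16b, v7.16b                     \n\t"
        "and  v6.16b, v2.16b, v7.16b                     \n\t" // c & 0x3F
        "st4  {v3.16b, v4.16b, v5.16b, v6.16b}, [%[out]] \n\t" // interleave and store
        :
        : [in] "r" (in), [out] "r" (out)
        : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory"
    );
}
#endif

// Dispatcher: use the asm path for full 16-group chunks when available,
// then let the scalar loop handle whatever is left.
static void reshuffle(const uint8_t *in, uint8_t *out, size_t groups)
{
    assert(in != NULL && out != NULL);
#ifdef HAVE_NEON64_ASM
    while (groups >= 16) {
        reshuffle_asm16(in, out);
        in += 48;
        out += 64;
        groups -= 16;
    }
#endif
    reshuffle_scalar(in, out, groups);
}
```

Calling `reshuffle(in, out, 16)` on a 48-byte buffer should give the same 64 bytes as `reshuffle_scalar(in, out, 16)` on any platform, which is the kind of functional check a CI virtual machine can run even when no hardware is available for benchmarking.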