On NEON, it is possible to do slightly better than the current 3-to-4 byte encoding shuffle by using SLI (shift left and insert) instructions and some cleverness. We can also match or beat the compiler's output by using inline assembly to manually pipeline these instructions.
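
To make the idea concrete, here is a minimal scalar sketch in portable C of how SLI can help split three input bytes into four 6-bit indices. It is an illustration under assumptions, not necessarily the exact instruction sequence used in the patch: `sli8` and `enc_reshuffle_model` are hypothetical names that model a single 8-bit lane. On real hardware the same steps would presumably map to per-lane intrinsics such as `vshrq_n_u8`, `vsliq_n_u8` and `vandq_u8`, applied to de-interleaved input.

```c
#include <stdint.h>

/* Scalar model of NEON SLI (shift left and insert) on one 8-bit lane:
 * src is shifted left by n bits and written into dst, while the low
 * n bits of dst are preserved. Valid for 0 <= n <= 7. */
static uint8_t sli8(uint8_t dst, uint8_t src, unsigned n)
{
    uint8_t keep = (uint8_t)((1u << n) - 1u);
    return (uint8_t)(((unsigned)src << n) | (dst & keep));
}

/* Hypothetical helper: turn input bytes a, b, c into four 6-bit values.
 *
 *   out[0] = 00 00 a7 a6 a5 a4 a3 a2
 *   out[1] = 00 00 a1 a0 b7 b6 b5 b4
 *   out[2] = 00 00 b3 b2 b1 b0 c7 c6
 *   out[3] = 00 00 c5 c4 c3 c2 c1 c0
 */
static void enc_reshuffle_model(const uint8_t in[3], uint8_t out[4])
{
    /* Right shifts place the low-order part of each output. */
    out[0] = (uint8_t)(in[0] >> 2);
    out[1] = (uint8_t)(in[1] >> 4);
    out[2] = (uint8_t)(in[2] >> 6);

    /* SLI merges in the bits from the previous input byte without a
     * separate shift-then-OR pair. */
    out[1] = sli8(out[1], in[0], 4);
    out[2] = sli8(out[2], in[1], 2);

    /* SLI leaves stray bits in the top two positions; mask them off. */
    out[1] &= 0x3F;
    out[2] &= 0x3F;
    out[3] = (uint8_t)(in[2] & 0x3F);
}
```

In this model each SLI replaces a shift followed by an OR, which is where the small saving would come from. Whether that is a net win on a given core depends on instruction latencies and scheduling, which is the part the inline assembly is meant to control.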