While reading through the code, I noticed that a comment explaining why certain permutations are implemented the way they are with intrinsics did not match the code. Specifically, the comment lists the permutation set {0123, 1032, 2301, 3012}, while the code actually performs {0123, 1032, 3210, 2301}.
In addition, some of the alternative selections mentioned in the comment were not correct with respect to what the code actually does. I updated them and verified their correctness against the Intel documentation and with tests on Godbolt.
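
One difference between the two sets can be checked without any intrinsics. The sketch below is illustrative only and is not taken from the code base: it assumes that each digit names the source lane selected for that output position, and the names `Perm`, `covers_all_lanes`, `kCommentSet` and `kCodeSet` are hypothetical. It tests whether, for every output lane, the four permutations together pick each of the four source lanes exactly once.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One 4-lane permutation; p[i] is the source lane selected for output lane i.
// (This reading of the digit notation is an assumption.)
using Perm = std::array<int, 4>;

// Returns true if, for every output lane, the four permutations together
// select each of the four source lanes exactly once.
bool covers_all_lanes(const std::array<Perm, 4>& perms) {
    for (std::size_t lane = 0; lane < 4; ++lane) {
        unsigned seen = 0;  // bit k is set once source lane k has been selected
        for (const Perm& p : perms) {
            seen |= 1u << p[lane];
        }
        if (seen != 0xFu) {
            return false;  // some source lane never reaches this output lane
        }
    }
    return true;
}

// Set as written in the old comment: {0123, 1032, 2301, 3012}.
const std::array<Perm, 4> kCommentSet = {{
    {{0, 1, 2, 3}}, {{1, 0, 3, 2}}, {{2, 3, 0, 1}}, {{3, 0, 1, 2}}}};

// Set the code actually performs: {0123, 1032, 3210, 2301}.
const std::array<Perm, 4> kCodeSet = {{
    {{0, 1, 2, 3}}, {{1, 0, 3, 2}}, {{3, 2, 1, 0}}, {{2, 3, 0, 1}}}};

// Convenience check of both sets.
void check_permutation_sets() {
    assert(!covers_all_lanes(kCommentSet));  // output lane 1 selects source lane 0 twice
    assert(covers_all_lanes(kCodeSet));
}
```

Under that reading, the set from the old comment selects source lane 0 twice for output lane 1 (via 1032 and 3012), whereas the set used in the code pairs every output lane with every source lane exactly once.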