Consider avoiding a local buffer

lemire commented 3 years ago

Your SIMD code writes to a local buffer, and this local buffer is copied in a second step to the output buffer. Consider doing a single pass, writing directly to the output. For the function to be safe, the output buffer must be fully allocated so that each 12-bit input word can generate a 16-bit output word, so it should be well defined to so. In the worse case, you may need a constant-time finisher pass to handle the last iteration.

cothan commented 3 years ago

Hi Professor Lemire,

Thank you for taking your valuable time looking at my code. So I wrote another function with 2 passes, first pass writes directly to output, and second pass to local buffer, and then copy local buffer to output. The transfer size from local to output buffer this time is less than 32 bytes, thus, the transfer size decrease from 256 to less than 32. As I benchmark the neon_rejection_sampling_mix, it gives the best cycles count compare to 2 other approaches.

Here is the link to the function I wrote: For short, this function is a hybrid approach. https://github.com/cothan/block_sort/blob/main/benchmark_sorting_const.c#L565

And here I add the new benchmark result:

	C	NEON-Full	Ratio	NEON-Half	Ratio	NEON-Mix	Ratio
Rejection Sampling	1686	773	2.18	1250	1.34	765	2.20

Thank you very much.

lemire commented 3 years ago

That sounds like a good approach indeed.

I would further try to modify the main loop so that the conditions in the loop can be checked faster. Having two comparisons is maybe more expensive than you'd like. It may add a couple of instructions per iteration.

lemire commented 3 years ago

And note that you usually can't retire two branch-and-jump instructions per cycle... so it may not be possible to hide the cost.

Alternatively, you may try with even more unrolling... that is, process more data per iteration.

cothan / block_sort

Consider avoiding a local buffer #1