jeremy-murphy opened 1 month ago · Open
I meant to say: if it actually does impact performance, an obvious change would be to store the previous value and advance just one iterator.
> I didn't see any downward spikes in the benchmark graph, but then you presumably weren't testing every value on the x-axis?
The array sizes on the x-axis are 16, 32, ..., 32768, 65536.
> Just a thought, but in the worst case, when k is exactly the size at which the memory the two iterators point to maps to the same cache line, does it degrade performance?
When the whole array fits in cache, factors other than memory proximity start to dominate. There are two function calls in the optimal solution (which may or may not be inlined) and more branches. I haven't done any profiling or looked at the disassembly, but my intuition tells me that these are the main reasons the optimal solution is slower for small problem sizes.
> If it actually does impact performance, I meant to say, an obvious change would be to store the previous value and just move one iterator.
I made no attempt at improving the performance of the "optimal" solution; it is taken verbatim from the author's post. I just wanted to highlight that theoretical algorithm complexity expressed in O-notation is a very crude measure of code performance when implemented on a real architecture.
Just a thought, but in the worst case, when k is exactly the size at which the memory the two iterators point to maps to the same cache line, does it degrade performance? I can't remember how to do the calculation for mapping memory to cache lines; I'll have to look it up later. I didn't see any downward spikes in the benchmark graph, but then you presumably weren't testing every value on the x-axis?