Closed jhcarl0814 closed 1 year ago
Thanks for the patch. This changes a lot of things that are unrelated to the feature. Please split this patch and put the cosmetic changes (wrapping min/max etc) in another PR.
Don't make the data structure a template argument. It is fine to replace the standard implementation, because the speed is essentially the same and the additional memory overhead is small. The core logic of the eytzinger search should be implemented in a seperate struct or class, as it is now, so that the core algorithm can be tested in isolation (and potentially reused).
Perhaps the vector should be merged into our eytzinger class, although this may have unwanted side effects for the design.
@HDembinski Thank you for simplifying all this! Now I can get rid of the four special members and the extra ctor
and extra assign
.
The cosmetic changes (wrapping min/max etc) are split into https://github.com/boostorg/histogram/pull/377 .
vec_
is outside ofsorted_array_and_binary_search
andeytzinger_layout_and_eytzinger_binary_search
to make it easier for outsiders to getvec_
without knowing the type ofdata_structure_
used
We don't allow direct access to the array, so it is fine if we internally change the data structure from vector_type to something else. However, users of variable
must be able to look up a bin by index resonably efficiently. This happens through
auto bin(index_type idx) const noexcept { return interval_view<variable>(*this, idx); }
A generic iterator and begin() and end() are generated through mixin classes, which call this interface.
In most cases, users will iterate over the values in their original order, so if we need to make performance trade-offs, this is the case we should focus on. Ideally, we keep only the eytzinger-sorted array and not the original array.
@HDembinski According to https://www.hyrumslaw.com/ the users of auto bin(index_type idx)
might have already depended on it being O(1)
. If we remove the original vec_
and modify value_type value(real_index_type i)
to use i
to locate the bin in the eytzinger complete binary tree, the time complexity becomes O(lg(n))
and will break the previously not documented "interface".
@HDembinski According to https://www.hyrumslaw.com/ the users of
auto bin(index_type idx)
might have already depended on it beingO(1)
. If we remove the originalvec_
and modifyvalue_type value(real_index_type i)
to usei
to locate the bin in the eytzinger complete binary tree, the time complexity becomesO(lg(n))
and will break the previously not documented "interface".
That's all true, however: 1) We do not consider that a breaking change in Boost. Breaking changes are those which make user code not compile anymore and we do not violate any explicit guarantee given in the documentation. The complexity of bin value look-up was not defined in the documentation, which means it is implementation defined and can change at any time. 2) For the small N that we deal with here, the difference between O(1) and O(log(N)) should not be dramatic. 3) In the typical use of a histogram, you need bin index look-up to be very fast because that's performed in a hot loop, but bin value look-up does not happen in a hot loop.
I am more concerned about the increase in memory from holding two arrays. We already need two arrays instead of one in the current implementation, to hold the eytzinger-sorted values and the indices. Adding yet another array makes the implementation unfavorable in my view. The performance increase of eytzinger is not so dramatic that tripling the amount of required memory seems worth it.
If the cons outweigh the pros, I would abandon this change.
@jhcarl0814 I have been thinking about this change and after all, the costs do seem to outweigh the benefits. Eytzinger users more storage, iteration over the axis will become slower, and the benefits are only substantial if you make a very large axis that we do not encounter in real life. In many cases, the axis should fit into the L1 cache, and then the Eytzinger layout should not provide any benefits.
@HDembinski You are right.
push_back
(or push_front
) quickly like std::vector
(or std::deque
) because it was designed to be a search-only data structure.I should have done more study before suggesting this thing. Sorry for taking up your time.
I should have done more study before suggesting this thing. Sorry for taking up your time.
No need to apologize. I could equally apologize to you, since I asked you to prepare a PR too early. We could have figured out these caveats without the PR. Thank you anyway, this was very interesting for me.
design decisions (very very questionable):
vec_
is outside ofsorted_array_and_binary_search
andeytzinger_layout_and_eytzinger_binary_search
to make it easier for outsiders to getvec_
without knowing the type ofdata_structure_
useddata_structure_
is needed whenevervec_
constructs or changes (forgetting to add copy constructor, move constructor, copy assignment operator and move assignment operator causes hard-to-track bugs), andsorted_array_and_binary_search
andeytzinger_layout_and_eytzinger_binary_search
each gets an extraconstructor
and an extraassign
to makevariable
's move constructor and move assignment operatornoexcept
.sorted_array_and_binary_search
's does re-construction fromvec_
and ignoresrhs
because it's shallow;eytzinger_layout_and_eytzinger_binary_search
's doesstd::move
fromrhs
and ignoresvec_
because it has state and the function needs to benoexcept
.