Open cocreature opened 7 years ago
I’ve just pushed the implementation I currently have but with the sparse representation disabled for now. The basics should be there (but I’m unable to test because Thrill doesn’t execute the lambda, so it’s probably buggy :)). The three improvements listed above are not yet implemented and atm I also use an index-value pair instead of encoding this pair in a single integer/bitfield.
I’m unassigning this because I want to dig into the Thrill bug we’re seeing and that could take a while.
Alright, the sparse representation is now actually somewhat sparse and it uses a larger precision. However the precision is currently just dropped before we calculate the result.
What’s left to do is further compression as described in section 5.3.2
and 5.3.3
of the Google paper.
Sorry for not responding in 4 days. I'll look into it next week (probably around Tuesday/Wednesday).
Thrill already contains a varint implementation so it probably makes sense to just use that. I might get to the varint/difference encoding tomorrow. If not you’re ofc welcome to tackle this :)
I’ve just implemented varint and difference encoding & decoding. Decoding is already reasonably clean but encoding currently constructs a new array. We should probably create an output iterator that merges values with the same indices and encoding the values so that we can avoid constructing the intermediate arrays in mergeSparse
. But that is just a (probably) small optimization and should not affect the memory representation and the results so there is no need to do that immediately.
What is left to do is encoding the hash values (that should be easy, the additional bit needed for that is already present in our encoding) and making use of the additional precision instead of just discarding it (should be easy as well). Once we have finished the encoding we also need to calculate the proper thresholds at which we merge tmpList
and sparseList
and for switching to the dense representation.
I've just implemented the hash encoding. I'll look into the merge threshold for tmpList
and sparseList
tomorrow.
@TiFu Have you forgotten to push the hash encoding? I don’t see it in the repo.
That seems to be the case. I'll push it later and implement the usage of the higher precision in the second phase of the algorithm.
Moritz Kiefer notifications@github.com schrieb am Di., 31. Jan. 2017 13:13:
@TiFu https://github.com/TiFu Have you forgotten to push the hash encoding? I don’t see it in the repo.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/cocreature/thrill/issues/5#issuecomment-276348102, or mute the thread https://github.com/notifications/unsubscribe-auth/AFs3Z-azsGk5qKQbQNJ1J9L6kfAZN3wYks5rXyV2gaJpZM4LfcyN .
I will need more time to work on this. I hope to finish everything tomorrow.
I'm having some major performance issues after implementing the threshold for changing from sparse to dense representation.
I have pushed my changes on encodingThresholdSparse
. It seems like 65a46191f9854cec29c79cf1fa1a605d1a68e45c is the commit introducing the bottleneck, but I couldn't figure out why. The performance hit only appears during the first part of the HyperLogLog Test. If you comment out the first tests with n = 100000, the performance is completely normal. This would indicate, that the conversion is indeed the culprit because the conversion only occurs with higher n.
Actually, this could be due to a far higher size of the sparse list after removing the artificial 200 element threshold. There's quite some work done in the first for loop of toDense() (iteration over the sparse list).
Would you mind taking a look at the code? Maybe you'll notice something, which I missed.
Are the following sizes for the sparse and dense representation correct?
size_t sparseSize = tmpSet.size() * sizeof(SparseRegister) + sparseListBuffer.size() * sizeof(uint8_t);
size_t denseSize = (1 << p) * sizeof(uint8_t);
The denseSize should be 2^p * 8 Bit, because there are 2^p registers à 8 Bit. SparseSize: basically size of the tmpSet + size of the sparseListBuffer.
I’ll look into this tomorrow. The sizes seem to be correct.
@TiFu Just to make sure I’m seeing the same performance issues that you are seeing: The tests run about 2.5 times slower for me with this commit included and the total runtime is around 350ms. Is this comparable to the regression you found?
The performance bottleneck seems to be in mergeSparse
(75% of our total runtime according to perf
although the attribution looks incorrect/weird). By increasing the threshold you also increased the calls to mergeSparse
and thereby increased the runtime. A simple improvement that is also used in the Google paper is to set the tmpset threshold to 25% of the sparselist threshold (or some approximation thereof).
I’m looking into optimizing mergeSparse
itself but I’m not yet sure how.
The runtime on my machine is 7 seconds. I will look into optimizing mergeSparse as well.
Moritz Kiefer notifications@github.com schrieb am Fr., 3. Feb. 2017 09:40:
The performance bottleneck seems to be in mergeSparse (75% of our total runtime according to perf although the attribution looks incorrect/weird). By increasing the threshold you also increased the calls to mergeSparse and thereby increased the runtime. A simple improvement that is also used in the Google paper is to set the tmpset threshold to 25% of the sparselist threshold (or some approximation thereof).
I’m looking into optimizing mergeSparse itself but I’m not yet sure how.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/cocreature/thrill/issues/5#issuecomment-277193527, or mute the thread https://github.com/notifications/unsubscribe-auth/AFs3Z3-RC1PoqZx8syrLUvRAuxOUfjwWks5rYugLgaJpZM4LfcyN .
@TiFu That sounds like you are using a debug build which is fine for seeing if your code works but you don’t want to use that for benchmarking. If you pass -DCMAKE_BUILD_TYPE=Release
to cmake it will compile with optimizations.
Setting MAX_TMPSET_SIZE
to 200 (which should still be below the 25% recommended in the paper) brings the runtime down to 90% so just reducing the number of calls to mergeSparse
seems to more effective than trying to to optimize mergeSparse
.
FYI: It seems like I committed a buggy version (and accidentally pushed it). We currently have a relative error of 2.6 (Yes 2.6) for the 1000 element test. I will look into it later today.
2.6 is in O(1), so this is fine.
@TiFu Have you made any progress here? Otherwise I can look into this this evening.
I added a sparseCount to Registers. It counts the elements in the sparse list. The count is guaranteed to be correct after a call to mergeSparse (before that we don't know if the same index is already saved). This reduces the relative error to 1.45 * 10^-5.
I still wasn't able to reduce the error rate for precision 14 at about 17000 elements (see chart). I think the issue might be that we change too early to the dense representation. If you've time you could look into that this evening, that would be great.
I’m not sure if we need to worry about the slightly higher error rate around 17000. The error rates are still pretty low. It is definitely not related to the point at which we switch to the dense representation, you can see the same phenomenon in the graphs I posted before we started using the additional precision available in the sparse representation. Most likely this is caused by slightly inappropriate bias values. I’ll check them again but maybe the values provided by google don’t quite match the implementation described in their paper.
What I’m more worried about is that we lose the additional precision (i.e. switch to the dense representation) very early (around 4000). The graphs in the paper suggest that they only switch to the dense representation around 11000. I’ll see if I can find something that’s off in the calculation used for this.
The bias values looks fine (at least for precision 14, I didn’t check other precisions) so the reason for the slightly increased errors hear is probably something else (and imho negligible). I also looked at the threshold calculation but so far I’ve been unable to find anything suspicious. I suspect that their encoding might just be slightly more efficient (e.g. group varint). I might look at it again in the next few days.
This is the final improvement suggested in the google paper. The basic idea is to use a list of
(index, value)
pairs if the number of registers with values≠ 0
is small to reduce the necessary space. If this list would take up more space than the dense representation it is converted to an array.They suggest 3 further improvements related to this:
LinearCounting
does not use them (see the paper for the exact encoding).