addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 556 forks source link

OffHeap RegisterSet support #103

Closed krummas closed 8 years ago

krummas commented 8 years ago

This makes it possible to keep the register set off heap - we need it in Cassandra for long lived HLLP instances (we use it for overlap estimation between sstables) https://issues.apache.org/jira/browse/CASSANDRA-11035

tea-dragon commented 8 years ago

I'm sorry, but this needs quite a bit of work to be acceptable. To start with, why didn't you try to just replace the int array with IntBuffer? Your current offheap implementation easily allows memory corruption / segfaults.

There are other significant issues as well as minor things like copying existing code style (there is in at least one case an if statement that uses braces replaced with one omitting them).

krummas commented 8 years ago

I disagree, but if it makes you feel better, I will refactor it to use a direct byte buffer

What other "significant" issues are you talking about? Not very helpful.

krummas commented 8 years ago

Why the hostility? Have I done anything wrong except try to help out your project?

tea-dragon commented 8 years ago

It's not intended as hostility. I'm falling asleep midway through and my girlfriend is waiting to finish watching a movie. Just terseness. Your original response indicated a need for details.

tea-dragon commented 8 years ago

but just for reference, here are some things you did wrong:

The reason it isn't helpful is because being helpful to that detail is time consuming and exhausting. So to summarize, the biggest thing you did wrong was submit code so awful that there is no way it is going to be accepted without either harming the library, or inordinate levels of effort on my part. Most likely to the extent of rewriting it entirely or wishing I had. However, since cassandra wants it, and enough people with the power to ask me to do things will care about that, that's going to happen... and at this point, really, already has.

I hope you understand that in your future endevours.

krummas commented 8 years ago

First, it is not the middle of the night, I live in Sweden

Second, I absolutely did not intend to be rude, maybe it was a language thing, I just accepted your comment and fixed the pull request the way you suggested.

krummas commented 8 years ago

The question about the other significant issues was just so that I could fix all the things at once and not have to wait.

tea-dragon commented 8 years ago

And it was my admiration for that sentiment that, despite my personal, local, situation, motivated me to highlight as many as I could before my brain completely failed.

I only mentioned the time add an explanation for the terse nature of some of the comments.

On Thu, Mar 10, 2016, 03:31 Marcus Eriksson notifications@github.com wrote:

The question about the other significant issues was just so that I could fix all the things at once and not have to wait.

— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/pull/103#issuecomment-194732511.

krummas commented 8 years ago

OK, sorry for any misunderstanding and I hope the new patch is more suited for inclusion

abramsm commented 8 years ago

@krummas

First, thank you for the PR, any contribution to this project is appreciated. We are very happy that stream-lib is a part of Cassandra and the idea of storing registers off-heap is a good one.

Second, Please allow me to personally apologize for the rudeness of the comments in this thread. That type of response is completely unacceptable and not something we condone. I am embarrassed and I hope you can accept this apology.

This is a big change and it will require some testing and analysis before we can merge in. Give us some time to work through the proposed changes and we'll get back to you when we have more questions or alternative approaches.

Thanks, Matt Abrams

mspiegel commented 8 years ago

Thanks for the pull request! Here are some initial questions:

Thanks!

@krummas https://github.com/krummas

First, thank you for the PR, any contribution to this project is appreciated. We are very happy that stream-lib is a part of Cassandra and the idea of storing registers off-heap is a good one.

Second, Please allow me to personally apologize for the rudeness of the comments in this thread. That type of response is completely unacceptable and not something we condone. I am embarrassed and I hope you can accept this apology.

This is a big change and it will require some testing and analysis before we can merge in. Give us some time to work through the proposed changes and we'll get back to you when we have more questions or alternative approaches.

Thanks, Matt Abrams

— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/pull/103#issuecomment-194845536.

krummas commented 8 years ago

This is for a new feature in Cassandra - with our earlier usage of stream-lib we deserialize the registers from disk whenever we need them, with this new feature we will be using them a lot more and need to keep them in memory all the time.

I will try to do a benchmark - we will potentially have 1000+ long lived registers in memory (one per sstable) and do many merges, will try to get something done early next week or so.

yuesong commented 8 years ago

Hi Marcus,

Thanks for your patience while we review the PR. We are a big user of Cassandra ourselves, so we are very happy to do our part to support this feature. Here are some of my thoughts on the PR:

With all that, here's my suggestion:

With these two changes, you have the option now to create custom RegisterSet implementations in your application. You can pass them to the HLL and HLLP constructors. In your case for Cassandra, for example, you can implement an OffHeapRegisterSet (using Unsafe or IntBuffer). It will have a destroy method where it frees its memory. Cassandra code will keep references to all the unsafe register sets and call their destroy methods at the appropriate time. This way the core of the library is not affected while you get the flexibility to do the off-heap customization.

Consider adding serialize/deserialize methods to the RegisterSet interface. That way we may not have to worry about changing the version number. That should make the code simpler.

Let me know what you think.

Yuesong

krummas commented 8 years ago

sounds good! will update the patch as soon as possible