armon / bloomd

C network daemon for bloom filters
http://armon.github.io/bloomd

Memory consumption optimization #14

Closed FGRibreau closed 11 years ago

FGRibreau commented 11 years ago

Hi,

I've been using bloomd in production since yesterday and I must say I'm impressed by its stability and low CPU consumption. You did a really good job there, congrats!

However, I've got some questions regarding memory consumption. Currently, bloomd's memory consumption is constantly increasing (RES: 106M, SHR: 105M, VIRT: 243M).

Here are my bloom filters after one day.

f1 0.000100 300046 100000 91793
f2 0.000100 105575294 34100000 13919873
f3 0.000100 300046 100000 72710
f3 0.000100 1509656 500000 291040
f4 0.000100 1509656 500000 124098

Note that they are going to keep growing like that at a nearly constant rate. And since Scalable Bloom Filters work by adding new bloom filters when the size ratio is reached, memory consumption will increase indefinitely.

I'm not an expert in C, but I wondered if you could update the readme to give some input on how the "Automatically faults cold filters out of memory to save resources" feature works, in order to take advantage of it.

If I understand it correctly, since my filters won't ever be cold here (new data is added constantly), I thought maybe I could create filters with a composed name like "f{filterid}{weekoftheyear}{year}", where "{weekoftheyear}{year}" is information extracted from, and available in, every piece of data that the filters are tested against. That way, filters with older data could be removed from memory but still be available just in case.

Is this the right approach? What do you think?

armon commented 11 years ago

Hey Francois,

I'm glad to hear it is working well for you! You are correct in your thinking about how the filters will work, and in the current setup your memory use will continue to grow unbounded.

The automatic cold filter faulting is pretty simple. In the config it is possible to specify a value called cold_interval, which defaults to 3600 seconds. Basically, if a filter is not accessed (no checks / sets) for this interval, then it is removed from memory and kept on disk.
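As a rough illustration of the rule described above, here is a toy Python model. It is not bloomd's actual code: the `FilterState` class and the `fault_out_cold` function are hypothetical names, and the exact comparison (`>=`) and the idea that the next access brings a filter back into memory are assumptions made for the sketch.

```python
COLD_INTERVAL = 3600  # seconds; the default value mentioned in the thread


class FilterState:
    """Toy stand-in for one filter: tracks only residency and last access."""

    def __init__(self, name, now):
        self.name = name
        self.last_access = now
        self.in_memory = True

    def touch(self, now):
        # Assumption: any check or set counts as an access and
        # makes the filter resident in memory again.
        self.last_access = now
        self.in_memory = True


def fault_out_cold(filters, now, cold_interval=COLD_INTERVAL):
    """Mark filters idle for at least cold_interval as out of memory.

    Returns the names of the filters faulted out on this pass.
    """
    faulted = []
    for f in filters:
        if f.in_memory and now - f.last_access >= cold_interval:
            f.in_memory = False  # the data is kept on disk, only RAM is released
            faulted.append(f.name)
    return faulted
```

In this model, a filter that receives a check or set at least once per interval never goes cold, which is why a single ever-growing filter stays in memory.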

The way we set up our filters at Kiip is that they are named something like:

So we have filters like:

This way, as you suggested, eventually it is possible for the filters to go cold. Once the day is over, all of our daily filters get faulted out automatically, same with the month, etc.
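A minimal sketch of this kind of time-bucketed naming, in Python. The name layout (`base.YYYY-MM-DD` and so on), the `bucketed_names` helper, and the `signups` base name are hypothetical and chosen for illustration; they are not the names used at Kiip, which are not shown in this thread.

```python
from datetime import date


def bucketed_names(base, day):
    """Return hypothetical per-day, per-week and per-month filter names."""
    iso_year, iso_week, _ = day.isocalendar()
    return {
        "daily": f"{base}.{day:%Y-%m-%d}",
        "weekly": f"{base}.{iso_year}-W{iso_week:02d}",
        "monthly": f"{base}.{day:%Y-%m}",
    }


names = bucketed_names("signups", date(2013, 4, 5))
# names["daily"]   is "signups.2013-04-05"
# names["weekly"]  is "signups.2013-W14"
# names["monthly"] is "signups.2013-04"
```

Because the bucket is derived from the date carried by each record, writes move on to a new filter name when the period rolls over, and the previous period's filter stops being accessed and can go cold.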

This is basically the same as what you suggested, so I expect that it will work quite well for you!

Let me know if you have any other questions.

Best Regards,

Armon Dadgar

On Friday, April 5, 2013 at 3:36 AM, Francois-Guillaume Ribreau wrote:


FGRibreau commented 11 years ago

Thanks for your feedback!

I updated my code (with filters like .DD-MM-YYYY) and set initial_capacity to 10001 in order to keep memory as low as possible, and it worked!

I'll keep you posted if anything weird happens.

Cheers

armon commented 11 years ago

Great, glad it worked! I would advise against just setting the initial capacity to the smallest possible value. Due to the way the scalable filters work (stacking multiple bloom filters), lookups get slower as the number of stacked filters grows (marginally, but still). It is better if you can select a capacity that you think will be enough right off the bat.
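To make the trade-off concrete, here is a small Python sketch that counts how many stacked sub-filters a given initial capacity leads to. It assumes each new sub-filter is 4 times larger than the previous one. That factor is inferred from the capacities in the listing earlier in this thread (100000, then 500000, then 34100000), not taken from bloomd's source. The `layers_needed` helper is hypothetical.

```python
def layers_needed(items, initial_capacity, growth=4):
    """Count stacked sub-filters needed to hold `items` keys.

    Assumes a new sub-filter, `growth` times larger than the last one,
    is added whenever the total capacity so far is exhausted.
    Returns (number of sub-filters, total capacity).
    """
    layers, total, cap = 0, 0, initial_capacity
    while total < items:
        total += cap
        cap *= growth
        layers += 1
    return layers, total


# Using the size of f2 from the listing (13919873 keys):
small = layers_needed(13919873, 10001)       # (7, 54615461)
default = layers_needed(13919873, 100000)    # (5, 34100000)
presized = layers_needed(13919873, 14000000) # (1, 14000000)
```

Under these assumptions, a filter created at 10001 needs seven stacked sub-filters to hold what a pre-sized filter holds in one, and each extra sub-filter is one more structure that may need to be consulted per check.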