Faster set intersection

derwiki commented 13 years ago

There's a few algorithms that are supposed to be faster than O(m+n) time for a set intersection. Most of this are in academic papers that take 10 pages to describe what surely would have been only 1 page in pseudocode, but.. that's academia.

http://fawx.com/2009/10/26/an-ode-to-set-intersection-part-1/ Guy who implemented the Baeza-Yates fast set intersection algorithm including: https://github.com/erikfrey/themas/blob/master/src/set_intersection/intersect.hpp his code.

johnfn commented 13 years ago

Are you planning to look into this soon? I might check it out sometime - you know, just for fun :P

derwiki commented 13 years ago

I wasn't, I've got some segfaults I was looking into fixing.

You might want to talk to sam@causes.com, he had some ideas on using bit arrays to condense the membership array footprint. chris@causes.com also hinted that he was knowledgable about the Baeza-Yates algorithm.

ghost commented 13 years ago

I'm thinking that for the Suggestomatic, you don't need perfect set intersection. It only needs to be approximate.

The first thing that springs to mind is the use of a Bloom Filter. These are very compact, needing 9.6 bits per set-element to have a 1% false-positive rate. There's a decent looking Java implementation at:

http://code.google.com/p/java-bloomfilter/

and the Wikipedia write-up isn't bad:

http://en.wikipedia.org/wiki/Bloom_filter

The idea is, for each cause, you create a bloom filter for its user set. When comparing two causes, you then just check if one cause's bloom filter 'contains' each of the members of the other cause. This should be O(m) (m being the size of the set being iterated over).

... but wait, it gets better. One can create 'intersection' bloom filters by bitwise &-ing the internal bit arrays of two bloom filters. It seems to me that one may be able to take advantage of this property to create a similarity function that analyses the bit array of the intersection bloom filter in order to produce a score. Only problem with this is that it only works for bloom filters of the same size, so one would need to use unnecessarily large bloom filters for most causes.

derwiki commented 13 years ago

Bloom filters eh? Yea, you'll fit in pretty well at Causes :)

Bloom filters are one of my favorite data structures, but I always find it hard to justify using them. grant@causes.com is also interested in applying bloom filters to the set intersection problem. The reasons I haven't pushed hard for bloom filters:

You'd still need to keep the list of users around since you can't enumerate set membership from a bloom filter. So whatever we currently have + how much space it would take to do the bloom filters. And we'd wan't to make sure it all still fits into memory, or potential speedups would be lost.
They would only be substantially faster if set_a << set_b and set_b was a bloom filter. That might be a reason to implement it though -- maybe some threshold, sets > THRESHOLD get a bloom filter for comparisons

All that said, I'd love to see some experimenting and numbers for a bloom filter backed fast set intersection. Do you think that would be more worthwhile than something like Baeza-Yates? That one has the advantage of not needing a new data structure.

ghost commented 13 years ago

Hi Adam, I think this is one of those questions that is tricky to answer without measurements. Creation of a Bloom filter is O(n), but you only need to do it once for each larger set, as long as you can keep it in memory. Then for a comparison using the bloom filter, the complexity is O(m), where m is the size of the smaller set.

What I've been reading ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.7899&rep=rep1&type=pdf) is that the BY algorithm is worse-case O(n) (where n is the larger of the two sets), and best case O(log(n)log(m)). However, it also works well when the sets are different sizes, so the Bloom filter may or may not have an advantage in this case.

I think Baeza-Yates is the go in general, and then one might implement the Bloom filter optimization that you spoke of if needed. I've seen a decent Bloom filter implementation, and they're pretty easy to write, but I wouldn't bother until I had some numbers on Baeza-Yates first. Who knows; the Bloom filter optimization may make things slower.

Regards,

Kim.

On Fri, Jul 29, 2011 at 2:24 AM, derwiki < reply@reply.github.com>wrote:

Bloom filters eh? Yea, you'll fit in pretty well at Causes :)

Bloom filters are one of my favorite data structures, but I always find it hard to justify using them. grant@causes.com is also interested in applying bloom filters to the set intersection problem. The reasons I haven't pushed hard for bloom filters:

You'd still need to keep the list of users around since you can't enumerate set membership from a bloom filter. So whatever we currently have

how much space it would take to do the bloom filters. And we'd wan't to make sure it all still fits into memory, or potential speedups would be lost.

They would only be substantially faster if set_a << set_b and set_b was a bloom filter. That might be a reason to implement it though -- maybe some threshold, sets > THRESHOLD get a bloom filter for comparisons

All that said, I'd love to see some experimenting and numbers for a bloom filter backed fast set intersection. Do you think that would be more worthwhile than something like Baeza-Yates? That one has the advantage of not needing a new data structure.

Reply to this email directly or view it on GitHub: https://github.com/causes/suggestomatic/issues/1#issuecomment-1674314

Kim Mason kim.mason@kimandsally.com

derwiki commented 13 years ago

The current bottleneck in my https://github.com/causes/suggestomatic/commits/help_small_sets branch is that it chokes on large set_a's (on the the order of 1.4m and higher). I think that if set_a hits a threshold (maybe 100k members) then it might make sense to use a bloom filter for set_a and have set_b's items compared against it.

I think a combination of BY and bloom filters might be the way to go, but this is begging for data.

civiccc / suggestomatic

Faster set intersection #1