FeatureBaseDB / featurebase

A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
https://www.featurebase.com

Proposal: Improve performance of Union function when unioning many bitmaps #1765

Closed richardartoul closed 5 years ago

richardartoul commented 5 years ago

Description

The existing implementation of the Union function is very expensive (in terms of CPU cycles and allocations) when many different bitmaps need to be unioned together.

Success criteria (What criteria will consider this ticket closeable?)

This ticket would be considered closeable if we were able to make significant improvements in terms of reducing CPU cycles and allocations when unioning together many bitmaps.

Additional Context

We're currently using Pilosa's Roaring bitmap implementation as well as the "roaring" Roaring Bitmap implementation in our open source time-series database, M3DB. We want to move away from using two different Roaring bitmap implementations because converting between them at query time is expensive and allocation heavy. However, when we tried to switch to using only the Pilosa implementation, we found that the Union operation was far too CPU- and allocation-heavy when used to union together many different bitmaps at once, which our use case requires.

We're proposing that we take on the work of improving the Union function's performance (either by modifying the existing function or creating a new one) so that we can achieve performance at least comparable to the "roaring" implementation, whose FastOr function is (relatively) efficient at unioning together many different bitmaps.

If the Pilosa team is open to us contributing this work, we have a few questions:

1. Does a function/method signature along the lines of

func (b *Bitmap) Union(bitmaps ...*Bitmap) *Bitmap

seem reasonable? We may be able to achieve that by modifying the existing function rather than adding a new one, which would be nice since I believe it would be a non-breaking interface change. (A hypothetical call site is sketched after this list.)

  2. There may be some potential for additional performance improvements if we know the bitmaps are immutable (as they are in our production database case). Is the Pilosa team open to us adding some kind of distinction to the Roaring bitmap implementation between immutable and mutable bitmaps? Maybe something along the lines of adding a Seal method? I'm not 100% sure we will need this until I actually start getting my hands dirty, but I thought it might be good to discuss upfront. A potential example is when unioning bitmap B into bitmap A and you find that bitmap B has a container that bitmap A does not. In that case, it would be much faster to just copy a reference to bitmap B's container into bitmap A, but that is only safe if bitmap B is immutable and the now-shared container in bitmap A is marked as immutable as well.

  3. Finally, while a new implementation of the Union function that accepts multiple bitmaps at once should let us significantly reduce the number of allocations that take place when unioning together many different bitmaps, ideally we would also be able to pool some of the underlying data structures to avoid allocating slices every time for short-lived bitmaps whose lifecycle lasts only as long as they're needed to serve a given query. It looks like there is some basic support for this in the form of the Reset method on the containers, which would allow us to re-use the []*Container slice, but ideally we'd also be able to re-use some of the actual containers themselves so we don't have to re-allocate their underlying slices each time. Is this something the Pilosa team would consider letting us add?
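To make question 1 concrete, here's a hypothetical call site assuming the variadic form is adopted (the helper and the per-segment bitmaps are made up for illustration):

import "github.com/pilosa/pilosa/roaring"

// unionAll shows how the proposed variadic signature would be used: one call
// instead of a loop of pairwise b = b.Union(other), each of which allocates
// an intermediate bitmap.
func unionAll(segments []*roaring.Bitmap) *roaring.Bitmap {
    if len(segments) == 0 {
        return roaring.NewBitmap()
    }
    return segments[0].Union(segments[1:]...)
}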

Looking forward to hearing your thoughts and feedback, and thanks for the great open source library!

seebs commented 5 years ago

The Seal method is interesting to me, because I'd been pondering the creation of a copy-on-write mechanic for similar purposes. In that case, the behavior would roughly be: (1) mark the container as copy-on-write, (2) keep multiple pointers to the container, and (3) anyone who wants to write to it is expected to copy it first. I think that would help avoid allocating as many containers.
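A minimal sketch of that copy-on-write idea, purely for discussion (the field and method names here are illustrative, not the actual roaring API):

type Container struct {
    frozen bool // set when the container may be shared by more than one bitmap
    // ... actual array/bitmap/run storage elided ...
}

// share marks the container as copy-on-write and returns it, so several
// bitmaps can hold the same pointer as long as writers copy first.
func (c *Container) share() *Container {
    c.frozen = true
    return c
}

// writable returns a container that is safe to mutate: the receiver itself if
// it isn't shared, otherwise a fresh copy.
func (c *Container) writable() *Container {
    if !c.frozen {
        return c
    }
    cp := c.clone() // hypothetical deep copy of the container's storage
    cp.frozen = false
    return cp
}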

jaffee commented 5 years ago

This sounds awesome, and I think Union(bitmaps ...*Bitmap) seems like a reasonable interface at first glance. If it becomes necessary, I don't think we'd be opposed to adding a new method either - faster large Unions would speed up some really important operations for us, so this could definitely be worth a little API bloat.

Regarding Seal, Reset, etc. We'd welcome any clever allocation reduction tricks with open arms and we will try to prioritize review of any contributions in this area.

Didn't realize M3 was using Pilosa's roaring, but that's awesome!!

richardartoul commented 5 years ago

Great, glad y'all are excited about it! I spent most of the day yesterday working on it and got something that performs really well in my synthetic benchmarks. I'm going to create an M3DB build that uses my test branch, see how it performs in our test cluster, and report back.

richardartoul commented 5 years ago

Ok, so the test with the production workload was very exciting. We saw order-of-magnitude improvements in memory utilization (before, the DB nodes would immediately OOM when we started serving queries; with the new build you can't even see the memory budge). In addition, it was previously so slow that we basically couldn't serve any queries, whereas now we can serve approximately 10x more queries/s than we could with the previous library. This is without the seal/pooling work as well.

I will spend some time today trying to make the code look like it wasn't written by a C programmer and then post it up for review.

jaffee commented 5 years ago

This is super exciting - I will be very interested to see the code, and see if we can apply similar patterns elsewhere (e.g. Difference, Intersect, Xor).

seebs commented 5 years ago

Nothin' wrong with C programmers! :)

richardartoul commented 5 years ago

#1766 First draft

May need some more tests, but I think it's ready for some feedback!

richardartoul commented 5 years ago

Just tagging this as an additional thing to discuss: if your team doesn't mind adding a dependency, this library doubles the speed at which we can do bitmap repairs / popcounts: https://github.com/tmthrgd/go-popcount

This is not that important when you're unioning together a lot of bitmaps, because with the changes I made you only end up needing to popcount each container once, so the cost is fairly well amortized; but if you're just unioning two together it makes a big difference.
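For reference, the swap would look roughly like this when counting the set bits in a bitmap container's words (this assumes go-popcount's CountSlice64 helper; the functions themselves are just for illustration):

import (
    "math/bits"

    popcount "github.com/tmthrgd/go-popcount"
)

// countBitmap is the portable version: one OnesCount64 per word.
func countBitmap(words []uint64) uint64 {
    var n uint64
    for _, w := range words {
        n += uint64(bits.OnesCount64(w))
    }
    return n
}

// countBitmapFast does the same via go-popcount, which uses wider
// vectorized/assembly loops where available.
func countBitmapFast(words []uint64) uint64 {
    return popcount.CountSlice64(words)
}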

tgruben commented 5 years ago

I support adding go-popcount, looks like a nice performance boost

richardartoul commented 5 years ago

Cool I'll add a P.R for go-popcount once #1766 lands since I'll need to rebase off that.

While I'm waiting for review on that, I started investigating the pooling stuff and wanted to solicit some feedback on the design.

What I'm thinking right now is turning this

type sliceContainers struct {
    keys          []uint64
    containers    []*Container
    lastKey       uint64
    lastContainer *Container
}

into this

type sliceContainers struct {
    keys          []uint64
    containers    []*Container
    lastKey       uint64
    lastContainer *Container

    containersPool *containersPool
}

type containersPool struct {
    containers  []*Container
    maxCapacity int
}

func (cp *containersPool) put(c *Container) {
    if cp == nil {
        // Ignore if pooling isn't configured.
        return
    }

    if len(cp.containers) >= cp.maxCapacity {
        // Don't allow the pool to exceed its configured capacity.
        return
    }
    cp.containers = append(cp.containers, c)
}

func (cp *containersPool) get() *Container {
    if cp == nil {
        // Allocate if pooling isn't configured.
        return NewContainer()
    }

    if len(cp.containers) > 0 {
        // If we have a pooled container available, use that.
        c := cp.containers[len(cp.containers)-1]
        cp.containers = cp.containers[:len(cp.containers)-1]
        return c
    }

    // Otherwise allocate.
    return NewContainer()
}

This also requires modifying the Reset method as follows:

func (sc *sliceContainers) Reset() {
    sc.keys = sc.keys[:0]

    for i := range sc.containers {
        // Try and return the containers to the pool if possible
        sc.containersPool.put(sc.containers[i])
        // Clear pointers to allow the GC to reclaim objects if these were
        // the only outstanding pointers.
        sc.containers[i] = nil
    }

    sc.containers = sc.containers[:0]
    sc.lastContainer = nil
    sc.lastKey = 0
}

and then replacing calls to NewContainer() with calls to sc.containersPool.get()

That way the change doesn't affect any existing users, but if they want to opt in they can use a new constructor, something like:

func NewBitmapWithPooling(poolingConfig PoolingConfiguration) *Bitmap
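For illustration, an opt-in call site might look something like this (the PoolingConfiguration field, the queries loop, and executeQuery are placeholders, not a settled API):

// Build one pooled bitmap per worker and recycle its containers between queries.
cfg := PoolingConfiguration{MaxPooledContainers: 1024} // hypothetical field
b := NewBitmapWithPooling(cfg)

for _, q := range queries {
    b.Reset() // returns containers to the bitmap's pool for re-use
    executeQuery(b, q)
}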

That all seems to work well, but I'll also need to add some kind of Reset method to the containers themselves. That's not too difficult, but it's not clear to me how I should handle containers where the mapped field is set to true. Could I get some explanation of how that field is set / works and what the right logic for handling it would be? My guess is that for anything being pooled we'll want to make sure the container remains unmapped. Is that correct?

seebs commented 5 years ago

I believe "mapped" is only ever set when loading a bitmap from a memory-mapped file, any bitmap we create is unmapped, and I don't think a bitmap ever becomes mapped if it didn't start out mapped, but don't trust me on that one. Any alteration to a container should unmap it.

richardartoul commented 5 years ago

@seebs That makes sense, but the mapped field is on the Container struct. Is there an individual mmap per container? Or I guess it's one mmap per bitmap, but because containers can be modified individually you have to track it at the container level.

Ok, that makes sense to me. As long as I make all my pooled containers with mapped = false, everything should work properly, and as an extra precaution I can call unmap on containers before returning them to the pool.
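One simple version of that precaution (a sketch only; it just drops mapped containers rather than unmapping them, and assumes the pool type from the earlier comment):

func (cp *containersPool) put(c *Container) {
    if cp == nil || c == nil {
        return
    }
    if c.mapped {
        // Never pool a container whose storage still points into a
        // memory-mapped region; drop it and let the GC handle it.
        return
    }
    if len(cp.containers) >= cp.maxCapacity {
        return
    }
    cp.containers = append(cp.containers, c)
}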

seebs commented 5 years ago

Yup, that's the trick; the mmaps are bitmap-level, but individual containers don't always stay aligned with the data that was read in with mmap.

richardartoul commented 5 years ago

Just pinging to see if there is any update on getting my P.R merged. Happy to make more changes if necessary, but I think it's in a relatively good place.

There are a few more perf improvements that I'd like to contribute once this is merged, and it would be nice to not have to temporarily maintain a Pilosa fork for M3DB's next release, although I understand your team has its own priorities and deadlines.

seebs commented 5 years ago

Hi! Sorry about the delay, there's a lot here to look over. Had another read through it, made a couple of comments. I'll check with people.

richardartoul commented 5 years ago

@seebs Cool I'll address all your feedback today. And no worries I understand, we really appreciate you helping us out.

richardartoul commented 5 years ago

Thanks for the merge! I'll get started on the other perf stuff soon.

richardartoul commented 5 years ago

... this is a little embarrassing but I introduced a bug into my UnionInPlace implementation during the refactoring I did when my P.R was being reviewed.

I have a fix for it here: #1774

Note that it only affects the new code path, so any existing code that is calling Union with a single argument is unaffected.

In addition, I added a property test that should catch any regressions like this in the future and make the code much more robust when performing refactors (the new test would have caught this issue, for example).
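The general shape of such a property test is to generate random inputs and check the optimized union against a trivially correct reference; a rough sketch (not the actual test from the PR, and the exact UnionInPlace signature is assumed):

import (
    "math/rand"
    "testing"

    "github.com/pilosa/pilosa/roaring"
)

func TestUnionInPlaceProperty(t *testing.T) {
    for i := 0; i < 100; i++ {
        // Generate a handful of bitmaps with random contents and track the
        // expected union in a plain map.
        others := make([]*roaring.Bitmap, 4)
        expected := make(map[uint64]struct{})
        for j := range others {
            others[j] = roaring.NewBitmap()
            for k := 0; k < 1000; k++ {
                v := uint64(rand.Int63n(1 << 20))
                others[j].Add(v)
                expected[v] = struct{}{}
            }
        }

        // Union them all into one bitmap and compare against the reference.
        b := roaring.NewBitmap()
        b.UnionInPlace(others...)
        if got := b.Count(); got != uint64(len(expected)) {
            t.Fatalf("expected %d bits set, got %d", len(expected), got)
        }
        for v := range expected {
            if !b.Contains(v) {
                t.Fatalf("union is missing value %d", v)
            }
        }
    }
}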

richardartoul commented 5 years ago

Happy new year! Just pinging to say I'm going to begin working on the pooling/reset code

richardartoul commented 5 years ago

Threw up my first draft: https://github.com/pilosa/pilosa/pull/1809

Let me know what you think of the design / implementation and we can take it from there.

richardartoul commented 5 years ago

Just pinging to see if you've had a chance to review the design / implementation!

jaffee commented 5 years ago

hey @richardartoul - thanks for all your work on this. We're pretty swamped, but we'll get someone to look at this pretty soon

seebs commented 5 years ago

Hi! I'm looking at this a bit. I have a couple of questions about the design.

First question: Why is the container pool per-bitmap? My first assumption would have been a single shared pool that was used by multiple bitmaps, or possibly just by all bitmaps using pooling. This seems like it would bypass the need for the Reset operation and associated complexity.

Second question: You are allocating all three backing stores for each pooled container. Did you consider having those be three separate pools? E.g., to convert from a bitmap to an array, you could grab an array-type container from the array pool, copy the data, dump the bitmap-type container into the bitmap pool and... oh. I was going to say "and then update the pointer in the slice/heap of containers", but of course, we don't know where our pointer is, and adding a **Container for "pointer to update when replacing" seems maddeningly complex, vulnerable to errors, and also bad for GC performance.

But I figure I'll mention the idea anyway in case you see an obvious way to fix that problem.

I haven't had a chance to go over it more carefully yet.

richardartoul commented 5 years ago

Yeah, so my current implementation is useful in the sense that it allows you to manually pool a small number of bitmaps and reuse the containers on them by calling Reset() on the bitmap. It's also (relatively) simple in implementation.

Another implementation would be to use something like sync.Pool to have a bunch of package-level pools that are just managed transparently to the user. This is nice for retrofitting existing libraries without breaking callers or asking them to make changes, but doesn't offer much control.

The final, and probably most flexible, option would be to do what one of my coworkers brought up in this issue a long time ago (#1630), which is to basically allow users to supply a container pool to bitmaps when they're instantiated. That way, if a user wants to pool the bitmaps they can, but they can also pool the containers that the bitmaps allocate by optionally passing a "container allocator" function to their bitmaps. More complicated, but probably a lot more flexible.
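For the sync.Pool option mentioned above, the shape would be roughly this (a sketch only; none of these names exist in the library today):

import "sync"

// A package-level pool of containers, shared transparently by every bitmap
// that opts in.
var containerPool = sync.Pool{
    New: func() interface{} { return NewContainer() },
}

func getContainer() *Container {
    return containerPool.Get().(*Container)
}

func putContainer(c *Container) {
    // Hypothetical reset: clear contents but keep the backing slices.
    c.Reset()
    containerPool.Put(c)
}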

Let me know what you think!

seebs commented 5 years ago

I guess my first question would be roughly: under what circumstances does the control of pooling offer much additional value? I don't immediately see cases where I benefit from isolating the container pools used by two bitmaps. It seems like containers are, generally, interchangeable. Except maybe in the case where a container is switching types.

This is where I sort of want a union data type; we know that array and bitmap containers both contain at most 8KB of storage, so they could in principle share that storage. That would reduce a lot of the potential excess memory usage from pooled containers switching types. (Although, hmm, you still need to have both exist at once to convert from one to the other.)

Okay, ignoring for a moment the amount of unsafe needed to pull it off: Consider if you will a container pool, but also an 8KB storage block pool. When converting a bitmap to an array or vice versa, you grab a new suitably-converted member of the 8KB storage block pool, and move the data into that, then return the previous data to the block pool.
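The reinterpretation itself is the easy part; a sketch of viewing one 8KB block as either container layout (assumes Go 1.17+ for unsafe.Slice, and glosses over the block pool and ownership entirely):

import "unsafe"

// blockViews shows how a single 8KB block could back either container type:
// 1024 uint64 words for a bitmap container, or up to 4096 uint16 values for
// an array container. Both returned slices alias the same memory.
func blockViews() ([]uint64, []uint16) {
    block := make([]uint64, 1024)

    asBitmap := block
    asArray := unsafe.Slice((*uint16)(unsafe.Pointer(&block[0])), 4096)
    return asBitmap, asArray
}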

I'm definitely convinced that you've demonstrated significant value for a container pool; I'm still waffling on what would make it most effective. I think the biggest risk right now is that, since pooling can imply higher memory usage (both array and bitmap allocated), using it for absolutely everything could be bad. But if we could bypass that...

jaffee commented 5 years ago

I'm gonna go ahead and close this ticket since the UnionInPlace stuff is in and seems to be working nicely. We can open something else to continue the pooling discussion.