Kixiron / lasso

A fast, concurrent string interner
Apache License 2.0
139 stars 19 forks source link

Allocation failures due to enormous (many TB) allocation attempts #48

Open saethlin opened 5 months ago

saethlin commented 5 months ago

We've been using ThreadedRodeo in production for a while to implement a global string interner. For the most part it works great, but recently when we increased the concurrency of our workload from 10 CPUs to 64 CPUs we started seeing sporadic allocation failures, due to attempts to mmap between 2 TB and a few PB. And even when there isn't a crash, we had absurdly high virtual memory usage, commonly 50 GB of resident set size would come with 2 TB virtual memory usage.

I believe the root cause is that lasso is missing some synchronization in its lockfree allocation strategy. Or alternatively, that the entire design of this lockfree allocation is not viable. The memory for ThreadedRodeo strings in the 0.7 release series is allocated in a lock-free strategy using a linked list of AtomicBucket. This is the responsible code: https://github.com/Kixiron/lasso/blob/aed4fb5f30f53d43a7f5b0b5b83777ff315fa207/src/arenas/lockfree.rs#L174-L190

The ideal behavior of this data structure is for one thread to get here at a time, double the capacity, then stick its allocated slab onto the linked list. But if many threads race to this point in the code, they will not realize that there is another thread allocating (that's the whole point!) and they will double the size in sequence and allocate that doubled size.

If 64 threads race on this code in perfect synchrony, this would attempt to allocate more than the entire address space.

We've worked around this for now by forking lasso and back-porting some changes from 0.7.2 onto 0.6.0.