gaffneyc / sunspot-queue

Background search indexing using existing worker systems
MIT License

Support for concurrency throttling and document batching #20

Open nz opened 9 years ago

nz commented 9 years ago

Solr processes updates most efficiently when it receives documents in batches at a relatively low concurrency. Highly parallel single-document updates are probably one of the easiest ways to starve the CPU and send Solr into a GC death spiral.

Some support for limiting the concurrency would be welcome. Buffering updates into batches would be even better. The implementations here may vary in difficulty (or possibility) based on the background job system in question.

For example, in Sidekiq, a simple Mutex could do the trick. That's the easiest option here: it limits concurrency to one Solr update job per Sidekiq process, though it's a bit harsh. Sidekiq's built-in support for connection pooling can help loosen that up a bit. Better still, individual jobs could append to an in-memory Queue (or be re-enqueued into another Redis list), which can then be processed in batches by a single Sunspot worker thread started by the class.
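To make the in-memory Queue idea a bit more concrete, here is a rough sketch in plain Ruby. It is not sunspot-queue or Sidekiq code: `BatchIndexer`, `batch_size`, and the block that receives each batch are all made-up names for illustration. In a real app the block would presumably hand the batch to Sunspot (something along the lines of `Sunspot.index(docs)`), but that wiring is an assumption and is left out here.

```ruby
# Hypothetical sketch: BatchIndexer is an illustrative name, not part of
# sunspot-queue, Sunspot, or Sidekiq. Job threads push documents onto an
# in-memory Queue; a single worker thread drains it in batches.
class BatchIndexer
  def initialize(batch_size: 100, &sink)
    @batch_size = batch_size
    @sink = sink            # block that receives each batch (e.g. sends it to Solr)
    @queue = Queue.new      # thread-safe FIFO from the Ruby core library
    @worker = Thread.new { run }
  end

  # Called from job threads. Documents must be non-nil, since nil is what the
  # worker sees once the queue has been closed and drained.
  def push(doc)
    @queue << doc
  end

  # Stop accepting documents, let the worker flush what is left, then wait.
  def shutdown
    @queue.close
    @worker.join
  end

  private

  def run
    # Queue#pop blocks until a document arrives and returns nil once the
    # queue is closed and empty, which ends the loop.
    while (doc = @queue.pop)
      batch = [doc]
      # This thread is the only consumer, so a non-blocking pop cannot raise
      # after the empty? check.
      batch << @queue.pop(true) while batch.size < @batch_size && !@queue.empty?
      @sink.call(batch)
    end
  end
end

# Usage: collect batches in an array instead of talking to Solr.
batches = []
indexer = BatchIndexer.new(batch_size: 3) { |docs| batches << docs }
10.times { |i| indexer.push(i) }
indexer.shutdown
```

Because only one thread ever calls the block, Solr would see at most one update in flight per process, which gives the same throttling as the Mutex approach while also grouping documents. The trade-off is that anything still sitting in the in-memory queue is lost if the process dies, which is one reason the Redis list variant may be preferable.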

We see this particular use case all the time over at websolr.com, so I'd be happy to help support some efforts to get nicer batched queuing out into the real world for Sunspot users. This gem looks like the best bet for that! I don't have time right this second to contribute code, but I can certainly answer questions and help sketch out the requirements and design.