Use Elasticsearch bulk indexing

dpford commented 9 years ago

WARNING: This is a breaking update that requires a small change to processor files. The corresponding PRs (which I'll also be opening) should be merged at the same time as this one, and will be listed here shortly.

How one developer from Washington, DC cut his Elasticsearch indexing time in half. Sysadmins hate him!

Two main changes:

Update version of py-elasticsearch

According to the docs, we have been using the wrong version. Since we're all running Elasticsearch > 1.0, we need a 1.X.X version of the Python Elasticsearch client.

Please pip install -r requirements.txt

Switch to using bulk indexing

Previously, we made one request for every document we indexed. With this PR, we use bulk indexing to send documents in larger chunks (not literally a single request, but the same idea applies). This means fewer round trips to Elasticsearch, so indexing is faster and more efficient.
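For anyone who hasn't used the bulk helper, here is a minimal sketch of the idea, not the actual sheer code. It assumes the `elasticsearch.helpers.bulk` helper from the Python client and the 1.x-era action format, which includes `_type`. The names `build_actions`, `bulk_index`, `index_name`, and `doc_type` are hypothetical and only for illustration.

```python
def build_actions(documents, index_name, doc_type):
    """Yield one bulk action per (doc_id, body) pair.

    Hypothetical helper: `documents` is any iterable of (id, dict) pairs.
    """
    for doc_id, body in documents:
        yield {
            "_op_type": "index",   # create the document, or replace it if it exists
            "_index": index_name,
            "_type": doc_type,     # assumption: 1.x-era clients expect a doc type
            "_id": doc_id,
            "_source": body,
        }


def bulk_index(es, documents, index_name, doc_type, chunk_size=500):
    """Send the documents to Elasticsearch in chunks instead of one by one."""
    # Imported here so build_actions can be used without the client installed.
    from elasticsearch import helpers

    # helpers.bulk batches the actions into requests of `chunk_size` documents.
    return helpers.bulk(
        es,
        build_actions(documents, index_name, doc_type),
        chunk_size=chunk_size,
    )
```

With an `Elasticsearch()` client instance, calling `bulk_index(es, docs, "content", "posts")` would replace a loop of per-document requests. The index and type names here are made up.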

We're also now using the index method instead of A) trying the create method, then B) falling back on the update method if the document is already there. index creates the document if it doesn't exist and replaces it if it does, so the fallback is no longer needed.
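To make the difference concrete, here is a rough sketch of the two approaches side by side. It is an illustration under assumptions, not the code from this PR. It uses the client's `create`, `update`, and `index` methods with 1.x-era keyword arguments (including `doc_type`). The function names `save_with_fallback` and `save_with_index` are hypothetical, and the partial-update body is a guess at what the old fallback looked like.

```python
try:
    from elasticsearch.exceptions import ConflictError
except ImportError:
    # Stand-in so the sketch can be read and run without the client installed.
    class ConflictError(Exception):
        pass


def save_with_fallback(es, index_name, doc_type, doc_id, body):
    """Old approach (assumed shape): try create, fall back to update."""
    try:
        es.create(index=index_name, doc_type=doc_type, id=doc_id, body=body)
    except ConflictError:
        # The document already exists, so issue a second request to update it.
        es.update(index=index_name, doc_type=doc_type, id=doc_id,
                  body={"doc": body})


def save_with_index(es, index_name, doc_type, doc_id, body):
    """New approach: a single call that creates or replaces the document."""
    es.index(index=index_name, doc_type=doc_type, id=doc_id, body=body)
```

One thing to keep in mind: index replaces the whole stored document, whereas a partial update like the one sketched above merges fields into it.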

Review: @rosskarchner @kurtw

kurtrwall commented 9 years ago

http://blog.99jobs.com/wp-content/uploads/2014/07/really-nice-o.gif

rosskarchner commented 9 years ago

ohhh, nice

willbarton commented 9 years ago

:+1:

cfarm commented 9 years ago

Is this change going to production before June 29th? We are planning a release that depends on sheer on that date, with subsequent releases in the 3 weeks after that. @rosskarchner

cfpb / sheer