chanzuckerberg / czid-web

Infectious Disease Sequencing Platform
https://czid.org/
MIT License

Async Elasticsearch #3355

Closed morsecodist closed 4 years ago

morsecodist commented 4 years ago

Description

This PR fixes issues that caused missing records in Elasticsearch. It also makes our Elasticsearch indexing asynchronous.

Issues

These issues occurred because we add records to Elasticsearch with [elasticsearch-model](https://github.com/elastic/elasticsearch-rails/tree/master/elasticsearch-model), which relies on Active Record callbacks. Any modification that circumvents the Active Record callback flow results in missing data in Elasticsearch.
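To illustrate the failure mode, here is a minimal plain-Ruby sketch that needs neither Rails nor Elasticsearch. `FakeIndex`, `Record`, and `missing_from_index` are hypothetical stand-ins invented for this example, not classes from elasticsearch-model or from this codebase. It shows only that a write which skips the callback path never reaches the index.

```ruby
# Hypothetical stand-ins: a Hash plays the database table and
# FakeIndex plays Elasticsearch. Nothing here is a real library API.
class FakeIndex
  attr_reader :docs

  def initialize
    @docs = {}
  end

  def index(id, body)
    @docs[id] = body
  end
end

class Record
  attr_reader :id, :name

  def initialize(id, name, index)
    @id = id
    @name = name
    @index = index
  end

  # Callback path: write the row, then run the indexing hook
  # (this stands in for an Active Record commit callback).
  def save(table)
    table[id] = name
    @index.index(id, { name: name })
  end

  # Bypass path: write the row directly, as raw SQL or a bulk insert would.
  # No callback runs, so nothing is indexed.
  def self.raw_insert(table, id, name)
    table[id] = name
  end
end

# IDs present in the table but absent from the index.
def missing_from_index(table, index)
  table.keys - index.docs.keys
end

table = {}
index = FakeIndex.new
Record.new(1, "via callback", index).save(table)
Record.raw_insert(table, 2, "via raw SQL")

puts "rows missing from index: #{missing_from_index(table, index).inspect}"
# prints: rows missing from index: [2]
```

A check like `missing_from_index` is also a cheap way to detect drift between the database and the index, assuming both sides can list their IDs.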

Taxon Lineages

These are only updated via the task update_lineage_db. About a year ago this task was modified to use raw SQL, which skips the callbacks, so we have been missing updates since then. To fix this I added a step to re-index after this job runs. It is a little slow, but we can index the whole thing in under 30 minutes and this task only runs once every 28 days, so I feel it isn't a huge deal.

Metadata Bulk Import

To do bulk metadata importing we use https://github.com/zdennis/activerecord-import, which circumvents the Active Record callback flow. This is necessary to avoid blocking synchronous updates that would make the feature very slow. I created a version of this method that also indexes in Elasticsearch. However, without the async change in this PR, that version would still involve blocking synchronous updates.
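The shape of a bulk import that also indexes can be sketched in plain Ruby as below. This is an assumption-laden illustration, not the method added in this PR: `FakeTable`, `FakeBulkIndex`, and `bulk_import_with_index` are hypothetical names, and the in-memory classes stand in for the database and for Elasticsearch rather than calling activerecord-import or any Elasticsearch client.

```ruby
# Hypothetical in-memory stand-ins; not the activerecord-import
# or Elasticsearch APIs.
class FakeTable
  attr_reader :rows, :insert_calls

  def initialize
    @rows = {}
    @insert_calls = 0
  end

  # One call inserts many rows and runs no per-row callbacks.
  def bulk_insert(records)
    @insert_calls += 1
    records.each { |r| @rows[r[:id]] = r }
  end
end

class FakeBulkIndex
  attr_reader :docs, :bulk_calls

  def initialize
    @docs = {}
    @bulk_calls = 0
  end

  # One call indexes many documents, in the spirit of a bulk request.
  def bulk_index(records)
    @bulk_calls += 1
    records.each { |r| @docs[r[:id]] = r }
  end
end

# Insert the batch, then index the same batch explicitly,
# because no callbacks fired during the insert.
def bulk_import_with_index(records, table, index)
  table.bulk_insert(records)
  index.bulk_index(records)
  records.size
end

records = [
  { id: 1, key: "collection_date", value: "2020-01" },
  { id: 2, key: "sample_type", value: "serum" }
]
table = FakeTable.new
index = FakeBulkIndex.new
imported = bulk_import_with_index(records, table, index)
puts "imported #{imported} rows with #{table.insert_calls} insert call and #{index.bulk_calls} index call"
```

The point of the design is that the number of round trips stays constant as the batch grows, while the index still receives every row. In this sketch the indexing step is still synchronous, which matches the caveat above.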

Async Elasticsearch

Currently, a write is not considered complete until the Elasticsearch index has been updated. This slows down all of our database interactions with these tables. It also fails database operations even when the re-index is not required right away and could have been retried. The elasticsearch-model docs recommend async updates. The new async callbacks also have logging and alerting, so we can debug missing records in Elasticsearch in the future and know when to initiate a re-index.
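The idea can be sketched with the Ruby standard library alone. This is a minimal illustration under stated assumptions, not the implementation in this PR: `AsyncIndexer` and `FlakyIndex` are hypothetical names, a `Thread::Queue` stands in for whatever job backend is actually used (the description does not say), and `warn` stands in for the real logging and alerting.

```ruby
# Hypothetical index that fails a fixed number of times before succeeding.
class FlakyIndex
  attr_reader :docs

  def initialize(failures_before_success)
    @docs = {}
    @remaining_failures = failures_before_success
  end

  def index(id, body)
    if @remaining_failures > 0
      @remaining_failures -= 1
      raise IOError, "index unavailable"
    end
    @docs[id] = body
  end
end

# Hypothetical async indexer: the write path only enqueues,
# and a worker thread indexes with retries.
class AsyncIndexer
  attr_reader :failed_ids

  def initialize(index, max_attempts: 3)
    @index = index
    @max_attempts = max_attempts
    @queue = Thread::Queue.new
    @failed_ids = []
    @worker = Thread.new { run }
  end

  # Called from the write path; returns without waiting for the index.
  def enqueue(id, body)
    @queue << [id, body]
  end

  # Stop accepting jobs and wait for the worker to drain the queue.
  def shutdown
    @queue.close
    @worker.join
  end

  private

  def run
    # pop returns nil once the queue is closed and empty.
    while (job = @queue.pop)
      id, body = job
      attempts = 0
      begin
        attempts += 1
        @index.index(id, body)
      rescue IOError => e
        retry if attempts < @max_attempts
        # Record the failure so a later re-index can pick it up.
        warn "indexing failed for id=#{id} after #{attempts} attempts: #{e.message}"
        @failed_ids << id
      end
    end
  end
end

index = FlakyIndex.new(2)           # first two index calls fail
indexer = AsyncIndexer.new(index)
indexer.enqueue(1, { name: "sample" })
indexer.shutdown
puts "indexed ids: #{index.docs.keys.inspect}, failed ids: #{indexer.failed_ids.inspect}"
# prints: indexed ids: [1], failed ids: []
```

With this shape, a transient Elasticsearch failure no longer fails the database write. It either succeeds on retry or ends up in `failed_ids`, which is the signal to re-index.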

Tests

Tested all of the relevant models, the bulk imports, and the rake task.

gregdingle commented 4 years ago

Also I'm wondering what is the plan to update the existing prod and staging DBs.

morsecodist commented 4 years ago

> Also I'm wondering what is the plan to update the existing prod and staging DBs.

We can just run the import commands at any time after shipping this.