melissachang opened this issue 5 years ago
The whole run took ~6 hours and needed ~23.5 GB of memory for the dict. That's not a good sign, but it's worth noting that the _bulk requests at the end took less than a minute in total.
I think by the end of today we can better optimize what is stored in the dict, and then we can really get this thing going fast.
Reopening because when I run on GKE in nhs-explorer-staging or biobank-explorer-staging, bq-indexer OOMs pretty quickly. Also, es-data has warnings.
I wonder if we're being inefficient with memory somewhere.
Or it could be that bq-indexer doesn't have much memory to work with because it's in the same pool as Elasticsearch. deploy-indexer.sh could create a new pool (and delete it afterwards). I can look into this more next week.
Can I get access to the project?
While our first attempt at storing everything in memory works well for smaller datasets, it doesn't scale well in general.
One key insight we had was that document updates happen per participant. If we can bulk upload all the data for one participant at a time, we minimize the amount of slow updating we have to do.
The more straightforward approach to achieve this is to group/order each table export by participant_id, and do the iterated bulk indexing from this.
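A rough sketch of what that could look like, assuming each export is newline-delimited JSON already sorted by participant_id (file, field, and index names are made up; this is not the actual indexer code):

```python
# Sketch of the first approach: every table export is newline-delimited JSON
# sorted by participant_id, so merging the sorted exports lets us build each
# participant document exactly once. File, field, and index names are made up.
import heapq
import itertools
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def sorted_rows(path):
    """Yield rows (dicts) from one table export sorted by participant_id."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)


def participant_actions(export_paths, index_name):
    """Merge all exports by participant_id and emit one index action each."""
    by_participant = lambda row: row['participant_id']
    merged = heapq.merge(*(sorted_rows(p) for p in export_paths), key=by_participant)
    for participant_id, rows in itertools.groupby(merged, key=by_participant):
        doc = {}
        for row in rows:
            doc.update(row)  # a real indexer would nest sample rows instead
        yield {
            '_op_type': 'index',
            '_index': index_name,
            '_id': participant_id,
            '_source': doc,
        }


es = Elasticsearch()
bulk(es, participant_actions(['participants.json', 'samples.json'], 'my_dataset'))
```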
A potentially better way to do this is to join the tables in BigQuery before exporting. This would minimize the number of updates you need to do, but writing a dynamic query to do this may be tricky.
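For the join variant, the dynamic query might be built something like this. The table names, the participant_id join key, and the LEFT JOIN/USING shape are all assumptions; a real query would likely need explicit column lists to avoid duplicate column names across tables:

```python
# Hypothetical helper that builds the dynamic BigQuery join. Table names and
# the participant_id join key are assumptions, not the real schema.
def build_join_query(table_names, join_key='participant_id'):
    base, *others = table_names
    lines = ['SELECT *', 'FROM `{}` AS t0'.format(base)]
    for i, table in enumerate(others, start=1):
        lines.append('LEFT JOIN `{}` AS t{} USING ({})'.format(table, i, join_key))
    lines.append('ORDER BY {}'.format(join_key))
    return '\n'.join(lines)


print(build_join_query([
    'melchang-test-project-3.encode.participants',
    'melchang-test-project-3.encode.samples',
]))
```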
Also, we should look into using the newer API (the BigQuery Storage API) to query the tables. We currently export a JSON file to GCS: https://cloud.google.com/bigquery/docs/reference/storage/
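Roughly what reading straight from the Storage API looks like with google-cloud-bigquery-storage (the project/dataset/table here are placeholders, and the exact client surface differs a bit between library versions):

```python
# Sketch of streaming rows via the BigQuery Storage Read API instead of
# exporting JSON to GCS. Project/dataset/table names are placeholders; needs
# google-cloud-bigquery-storage (and fastavro for the AVRO format).
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()
table = 'projects/melchang-test-project-3/datasets/encode/tables/samples'

session = client.create_read_session(
    parent='projects/melchang-test-project-3',
    read_session=types.ReadSession(table=table, data_format=types.DataFormat.AVRO),
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)  # each row is a dict keyed by column name
```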
And for testing this, Melissa has a good copy of a dataset in melchang-test-project-3:encode
Here's where the Data Explorer API server generates SQL for the selected cohort. The code is difficult to read. Let's try the first solution; I think the resulting code will be easier to read.
Note that only sample tables need to be grouped by participant ID. Participant tables can keep being handled the way they are now.
Current:
Say there are 1000 participants and 10 tables. Step 3 will run 10 times: the first time, 1000 documents will be created. The subsequent 9 times, the 1000 documents will be updated (new fields added).
Elasticsearch doesn't handle updates well. From here:
I believe this causes the subsequent 9 updates to get slower and slower.
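For concreteness, the current flow is roughly shaped like this (illustrative only, not the actual indexer code): every table pass issues a partial update for every participant document.

```python
# Illustrative shape of the current flow (not the actual indexer code): each
# table pass issues a partial update for every participant, so with 10 tables
# every document gets rewritten ~10 times.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Two tiny fake "table exports"; the real rows come from the BigQuery exports.
exported_tables = [
    [{'participant_id': 'p1', 'age': 30}, {'participant_id': 'p2', 'age': 41}],
    [{'participant_id': 'p1', 'smoker': False}, {'participant_id': 'p2', 'smoker': True}],
]


def update_actions(table_rows, index_name):
    for row in table_rows:
        yield {
            '_op_type': 'update',
            '_index': index_name,
            '_id': row['participant_id'],
            'doc': {k: v for k, v in row.items() if k != 'participant_id'},
            'doc_as_upsert': True,
        }


es = Elasticsearch()
for table_rows in exported_tables:  # one full update pass per table
    bulk(es, update_actions(table_rows, 'my_dataset'))
```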
Proposed:
This should fix #78. Participant ENCDO000AAD has 23k files, so the ENCDO000AAD document is currently being updated up to 23k times. With the proposed solution, the document is created once and never updated.
The downside is that we use more memory, since we store all the documents in a dict instead of using a generator. We should test this out on NHS (2G) and UKBB (7G). It's OK if it's no longer possible to index those on local machines. It's also OK if we need an expensive machine type on GKE, since indexing is temporary.
This won't help NHS and UKBB, since they each have just one table. But it may help Baseline and should help Encode.
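A sketch of the proposed shape (again illustrative, with made-up field and index names): everything is accumulated in a dict first, and the final bulk pass only ever creates documents.

```python
# Sketch of the proposed flow: build every participant document in memory
# (including nested sample rows), then index each document exactly once.
# Field and index names are made up for illustration.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

docs = {}  # participant_id -> full document; this dict is the memory cost


def add_participant_row(row):
    docs.setdefault(row['participant_id'], {}).update(row)


def add_sample_row(row):
    doc = docs.setdefault(row['participant_id'], {})
    # Nested sample rows are appended in memory, so no update scripts are
    # needed at index time.
    doc.setdefault('samples', []).append(row)


def index_actions(index_name):
    # Still a generator over the dict, so the bulk helper streams the requests
    # even though the documents themselves all live in memory.
    for participant_id, doc in docs.items():
        yield {
            '_op_type': 'index',
            '_index': index_name,
            '_id': participant_id,
            '_source': doc,
        }


# After streaming all table exports through the add_* helpers above:
es = Elasticsearch()
bulk(es, index_actions('my_dataset'))
```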
This will simplify our samples code; scripts are no longer needed (scripts are only needed for updating nested documents).
The first step would be to try this out on Baseline internal and see if it makes indexing faster.