melissachang opened this issue 5 years ago
The whole run took ~6 hours and needed ~23.5 GB of memory for the dict. That's not a good sign, but it's worth noting that the _bulk requests at the end took less than a minute in total.
I think by the end of today we can better optimize what is stored in the dict, and then we can really get this thing going fast.
Reopening because when I run on GKE in nhs-explorer-staging or biobank-explorer-staging, bq-indexer OOMs pretty quickly. Also, es-data has warnings.
I wonder if we're being inefficient with memory somewhere.
Or it could be that bq-indexer doesn't have much memory to work with because it's in the same pool as Elasticsearch. deploy-indexer.sh could create a new pool (and delete it afterwards). I can look into this more next week.
Can I get access to the project?
While our first attempt at storing everything in memory works well for smaller datasets, it doesn't scale well in general.
One key insight we had was that document updates happen per participant. If we can bulk upload all the data for one participant at a time, we minimize the amount of slow updating we have to do.
The more straightforward approach to achieve this is to group/order each table export by participant_id, and do the iterated bulk indexing from this.
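A rough sketch of what that could look like, assuming each export is newline-delimited JSON already sorted by participant_id (file, field, and index names are made up; this is not the actual indexer code):

```python
# Sketch of the first approach: every table export is newline-delimited JSON
# sorted by participant_id, so merging the sorted exports lets us build each
# participant document exactly once. File, field, and index names are made up.
import heapq
import itertools
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def sorted_rows(path):
    """Yield rows (dicts) from one table export sorted by participant_id."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)


def participant_actions(export_paths, index_name):
    """Merge all exports by participant_id and emit one index action each."""
    by_participant = lambda row: row['participant_id']
    merged = heapq.merge(*(sorted_rows(p) for p in export_paths), key=by_participant)
    for participant_id, rows in itertools.groupby(merged, key=by_participant):
        doc = {}
        for row in rows:
            doc.update(row)  # a real indexer would nest sample rows instead
        yield {
            '_op_type': 'index',
            '_index': index_name,
            '_id': participant_id,
            '_source': doc,
        }


es = Elasticsearch()
bulk(es, participant_actions(['participants.json', 'samples.json'], 'my_dataset'))
```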
A potentially better way to do this is to join the tables in BigQuery before exporting. This would minimize the number of updates you need to do, but writing a dynamic query to do this may be tricky.
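For the join variant, the dynamic query might be built something like this. The table names, the participant_id join key, and the LEFT JOIN/USING shape are all assumptions; a real query would likely need explicit column lists to avoid duplicate column names across tables:

```python
# Hypothetical helper that builds the dynamic BigQuery join. Table names and
# the participant_id join key are assumptions, not the real schema.
def build_join_query(table_names, join_key='participant_id'):
    base, *others = table_names
    lines = ['SELECT *', 'FROM `{}` AS t0'.format(base)]
    for i, table in enumerate(others, start=1):
        lines.append('LEFT JOIN `{}` AS t{} USING ({})'.format(table, i, join_key))
    lines.append('ORDER BY {}'.format(join_key))
    return '\n'.join(lines)


print(build_join_query([
    'melchang-test-project-3.encode.participants',
    'melchang-test-project-3.encode.samples',
]))
```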
Also, we should look into using the newer API (the BigQuery Storage API) to query the tables. We currently export a JSON file to GCS: https://cloud.google.com/bigquery/docs/reference/storage/
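Roughly what reading straight from the Storage API looks like with google-cloud-bigquery-storage (the project/dataset/table here are placeholders, and the exact client surface differs a bit between library versions):

```python
# Sketch of streaming rows via the BigQuery Storage Read API instead of
# exporting JSON to GCS. Project/dataset/table names are placeholders; needs
# google-cloud-bigquery-storage (and fastavro for the AVRO format).
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()
table = 'projects/melchang-test-project-3/datasets/encode/tables/samples'

session = client.create_read_session(
    parent='projects/melchang-test-project-3',
    read_session=types.ReadSession(table=table, data_format=types.DataFormat.AVRO),
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)  # each row is a dict keyed by column name
```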
And for testing this, Melissa has a good copy of a dataset in melchang-test-project-3:encode
Here's where the Data Explorer API server generates SQL for the selected cohort. The code is difficult to read. Let's try the first solution; I think the resulting code will be easier to read.
Note that only sample tables need to be grouped by participant ID. Participant tables can keep being handled the way they are now.
Current:
Say there are 1000 participants and 10 tables. Step 3 will run 10 times: the first time, 1000 documents will be created. The subsequent 9 times, the 1000 documents will be updated (new fields added).
Elasticsearch doesn't handle updates well. From here:
I believe this causes the subsequent 9 updates to get slower and slower.
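For concreteness, the current flow is roughly shaped like this (illustrative only, not the actual indexer code): every table pass issues a partial update for every participant document.

```python
# Illustrative shape of the current flow (not the actual indexer code): each
# table pass issues a partial update for every participant, so with 10 tables
# every document gets rewritten ~10 times.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Two tiny fake "table exports"; the real rows come from the BigQuery exports.
exported_tables = [
    [{'participant_id': 'p1', 'age': 30}, {'participant_id': 'p2', 'age': 41}],
    [{'participant_id': 'p1', 'smoker': False}, {'participant_id': 'p2', 'smoker': True}],
]


def update_actions(table_rows, index_name):
    for row in table_rows:
        yield {
            '_op_type': 'update',
            '_index': index_name,
            '_id': row['participant_id'],
            'doc': {k: v for k, v in row.items() if k != 'participant_id'},
            'doc_as_upsert': True,
        }


es = Elasticsearch()
for table_rows in exported_tables:  # one full update pass per table
    bulk(es, update_actions(table_rows, 'my_dataset'))
```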
Proposed:
This should fix #78. Participant ENCDO000AAD has 23k files, so the ENCDO000AAD document is currently being updated up to 23k times. With the proposed solution, the document is created once and never updated.
The downside is that we use more memory, since we store all the documents in a dict instead of using a generator. We should test this out on NHS (2G) and UKBB (7G). It's OK if it's no longer possible to index those on local machines. It's also OK if we need an expensive machine type on GKE, since indexing is temporary.
This won't help NHS and UKBB, since they each have just one table. But it may help Baseline and should help Encode.
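A sketch of the proposed shape (again illustrative, with made-up field and index names): everything is accumulated in a dict first, and the final bulk pass only ever creates documents.

```python
# Sketch of the proposed flow: build every participant document in memory
# (including nested sample rows), then index each document exactly once.
# Field and index names are made up for illustration.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

docs = {}  # participant_id -> full document; this dict is the memory cost


def add_participant_row(row):
    docs.setdefault(row['participant_id'], {}).update(row)


def add_sample_row(row):
    doc = docs.setdefault(row['participant_id'], {})
    # Nested sample rows are appended in memory, so no update scripts are
    # needed at index time.
    doc.setdefault('samples', []).append(row)


def index_actions(index_name):
    # Still a generator over the dict, so the bulk helper streams the requests
    # even though the documents themselves all live in memory.
    for participant_id, doc in docs.items():
        yield {
            '_op_type': 'index',
            '_index': index_name,
            '_id': participant_id,
            '_source': doc,
        }


# After streaming all table exports through the add_* helpers above:
es = Elasticsearch()
bulk(es, index_actions('my_dataset'))
```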
This will simplify our samples code; scripts are no longer needed (scripts are only needed for updating nested documents).
The first step would be to try this out on Baseline internal and see if it makes indexing faster.