CAVaccineInventory / vial

The Django application powering calltheshots.us
https://vial.calltheshots.us
MIT License

Optimize /api/importSourceLocations #583

Open simonw opened 3 years ago

simonw commented 3 years ago

It gets hit with 160,000 writes (batched 500 at a time) every six hours and it hasn't had any optimization work done at all.

Discord: https://discord.com/channels/799147121357881364/824443781990187008/843889279763218452 and https://discord.com/channels/799147121357881364/813861006718926848/843934864397828156

simonw commented 3 years ago

One idea is to have our clients calculate hashes for every imported item, store those in VIAL, and provide a supported mechanism for those clients to maintain their own cache of those hashes - so they can avoid sending us data that we already have.
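
A minimal sketch of what that client-side cache could look like, in Python. The content_hash helper, the excluded field names, and the cache file are all hypothetical, not an existing VIAL or ingestion API:

import hashlib
import json
from pathlib import Path

# Hypothetical local cache file mapping source location id -> last-sent hash
CACHE_PATH = Path("source_location_hashes.json")

def content_hash(record: dict) -> str:
    """Stable hash of a record: sorted keys, timestamp-style fields dropped."""
    stripped = {k: v for k, v in record.items() if k not in {"fetched_at", "published_at"}}
    return hashlib.sha256(json.dumps(stripped, sort_keys=True).encode("utf-8")).hexdigest()

def filter_unchanged(records: list[dict]) -> list[dict]:
    """Return only the records whose hash differs from what we sent last time."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    return [r for r in records if cache.get(r["id"]) != content_hash(r)]

def remember_sent(records: list[dict]) -> None:
    """Call after a successful import so unchanged records are skipped next run."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    cache.update({r["id"]: content_hash(r) for r in records})
    CACHE_PATH.write_text(json.dumps(cache))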

simonw commented 3 years ago

There is definitely low-hanging fruit within the endpoint itself though: https://github.com/CAVaccineInventory/vial/blob/c33b0d0cfff4847073f9a29bb6f16d5572dd2148/vaccinate/api/views.py#L237-L324

simonw commented 3 years ago

Honeycomb: https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/t2XAaoUCxeh

(Screenshot: Honeycomb query results for vial-production)
bryanculbertson commented 3 years ago

I am going to test out doing content hashes within ingestion by downloading all loaded source locations, calculating a content hash e.g. sorted keys with no timestamp fields, and then skipping incoming locations with the same hash value.

If this plan works then I would like a place to store these content hashes on source_location so I don't need to compute them again. But I think I can get the majority of the speedup without VIAL needing to store the content hash.
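
A rough sketch of that skip logic, assuming the loaded source locations can be put into the same canonical shape as the incoming ones, and reusing a sorted-keys, timestamp-free content_hash like the one sketched earlier in this thread:

def skip_unchanged(incoming: list[dict], loaded: list[dict]) -> list[dict]:
    """Drop incoming locations whose content hash matches an already-loaded one."""
    loaded_hashes = {loc["id"]: content_hash(loc) for loc in loaded}
    return [
        loc for loc in incoming
        if loaded_hashes.get(loc["id"]) != content_hash(loc)
    ]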

shashank025 commented 3 years ago

Suggestion: instead of creating content hashes in the application layer, add server logic to leverage Postgres' ETL capability, which basically amounts to:

-- Step 1: temp table with all uploaded data
-- Modify /api/importSourceLocations to insert
-- into this temp table in a streaming fashion
-- (very low memory overhead)
create temp table tmp_new_data as
    ...;

begin;  -- start transactions as late as possible

-- Step 2: update rows that have changed
update destination_table d
   set
    col_a = t.col_a,
    col_b = t.col_b,
    ...
  from tmp_new_data t
 where
    t.pkey_col_1 = d.pkey_col_1
    and ...
    and (t.col_a, t.col_b, ...) IS DISTINCT FROM (d.col_a, d.col_b, ...);

-- Step 3: insert new rows
insert into destination_table
select t.*
  from tmp_new_data t
 where not exists (
    select 1 from destination_table d
     where d.pkey_col_1 = t.pkey_col_1 and ...
);

commit;

-- Step 4: clean up
drop table tmp_new_data;

Sorry for the "drive-by" design, but I'm happy to collaborate more on this. I've written a bunch of ETL in past lives, and this is basically that.

shashank025 commented 3 years ago

The idea is that Postgres (and RDBMSes in general) does really well when operated in bulk fashion (as opposed to inserting/updating one row at a time), and the above suggestion does exactly that.

bryanculbertson commented 3 years ago

@shashank025 Bulk loading into a temp table and then selectively copying into the main table is a great idea. It would speed up bulk loading in general. We would probably still want content hashes, though, to avoid having to compare every field individually, especially since some of those comparisons would be against fields inside a JSON column.

If the VIAL authors have time to look into bypassing the ORM to try this temp-table load technique, that would be great. When I have done this in the past I used psycopg2's copy_expert to stream data directly from a file object into a temp table, which had very low memory overhead, but that may be overkill for the <10MB files we have.
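
For reference, a minimal sketch of that copy_expert approach. The table and column names are illustrative rather than VIAL's actual schema, and this simplified version builds the CSV in memory instead of streaming from a real file object:

import csv
import io
import json

def copy_to_temp_table(conn, records):
    """Bulk-load records into a temp table with COPY (psycopg2), ready to merge."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for record in records:
        writer.writerow([record["id"], json.dumps(record)])
    buf.seek(0)
    with conn.cursor() as cur:
        cur.execute(
            "create temp table tmp_new_data (source_uid text, import_json jsonb)"
        )
        # copy_expert streams from the file object, keeping memory overhead low
        cur.copy_expert(
            "copy tmp_new_data (source_uid, import_json) from stdin with (format csv)",
            buf,
        )
        # ...then run the update / insert-where-not-exists statements from above
    conn.commit()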

In the meantime, here is how it looks when we calculate content hashes in ingestion. I think this solves the broken load problem well enough for now: https://github.com/CAVaccineInventory/vaccine-feed-ingest/pull/661

simonw commented 3 years ago

The problem with dropping down to raw SQL for this is that we then need to keep that SQL in sync with further changes we make to the database models - operating via the ORM protects us from having to think about that.

Keeping it in sync isn't impossible though: with comprehensive automated tests around this, I'm confident that any future change that broke the raw SQL optimization would at least cause easily fixed test failures.
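
One cheap guard along those lines, as a sketch only (the column set and import path are hypothetical): keep the list of columns the raw SQL touches next to the SQL itself, and assert in a test that they all still exist on the model.

from core.models import SourceLocation  # assumed import path, illustrative

# Hypothetical set of columns the raw-SQL import path would reference; keep it
# next to the SQL it mirrors so the two get edited together.
RAW_SQL_COLUMNS = {"source_uid", "source_name", "name", "import_json"}

def test_raw_sql_columns_still_exist_on_model():
    model_columns = {
        field.column
        for field in SourceLocation._meta.get_fields()
        if hasattr(field, "column")
    }
    missing = RAW_SQL_COLUMNS - model_columns
    assert not missing, f"raw SQL references columns not on the model: {missing}"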

It's a complex enough solution though that I'd prefer to avoid it unless we urgently need it.

bryanculbertson commented 3 years ago

@simonw Ingest implemented content hashes, and that allowed the load to finish successfully in 4.5 hours. That is crucially under the 5-hour deadline!

However, this is still too close to the deadline for my comfort, so I think we need further optimization of importSourceLocations. Right now it takes about 2 minutes to load 500 source locations, which is 240 ms per row. Perhaps using the ORM's bulk_create would help speed this up while still keeping everything in ORM land? I know this works well in SQLAlchemy, but I don't know the Django ORM.
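
For what it's worth, a rough sketch of what bulk_create could look like here. Field names and the import path are guesses at VIAL's schema, and bulk_create on its own only covers inserts; rows that already exist would still need an update path:

from core.models import SourceLocation  # assumed import path, illustrative

def import_batch(validated_locations, import_run):
    """One INSERT per batch of new rows instead of one query per row (sketch only)."""
    objs = [
        SourceLocation(
            source_uid=loc.id,
            source_name=loc.source.source,
            name=loc.name,
            import_json=loc.dict(),
            import_run=import_run,
        )
        for loc in validated_locations
    ]
    # batch_size keeps each generated INSERT statement to a manageable size
    SourceLocation.objects.bulk_create(objs, batch_size=500)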

Another option to increase performance is to skip Pydantic validation of NormalizedLocation and trust that I am handing you a valid location 😬

simonw commented 3 years ago

OK, sounds like we do need to make some improvements here. I'll look for some low-hanging fruit.

simonw commented 3 years ago

Recent /api/importSourceLocations trace: https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/nLWWNjPCtFR/trace/eQxie317P1N

(Screenshot: Honeycomb trace for vial-production)

Looks to me like the main problem here is we're executing hundreds (if not thousands) of tiny 3ms queries.
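
If those per-row lookups are the culprit, one common fix (sketched here with assumed field names) is to fetch all existing rows for a batch in a single query, then branch into create/update paths in Python:

from core.models import SourceLocation  # assumed import path, illustrative

def split_batch(batch):
    """Partition a batch into new vs. existing locations with one SELECT.

    Assumes source_uid is a unique field on SourceLocation and that each
    incoming item's id maps onto it.
    """
    existing = SourceLocation.objects.in_bulk(
        [loc.id for loc in batch], field_name="source_uid"
    )
    to_create = [loc for loc in batch if loc.id not in existing]
    to_update = [(loc, existing[loc.id]) for loc in batch if loc.id in existing]
    return to_create, to_update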

bryanculbertson commented 3 years ago

The "2 minutes for 500 locations" includes encoding and transfer time on the ingestion side. I can look into optimizing that, e.g. by using orjson.

bryanculbertson commented 3 years ago

Ah, I also misinterpreted the logs! This means we are loading 2,500 records (5*500) in about 90 secs, so that is only 35 ms per record. That is much more reasonable.

2021-05-19T17:52:54+0000 INFO:vial.py:Submitted 5 batches of up to 500 records to VIAL.
2021-05-19T17:54:22+0000 INFO:vial.py:Submitted 10 batches of up to 500 records to VIAL.
2021-05-19T17:55:50+0000 INFO:vial.py:Submitted 15 batches of up to 500 records to VIAL.
2021-05-19T17:57:18+0000 INFO:vial.py:Submitted 20 batches of up to 500 records to VIAL.
2021-05-19T17:58:42+0000 INFO:vial.py:Submitted 25 batches of up to 500 records to VIAL.
2021-05-19T18:00:07+0000 INFO:vial.py:Submitted 30 batches of up to 500 records to VIAL.
rhkeeler commented 3 years ago

Is this duplicative of https://github.com/CAVaccineInventory/vial/issues/646 ? Do we need them both open? @simonw