Closed (simonw closed this issue 3 years ago)
Here's what we have so far: https://vial-staging.calltheshots.us/api/searchLocations?size=20&format=v0preview
{
  "usage": {
    "notice": "Please contact Vaccinate The States and let us know if you plan to rely on or publish this data. This data is provided with best-effort accuracy. If you are displaying this data, we expect you to display it responsibly. Please do not display it in a way that is easy to misread.",
    "contact": {
      "partnersEmail": "api@vaccinatethestates.com"
    }
  },
  "content": [
    {
      "id": "recAeW9YJKj2nslIV",
      "name": "V N PHARMACY",
      "state": "CA",
      "latitude": 34.06234,
      "longitude": -118.08878,
      "location_type": "Pharmacy",
      "phone_number": "626-307-2711",
      "vaccines_offered": null,
      "full_address": "8244 E GARVEY AVE STE B, ROSEMEAD, CA 91770",
      "city": null,
      "county": "Los Angeles",
      "zip_code": null,
      "hours": {
        "unstructured": "Monday - Tuesday: 10:00 AM – 6:00 PM\nWednesday: Closed\nThursday - Friday: 10:00 AM – 6:00 PM\nSaturday: 10:00 AM – 2:00 PM\nSunday: Closed"
      },
      "website": null,
      "concordances": [
        "google_places:ChIJlQ2qLAvQwoARBGIUaM8HBGg"
      ],
      "last_verified_by_vts": "2021-04-02T18:28:01+00:00",
      "vts_url": "https://www.vaccinatethestates.com/?lng=-118.08878&lat=34.06234#recAeW9YJKj2nslIV"
    }
  ]
}
The missing feature right now is that vaccines_offered
needs to be populated using the hack from #504. I'm going to try to use an expansions-style mechanism for this; see #557.
Then we need tests for this all, and finally a mechanism for actually pushing it to a GCS bucket.
I implemented streaming upload in #554 last night, and changed the API to take Iterator[str]
so it should be reasonably straightforward to plug those together.
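Plugging those together is conceptually simple. Here's a minimal stand-in sketch, assuming the #554 API boils down to "something that consumes an Iterator[str]"; the function names and the JSON shape here are illustrative, not VIAL's actual code:

```python
import json
from io import StringIO
from typing import Iterable, Iterator, TextIO


def iter_locations_json(locations: Iterable[dict]) -> Iterator[str]:
    """Yield a JSON document chunk by chunk, never building one big string."""
    yield '{"content": ['
    for i, location in enumerate(locations):
        if i:
            yield ","
        yield json.dumps(location)
    yield "]}"


def stream_to_writer(chunks: Iterator[str], writer: TextIO) -> int:
    """Consume an Iterator[str], writing each chunk; returns characters written.
    In VIAL the writer would be the streaming GCS upload from #554."""
    written = 0
    for chunk in chunks:
        writer.write(chunk)
        written += len(chunk)
    return written
```

The export then becomes `stream_to_writer(iter_locations_json(rows), gcs_writer)`, with nothing larger than one serialized location held in memory at a time.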
I implemented the horrible vaccinefinder hack in #557 - need to get the export to GCS working next.
Right now doing POST /api/export
causes the exports to be generated and uploaded for the api.vaccinateca.com
stuff.
I'm concerned that the new export for api.vaccinatethestates.com
will take significantly longer, since it's a larger set of locations. So I'm not going to add it to the /api/export
endpoint.
Instead I'll set up POST /api/exportVaccinateTheStates
to trigger this export.
I think the neatest way to run this may be to have the export endpoint make an internal call to the /api/searchLocations
view under the hood, since that already returns a streaming data structure which can be passed to the new streaming-to-GCS code Alex wrote.
Here's the code I wrote for simulating an internal request to /api/searchLocations?all=1&format=v0preview
and then iterating over and writing the results to storage: https://github.com/CAVaccineInventory/vial/blob/edb24c47739c4ae421003b6b77d071f5c4ac40fa/vaccinate/core/exporter/__init__.py#L43-L59
I had to add mechanisms for skipping API logging and skipping JWT auth to get this to work.
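The skip mechanism is presumably just a flag the logging wrapper checks before recording anything. A hypothetical sketch (the decorator name and log storage are invented; VIAL's real implementation may differ):

```python
import functools
from types import SimpleNamespace

API_LOG = []  # stand-in for the real API request log storage


def log_api_requests(view):
    """Hypothetical decorator: record the request unless an internal
    caller has set request.skip_api_logging = True."""
    @functools.wraps(view)
    def wrapper(request, *args, **kwargs):
        response = view(request, *args, **kwargs)
        if not getattr(request, "skip_api_logging", False):
            API_LOG.append(request.path)
        return response
    return wrapper


@log_api_requests
def search_locations(request):
    # Placeholder view body
    return {"content": []}
```

An internal caller sets `request.skip_api_logging = True` on the fabricated request before invoking the view, so the simulated call never shows up in the API logs.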
Tests there passed in GitHub Actions but failed in Cloud Build with this error:
Step #9 - "test image": ERROR django.request:log.py:224 Internal Server Error: /api/exportVaccinateTheStates
Step #9 - "test image": ------------------------------ Captured log call -------------------------------
Step #9 - "test image":     request = RequestFactory().get("/api/searchLocations?all=1&format=v0preview")
Step #9 - "test image":     request.user = AnonymousUser()
Step #9 - "test image":     request.skip_jwt_auth = True  # type: ignore[attr-defined]
Step #9 - "test image":     request.skip_api_logging = True  # type: ignore[attr-defined]
Step #9 - "test image":     response = search_locations(request)
Step #9 - "test image": >   writer = VTS_DEPLOYS[os.environ.get("DEPLOY", "testing")]
Step #9 - "test image": E   KeyError: 'unknown'
Step #9 - "test image": core/exporter/__init__.py:49: KeyError
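That KeyError is consistent with Cloud Build running with DEPLOY=unknown, a value with no entry in VTS_DEPLOYS. A defensive lookup sidesteps it; the mapping values below are made up for illustration, only the lookup pattern matters:

```python
import os

# Hypothetical deploy-name -> writer target mapping; real values live in VIAL
VTS_DEPLOYS = {
    "testing": "test-bucket",
    "staging": "staging-bucket",
    "production": "production-bucket",
}


def get_writer_target(environ=os.environ):
    """Fall back to the testing target for unrecognized DEPLOY values
    (e.g. Cloud Build's 'unknown') instead of raising KeyError."""
    deploy = environ.get("DEPLOY", "testing")
    return VTS_DEPLOYS.get(deploy, VTS_DEPLOYS["testing"])
```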
Testing it by running this:
curl -XPOST 'https://vial-staging.calltheshots.us/api/exportVaccinateTheStates' -d ''
Took a while (I should have timed it) and returned an ok
status but https://console.cloud.google.com/storage/browser/vaccinateca-api-vial-staging does not show the new file.
I tried doing this too:
curl -XPOST 'https://vial-staging.calltheshots.us/api/export' -d ''
To trigger the current export code. That also churned for a while before returning an ok
response, but likewise it didn't seem to have updated the JSON in those buckets.
Meanwhile running this:
curl -o locations.json \
'https://vial-staging.calltheshots.us/api/searchLocations?all=1&format=v0preview' \
-H 'Authorization: Bearer 24:a1da70...'
Produced a 31MB locations.json
file which appears to have the correct content. I've uploaded that file here if anyone wants to have a look:
https://static.simonwillison.net/static/2021/vial-staging-v0preview-locations.json
I think I can improve performance a bunch by ditching the internal HTTP request trick, because then I can use the version of the expansion which loads all of the vaccine finder data in one go.
https://ui.honeycomb.io/vaccinateca/datasets/vial-staging/result/HQwyAyHDYWw/trace/ffunx8PnUTP shows O(n) queries into source_location
-- is that what you mean?
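That trace pattern (one source_location query per returned location) is exactly what the bulk-load version of the expansion targets: replace per-row lookups with a single fetch keyed by location id. Schematically (function names invented, not VIAL's actual code):

```python
def annotate_slow(locations, fetch_vaccines_for):
    """O(n) queries: one vaccine-data lookup per location, as in the trace."""
    for loc in locations:
        loc["vaccines_offered"] = fetch_vaccines_for(loc["id"])
    return locations


def annotate_fast(locations, fetch_vaccines_bulk):
    """O(1) queries: one bulk fetch up front, then a dict lookup per row."""
    by_id = fetch_vaccines_bulk([loc["id"] for loc in locations])
    for loc in locations:
        loc["vaccines_offered"] = by_id.get(loc["id"])
    return locations
```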
> Took a while (I should have timed it) and returned an ok status but https://console.cloud.google.com/storage/browser/vaccinateca-api-vial-staging does not show the new file.
Ah, confusion -- that was a short-term bucket used to test VIAL output when we were cutting over for airtable-export. The api.vaccinateca.com bucket is https://console.cloud.google.com/storage/browser/vaccinateca-api-staging
I deployed that optimization to production just now.
Here's a trace from before the optimization: https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/AEe2VaC1rDj/trace?trace_id=a23d6760ad921e1a9ae3777af8b183f2 - took 123 seconds and ran probably around 65,000 SQL queries (one per returned location).
After deploying the optimization:
(vial) vial % time curl -XPOST 'https://vial.calltheshots.us/api/exportVaccinateTheStates' -d ''
{"ok": 1}0.01s user 0.01s system 0% cpu 1:06.40 total
65s. Still a lot of queries, because I'm running one per batch of 100 returned rows - and it returns 68,000 rows, so it likely executes 680 queries.
Trace: https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/nT4E1WMMrGE/trace/pXbDF1yRMBq
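The per-batch behavior is keyset pagination: fetch rows WHERE id > last_seen ORDER BY id LIMIT batch_size until the result comes back empty. A minimal in-memory simulation of that loop (data and batch size invented):

```python
from typing import Iterator, List


def keyset_paginate(rows: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Simulate SELECT ... WHERE id > last_id ORDER BY id LIMIT batch_size.
    Each yielded batch corresponds to one SQL query in the real export."""
    rows = sorted(rows, key=lambda r: r["id"])
    last_id = None
    while True:
        batch = [r for r in rows if last_id is None or r["id"] > last_id][:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]
```

At batch_size=100 a 68,000-row export issues roughly 680 such queries; at 1,000 it would be roughly 68.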
~ % curl 'https://api.vaccinatethestates.com/v0/locations.json' | jq '.content | length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.7M    0 44.7M    0     0  27.0M      0 --:--:--  0:00:01 --:--:-- 27.0M
68327
~ %
I'm tempted to bump the default batch size up to 1000 here: https://github.com/CAVaccineInventory/vial/blob/7386fc863311582dedb4b501543f2f162b84ec10/vaccinate/api/serialize.py#L27-L29
Still took 63s - from this trace https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/wesawhore23/trace/4mmgAJNkwEE it's clear that it's actually the prefetch_related()
calls from the keyset pagination that are the bottleneck now. Those run once per 500 returned rows, so I think that's fine. I'm going to call this "good enough" and close the ticket.
The new published API is at https://api.vaccinatethestates.com/v0/locations.json
Work on this started in #392, but that's about a standard JSON representation; this issue is specifically about building our first v0 export of all locations (not just CA ones), which is on a tight deadline.
Originally posted by @simonw in https://github.com/CAVaccineInventory/vial/issues/392#issuecomment-839958188