Closed (simonw closed this issue 3 years ago)
Here's what we have so far: https://vial-staging.calltheshots.us/api/searchLocations?size=20&format=v0preview
{
  "usage": {
    "notice": "Please contact Vaccinate The States and let us know if you plan to rely on or publish this data. This data is provided with best-effort accuracy. If you are displaying this data, we expect you to display it responsibly. Please do not display it in a way that is easy to misread.",
    "contact": {
      "partnersEmail": "api@vaccinatethestates.com"
    }
  },
  "content": [
    {
      "id": "recAeW9YJKj2nslIV",
      "name": "V N PHARMACY",
      "state": "CA",
      "latitude": 34.06234,
      "longitude": -118.08878,
      "location_type": "Pharmacy",
      "phone_number": "626-307-2711",
      "vaccines_offered": null,
      "full_address": "8244 E GARVEY AVE STE B, ROSEMEAD, CA 91770",
      "city": null,
      "county": "Los Angeles",
      "zip_code": null,
      "hours": {
        "unstructured": "Monday - Tuesday: 10:00 AM – 6:00 PM\nWednesday: Closed\nThursday - Friday: 10:00 AM – 6:00 PM\nSaturday: 10:00 AM – 2:00 PM\nSunday: Closed"
      },
      "website": null,
      "concordances": [
        "google_places:ChIJlQ2qLAvQwoARBGIUaM8HBGg"
      ],
      "last_verified_by_vts": "2021-04-02T18:28:01+00:00",
      "vts_url": "https://www.vaccinatethestates.com/?lng=-118.08878&lat=34.06234#recAeW9YJKj2nslIV"
    }
  ]
}
The missing feature right now is that vaccines_offered
needs to be populated using the hack from #504. I'm going to try to use an expansions-style mechanism for this; see #557.
Then we need tests for this all, and finally a mechanism for actually pushing it to a GCS bucket.
I implemented streaming upload in #554 last night, and changed the API to take Iterator[str]
so it should be reasonably straightforward to plug those together.
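Plugging those together is conceptually simple. Here's a minimal stand-in sketch, assuming the #554 API boils down to "something that consumes an Iterator[str]"; the function names and the JSON shape here are illustrative, not VIAL's actual code:

```python
import json
from io import StringIO
from typing import Iterable, Iterator, TextIO


def iter_locations_json(locations: Iterable[dict]) -> Iterator[str]:
    """Yield a JSON document chunk by chunk, never building one big string."""
    yield '{"content": ['
    for i, location in enumerate(locations):
        if i:
            yield ","
        yield json.dumps(location)
    yield "]}"


def stream_to_writer(chunks: Iterator[str], writer: TextIO) -> int:
    """Consume an Iterator[str], writing each chunk; returns characters written.
    In VIAL the writer would be the streaming GCS upload from #554."""
    written = 0
    for chunk in chunks:
        writer.write(chunk)
        written += len(chunk)
    return written
```

The export then becomes `stream_to_writer(iter_locations_json(rows), gcs_writer)`, with nothing larger than one serialized location held in memory at a time.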
I implemented the horrible vaccinefinder hack in #557 - need to get the export to GCS working next.
Right now doing POST /api/export
causes the exports to be generated and uploaded for the api.vaccinateca.com
stuff.
I'm concerned that the new export for api.vaccinatethestates.com
will take significantly longer, since it's a larger set of locations. So I'm not going to add it to the /api/export
endpoint.
Instead I'll set up POST /api/exportVaccinateTheStates
to trigger this export.
I think the neatest way to run this may be to have the export endpoint make an internal call to the /api/searchLocations
view under the hood, since that already returns a streaming data structure which can be passed to the new streaming-to-GCS code Alex wrote.
Here's the code I wrote for simulating an internal request to /api/searchLocations?all=1&format=v0preview
and then iterating over and writing the results to storage: https://github.com/CAVaccineInventory/vial/blob/edb24c47739c4ae421003b6b77d071f5c4ac40fa/vaccinate/core/exporter/__init__.py#L43-L59
I had to add mechanisms for skipping API logging and skipping JWT auth to get this to work.
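The skip mechanism is presumably just a flag the logging wrapper checks before recording anything. A hypothetical sketch (the decorator name and log storage are invented; VIAL's real implementation may differ):

```python
import functools
from types import SimpleNamespace

API_LOG = []  # stand-in for the real API request log storage


def log_api_requests(view):
    """Hypothetical decorator: record the request unless an internal
    caller has set request.skip_api_logging = True."""
    @functools.wraps(view)
    def wrapper(request, *args, **kwargs):
        response = view(request, *args, **kwargs)
        if not getattr(request, "skip_api_logging", False):
            API_LOG.append(request.path)
        return response
    return wrapper


@log_api_requests
def search_locations(request):
    # Placeholder view body
    return {"content": []}
```

An internal caller sets `request.skip_api_logging = True` on the fabricated request before invoking the view, so the simulated call never shows up in the API logs.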
Tests there passed in GitHub Actions but failed in Cloud Build with this error:
Step #9 - "test image": ERROR django.request:log.py:224 Internal Server Error: /api/exportVaccinateTheStates
Step #9 - "test image": ------------------------------ Captured log call -------------------------------
Step #9 - "test image":     request = RequestFactory().get("/api/searchLocations?all=1&format=v0preview")
Step #9 - "test image":     request.user = AnonymousUser()
Step #9 - "test image":     request.skip_jwt_auth = True  # type: ignore[attr-defined]
Step #9 - "test image":     request.skip_api_logging = True  # type: ignore[attr-defined]
Step #9 - "test image":     response = search_locations(request)
Step #9 - "test image": >   writer = VTS_DEPLOYS[os.environ.get("DEPLOY", "testing")]
Step #9 - "test image": E   KeyError: 'unknown'
Step #9 - "test image": core/exporter/__init__.py:49: KeyError
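That KeyError is consistent with Cloud Build running with DEPLOY=unknown, a value with no entry in VTS_DEPLOYS. A defensive lookup sidesteps it; the mapping values below are made up for illustration, only the lookup pattern matters:

```python
import os

# Hypothetical deploy-name -> writer target mapping; real values live in VIAL
VTS_DEPLOYS = {
    "testing": "test-bucket",
    "staging": "staging-bucket",
    "production": "production-bucket",
}


def get_writer_target(environ=os.environ):
    """Fall back to the testing target for unrecognized DEPLOY values
    (e.g. Cloud Build's 'unknown') instead of raising KeyError."""
    deploy = environ.get("DEPLOY", "testing")
    return VTS_DEPLOYS.get(deploy, VTS_DEPLOYS["testing"])
```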
Testing it by running this:
curl -XPOST 'https://vial-staging.calltheshots.us/api/exportVaccinateTheStates' -d ''
Took a while (I should have timed it) and returned an ok
status but https://console.cloud.google.com/storage/browser/vaccinateca-api-vial-staging does not show the new file.
I tried doing this too:
curl -XPOST 'https://vial-staging.calltheshots.us/api/export' -d ''
To trigger the current export code. That also churned for a while before returning an ok
response, but likewise it didn't seem to have updated the JSON in those buckets.
Meanwhile running this:
curl -o locations.json \
'https://vial-staging.calltheshots.us/api/searchLocations?all=1&format=v0preview' \
-H 'Authorization: Bearer 24:a1da70...'
Produced a 31MB locations.json
file which appears to have the correct content. I've uploaded that file here if anyone wants to have a look:
https://static.simonwillison.net/static/2021/vial-staging-v0preview-locations.json
I think I can improve performance a bunch by ditching the internal HTTP request trick, because then I can use the version of the expansion which loads all of the vaccine finder data in one go.
https://ui.honeycomb.io/vaccinateca/datasets/vial-staging/result/HQwyAyHDYWw/trace/ffunx8PnUTP shows O(n) queries into source_location
-- is that what you mean?
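That trace pattern (one source_location query per returned location) is exactly what the bulk-load version of the expansion targets: replace per-row lookups with a single fetch keyed by location id. Schematically (function names invented, not VIAL's actual code):

```python
def annotate_slow(locations, fetch_vaccines_for):
    """O(n) queries: one vaccine-data lookup per location, as in the trace."""
    for loc in locations:
        loc["vaccines_offered"] = fetch_vaccines_for(loc["id"])
    return locations


def annotate_fast(locations, fetch_vaccines_bulk):
    """O(1) queries: one bulk fetch up front, then a dict lookup per row."""
    by_id = fetch_vaccines_bulk([loc["id"] for loc in locations])
    for loc in locations:
        loc["vaccines_offered"] = by_id.get(loc["id"])
    return locations
```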
> Took a while (I should have timed it) and returned an ok status but https://console.cloud.google.com/storage/browser/vaccinateca-api-vial-staging does not show the new file.
Ah, confusion -- that was a short-term bucket used to test VIAL output when we were cutting over for airtable-export. The api.vaccinateca.com bucket is https://console.cloud.google.com/storage/browser/vaccinateca-api-staging
I deployed that optimization to production just now.
Here's a trace from before the optimization: https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/AEe2VaC1rDj/trace?trace_id=a23d6760ad921e1a9ae3777af8b183f2 - took 123 seconds and ran probably around 65,000 SQL queries (one per returned location).
After deploying the optimization:
(vial) vial % time curl -XPOST 'https://vial.calltheshots.us/api/exportVaccinateTheStates' -d ''
{"ok": 1}0.01s user 0.01s system 0% cpu 1:06.40 total
65s. Still a lot of queries, because I'm running one per batch of 100 returned rows - and it returns 68,000 rows, so it likely executes 680 queries.
Trace: https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/nT4E1WMMrGE/trace/pXbDF1yRMBq
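The per-batch behavior is keyset pagination: fetch rows WHERE id > last_seen ORDER BY id LIMIT batch_size until the result comes back empty. A minimal in-memory simulation of that loop (data and batch size invented):

```python
from typing import Iterator, List


def keyset_paginate(rows: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Simulate SELECT ... WHERE id > last_id ORDER BY id LIMIT batch_size.
    Each yielded batch corresponds to one SQL query in the real export."""
    rows = sorted(rows, key=lambda r: r["id"])
    last_id = None
    while True:
        batch = [r for r in rows if last_id is None or r["id"] > last_id][:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]
```

At batch_size=100 a 68,000-row export issues roughly 680 such queries; at 1,000 it would be roughly 68.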
~ % curl 'https://api.vaccinatethestates.com/v0/locations.json' | jq '.content | length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.7M    0 44.7M    0     0  27.0M      0 --:--:--  0:00:01 --:--:-- 27.0M
68327
~ %
I'm tempted to bump the default batch size up to 1000 here: https://github.com/CAVaccineInventory/vial/blob/7386fc863311582dedb4b501543f2f162b84ec10/vaccinate/api/serialize.py#L27-L29
Still took 63s - from this trace https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/wesawhore23/trace/4mmgAJNkwEE it's clear that it's actually the prefetch_related()
calls from the keyset pagination that are the bottleneck now. Those run once per 500 returned rows, so I think that's fine. I'm going to call this "good enough" and close the ticket.
The new published API is at https://api.vaccinatethestates.com/v0/locations.json
Work on this started in #392, but that's about a standard JSON representation; this issue is specifically about building our first v0 export of all locations (not just CA ones), which is on a tight deadline.
Originally posted by @simonw in https://github.com/CAVaccineInventory/vial/issues/392#issuecomment-839958188