CAVaccineInventory / vial

The Django application powering calltheshots.us
https://vial.calltheshots.us
MIT License
13 stars 1 forks source link

Copy, convert and load in example data from Airtable #9

Closed simonw closed 3 years ago

simonw commented 3 years ago

I want to get all of the Locations and Reports from Airtable into the Django prototype - once I've tightened security, see #8

simonw commented 3 years ago

I'm going to try loading this directly from our airtable backup repo, to avoid having to pagínate through airtable - I can ingest a full JSON file from our backups in one go.

simonw commented 3 years ago

Some of JSON files are in the order of 30+MB.

I'm going to try writing Django management commands that can be given a URL and authentication credentials and can then stream the into the database.

I can use ijson to do this efficiently, see note here: https://github.com/simonw/sqlite-utils/issues/173#issuecomment-714758139

simonw commented 3 years ago

One of these tricks for downloading a file from a private repo should hopefully work: https://stackoverflow.com/questions/18126559/how-can-i-download-a-single-raw-file-from-a-private-github-repo-using-the-comman

simonw commented 3 years ago

I tried this:

r = httpx.get(
    "https://api.github.com/repos/CAVaccineInventory/airtable-data-backup/contents/backups/Locations.json",
    headers={
        "Authorization": "token {}".format(token),
        "Accept": "application/vnd.github.v4.raw"
    }
)

And got back an error: {'message': 'This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size.', 'errors': [{'resource': 'Blob', 'field': 'data', 'code': 'too_large'}], 'documentation_url': 'https://docs.github.com/rest/reference/repos#get-repository-content'}

https://docs.github.com/rest/reference/repos#get-repository-content

simonw commented 3 years ago

Looks like I've run into this problem before! https://github.com/simonw/irma-scrapers/issues/1

simonw commented 3 years ago

... and I wrote a Python library that solves it. Turns out it was hard-coded to master as the branch name though, so I just pushed a new version that defaults to main: https://github.com/simonw/github-contents/releases/0.2

simonw commented 3 years ago
from github_contents import GithubContents
github = GithubContents(
    "CAVaccineInventory",
    "airtable-data-backup",
    token
)
content, sha = github.read_large("backups/Locations.json")
len(content)
# Outputs 33023248
simonw commented 3 years ago

I'll build a separate management command for each import, with names like this:

./manage.py import_airtable_locations --github-token=...
./manage.py import_airtable_reports --github-token=...
simonw commented 3 years ago

They'll also take an option called --json-file=path-to-file so I can iterate on them more quickly by downloading the file to my laptop (instead of pulling over HTTPS each time).

simonw commented 3 years ago

For locations I'll start with just the required fields:

simonw commented 3 years ago

Importing location data is a great way to spot data errors! Lots of records are missing names, or counties, or latitudes and longitudes.

simonw commented 3 years ago

Here's the output of a successful import:

Skipping rec0xZ5EaKnnynfDa [name=None], reason=No name
Skipping rec7nHXCuSYRR61V0 [name=None], reason=No name
Skipping rec8Xk6kn4SvAKeEm [name=None], reason=No name
Skipping recCYZZRJCRlXykun [name=None], reason=No name
Skipping recJt0iQbqmglF0XL [name=Dignity Health (Woodland)], reason=No county
Skipping recM2vjVCasanHStO [name=None], reason=No name
Skipping recMFogi5BI5krqKK [name=Walgreens Pharmacy #4559], reason=No county
Skipping recMSTzVRFGEuxS6M [name=None], reason=No name
Skipping recTzWO20IAjnrcdS [name=PAMF Mountain View Center (Palo Alto Medical Foundation - Sutter)], reason=No county
Skipping recWqQc3bG7PiKQJi [name=Sharp Coronado Hospital], reason=No latitude
Skipping reccaeiuCMw6a7Dvg [name=None], reason=No name
Skipping recchGT7ticPxfr42 [name=None], reason=No name
Skipping recgudgmJck0ucv3X [name=All-inclusive], reason=No county
Skipping reck8nN8LE7qUu3WY [name=None], reason=No name
Skipping recwHJmOChMc6TeA8 [name=Osage inglewood], reason=No county
Skipping recy40U4HXu6hSmQ3 [name=South Medical], reason=No county
Skipping reczTSHfKgH9MyS4m [name=5148 Market St], reason=No county
simonw commented 3 years ago

I ran the import script against Heroku:

% heroku run bash -a vaccinateca-preview
Running bash on ⬢ vaccinateca-preview... up, run.3026 (Hobby)
~ $ vaccinate/manage.py import_airtable_locations --github-token xxx
Skipping rec0xZ5EaKnnynfDa [name=CVS Pharmacy® & Drug Store at 25272 Marguerite Pkwy, Mission Viejo, CA 92692], reason=No latitude
Skipping rec7nHXCuSYRR61V0 [name=None], reason=No name
Skipping rec8Xk6kn4SvAKeEm [name=None], reason=No name
Skipping recCYZZRJCRlXykun [name=None], reason=No name
Skipping recJt0iQbqmglF0XL [name=Dignity Health (Woodland)], reason=No county
Skipping recM2vjVCasanHStO [name=None], reason=No name
Skipping recMFogi5BI5krqKK [name=Walgreens Pharmacy #4559], reason=No county
Skipping recMSTzVRFGEuxS6M [name=CVS Pharmacy® & Drug Store at 638 Camino De Los Mares, San Clemente, CA 92673], reason=No latitude
Skipping recTzWO20IAjnrcdS [name=PAMF Mountain View Center (Palo Alto Medical Foundation - Sutter)], reason=No county
Skipping recWqQc3bG7PiKQJi [name=Sharp Coronado Hospital], reason=No latitude
Skipping reccaeiuCMw6a7Dvg [name=None], reason=No name
Skipping recchGT7ticPxfr42 [name=None], reason=No name
Skipping recgudgmJck0ucv3X [name=All-inclusive], reason=No county
Skipping reck8nN8LE7qUu3WY [name=None], reason=No name
Skipping recwHJmOChMc6TeA8 [name=Osage inglewood], reason=No county
Skipping recy40U4HXu6hSmQ3 [name=South Medical], reason=No county
Skipping reczTSHfKgH9MyS4m [name=5148 Market St], reason=No county

https://vaccinateca-preview.herokuapp.com/admin/core/location/ now lists 7038 locations!

simonw commented 3 years ago
Banners_and_Alerts_and_Select_location_to_change___Django_site_admin
simonw commented 3 years ago

Next step: import calls, then build a custom staff-only page that shows the history of calls for each location.

simonw commented 3 years ago

The location data isn't populating the city field at the moment - it may be able to do that by running https://github.com/datamade/usaddress against the address.

simonw commented 3 years ago

Importing calls is a lot harder than locations.

There are some schema differences that I don't understand, see comments in this block of code: https://github.com/CAVaccineInventory/django.vaccinate/blob/12514c713686e640036289c0411a59af5df9d4ed/vaccinate/core/import_utils.py#L74-L117

Running it also reveals a LOT of records that are missing fields, or that have unexpected values. I imagine this is because of schema evolution over time - records created earlier in the project will be missing values for fields that were added later on.

We need to decide what to do about these. The new database schema has required fields that may not be present in older data.

simonw commented 3 years ago

Running what I've got so far against Heroku:

heroku run bash -a vaccinateca-preview
Running bash on ⬢ vaccinateca-preview... up, run.8449 (Hobby)
~ $ vaccinate/manage.py import_airtable_reports --github-token=...
Skipping rec02F871ba5FXkyJ, reason=Invalid tag: Vaccinating essential workers
Skipping rec0Ex1VKAO92h80z, reason=Invalid tag: Vaccinating essential workers
Skipping rec0Iubkfi8Y2G4zu, reason=Invalid tag: No: may be a vaccination site in the future
Skipping rec0Ten9dpIF83YHJ, reason=Invalid tag: Scheduling second dose only
Skipping rec0VyB5IDsXBgRjB, reason=Invalid tag: No: may be a vaccination site in the future
Skipping rec0WK4DO9bMX3s2N, reason=Invalid tag: Yes: restricted to city residents
...
simonw commented 3 years ago

That successfully imported 21,029 call records. It skipped 962, listed in this gist: https://gist.github.com/simonw/75490c6bf71161006520909f06639567

simonw commented 3 years ago

Here are those reasons for not importing counted:

% cat /tmp/reasons.txt | uniq -c
 550 Invalid tag: No: may be a vaccination site in the future
  59 Invalid tag: Scheduling second dose only
 222 Invalid tag: Vaccinating essential workers
   8 Invalid tag: Yes: restricted to city residents
  61 Missing Availability
  59 No Location key in JSON object at all
   1 No location record for location ID=['rec0xZ5EaKnnynfDa']
   1 No location record for location ID=['rec4HjzAYNRjP9CYU']
   1 No location record for location ID=['recMSTzVRFGEuxS6M']%                                                                                                                         
simonw commented 3 years ago

Should "No: may be a vaccination site in the future" be mapped to "Yes: coming soon" (described as "This location is currently closed but is slated to open soon. Typically Latest report notes will have more information in this case.")?

We don't have an availability tag that matches "Scheduling second dose only" at all

Should "Vaccinating essential workers" map to "No: only vaccinating health care workers"?

Should "Yes: restricted to city residents" be mapped to "Yes: restricted to county residents" ?

nschiefer commented 3 years ago

Looking over the import code I think we need to think carefully about how to match the Airtable schema to the new one. It is... non-trivial :-)

A few specific things:

eliblock commented 3 years ago

Many of our reports are not "reports" at all

Luckily we run a function within AirTable to identify the "type" of each report! All reports should have a column called something like "Report Type" which you can use (I think we want Volunteer for the first round of this effort)

simonw commented 3 years ago

I'm going to split out continued work on the call importer to a separate ticket.

simonw commented 3 years ago

Closing this since work is now in #15