CAVaccineInventory / vial

The Django application powering calltheshots.us
https://vial.calltheshots.us
MIT License
13 stars 1 forks source link

Backfill concordances from source locations to matched locations #654

Closed simonw closed 3 years ago

simonw commented 3 years ago

Following https://github.com/CAVaccineInventory/vial/issues/559#issuecomment-858190408_

/api/updateSoureLocationMatch didn't used to copy across concordances from the source location to the location. It does now, as-of #559 - but we need to backfill all of the missing ones.

simonw commented 3 years ago

Need a DB query that shows all concordances from all source locations where the matched location is missing some of those concordances.

simonw commented 3 years ago

https://vial.calltheshots.us/dashboard/?sql=with+from_source_locations+as+%28%0D%0A++select%0D%0A++++concordanceidentifier_id%2C%0D%0A++++source_location.matched_location_id+as+location_id%0D%0A++from%0D%0A++++concordance_source_location%0D%0A++++join+source_location+on+source_location.id+%3D+concordance_source_location.sourcelocation_id%0D%0A++where%0D%0A++++matched_location_id+is+not+null%0D%0A%29%2C%0D%0Aon_locations+as+%28%0D%0A++select%0D%0A++++concordanceidentifier_id%2C%0D%0A++++location_id%0D%0A++from%0D%0A++++concordance_location%0D%0A%29%2C%0D%0Amissing+as+%28%0D%0A++select%0D%0A++++%2A%0D%0A++from%0D%0A++++from_source_locations%0D%0A++except%0D%0A++select%0D%0A++++%2A%0D%0A++from%0D%0A++++on_locations%0D%0A%29%0D%0Aselect%0D%0A++count%28%2A%29%0D%0Afrom%0D%0A++missing%3AJiFSCLcan1QQS43c9koz5p4924SeuPqfoxmYe57tPLQ&sql=with+on_locations+as+%28select+concordanceidentifier_id%2C+location_id%0D%0Afrom+concordance_location%29%0D%0Aselect+%2A+from+on_locations+limit+10%3AzbwI04nh2pY1UWNrp9Xm0oU4xV7-ylVg6tkh0KtS_Js

with from_source_locations as (
  select
    concordanceidentifier_id,
    source_location.matched_location_id as location_id
  from
    concordance_source_location
    join source_location on source_location.id = concordance_source_location.sourcelocation_id
  where
    matched_location_id is not null
),
on_locations as (
  select
    concordanceidentifier_id,
    location_id
  from
    concordance_location
),
missing as (
  select
    *
  from
    from_source_locations
  except
  select
    *
  from
    on_locations
)
select
  count(*)
from
  missing

Returns 148,002

simonw commented 3 years ago

But we should exclude some of the providers - in particular ct_gov: covidvaccinefinder_gov: ct_covidvaccinefinder_gov.

simonw commented 3 years ago

We have 144 authorities at the moment: https://vial.calltheshots.us/dashboard/?sql=select%20%22authority%22%2C%20count%28%2A%29%20as%20n%20from%20%28select%20id%2C%20authority%2C%20identifier%2C%20created_at%20from%20concordance_identifier%29%20as%20results%20group%20by%20%22authority%22%20order%20by%20n%20desc%3AydXh_yEhn8L7UhP0pZuK5zTvFVCUmDm6ilDzvbORH7I

simonw commented 3 years ago

I'm using this query to figure out what to backfill:

with on_source_locations as (
  select
    authority,
    identifier,
    concordanceidentifier_id,
    location.public_id as location_public_id
  from
    concordance_source_location
    join source_location on source_location.id = concordance_source_location.sourcelocation_id
    join concordance_identifier on concordance_identifier.id = concordanceidentifier_id
    join location on source_location.matched_location_id = location.id
  where
    matched_location_id is not null
),
on_locations as (
  select
    authority,
    identifier,
    concordanceidentifier_id,
    location.public_id as location_public_id
  from
    concordance_location
    join concordance_identifier on concordance_identifier.id = concordanceidentifier_id 
    join location on concordance_location.location_id = location.id
),
missing as (
  select
    *
  from
    on_source_locations
  except
  select
    *
  from
    on_locations
)
select * from missing

It takes 10s to run so I ran it against production PostgreSQL directly.

It produced 148,029 rows - here: missing-location-concordances.csv

simonw commented 3 years ago

Here are the number of concordances I will be adding by authority:

authority n
placekey 62015
us_carbon_health 13510
us_giscorps_vaccine_providers 10194
_tag_provider 9455
vaccinespotter_org 7865
cvs 7014
il_sfsites 6981
al_arcgis 6791
vaccinefinder_org 4875
tx_arcgis 2235
getmyvax_org 1701
kroger 1327
md_arcgis 1185
rite_aid 1094
wa_state 1022
in_arcgis 756
fl_state 673
vaxfinder_gov 658
nc_myspot_gov 607
pa_arcgis 581
mo_arcgis 566
nyc_arcgis 503
ca_los_angeles_gov 436
vaccinate_nj 420
wi_arcgis_map 409
az_arcgis 409
la_tableau 408
sc_arcgis 382
heb 269
mn_gov 258
health_mart 223
ky_govstatus 217
ok_vaccinate_gov 183
vtrcks 172
ia_state 155
ct_gov 153
ct_covidvaccinefinder_gov 153
walmart 146
wa_prepmod 137
il_juvare 115
prepmod 97
tn_vaccinate_gov 92
costco 92
walgreens 87
smiths 81
little_clinic 78
fred_meyer 74
kaiser_permanente 71
albertson 66
ak_arcgis 64
ri_arcgis 63
ga_dph 63
ak_clinic_list 60
safeway 57
wyo_appt_portal 49
shop_rite 47
ma_immunizations 44
sav_on 40
publix 38
co_colorado_gov 35
king_soopers 33
genoa_healthcare 29
immunizenevada_org 28
tom_thumb 22
randalls 22
al_jefferson 22
az_pinal_ph_vaccinelocations_gov 20
pick_n_save 19
ca_metrolink 19
ct_state 18
vons 14
qfc 14
ny_northwell_health 14
sf_gov 13
jewel_osco 13
weis 12
thrifty 12
kaiser_health_plan 10
ilvaccine_org 10
harris_teeter 10
sams 9
raleys 9
pavilions 8
price_chopper 6
hannaford 6
dc_district 6
acme 6
united_supermarket 5
hyvee 5
giant_food 5
giant_eagle 5
duane_reade 5
winn_dixie 4
market_street 4
us_physicians_immediate 3
thrifty_white 3
shaws 3
hart 3
giant 3
frys_food_and_drug 3
albertsons_market 3
pharmaca 2
dillons 2
az_ph_pinal_clinics_gov 2
amigos 2
wegmans 1
tx_memorialhermann 1
star_market 1
ralphs 1
pay_less 1
market_32 1
marianos 1
kta_super_stores 1
hartig 1
haggen 1
fresco_y_mas 1
carrs 1
simonw commented 3 years ago

Here's a tool for spot-checking that the IDs I am going to add look good: https://concordance-identifier-to-add.vercel.app/concordance-identifier-to-add?sql=select%0D%0A++authority%2C%0D%0A++identifier%2C%0D%0A++%27https%3A%2F%2Fvial.calltheshots.us%2Flocation%2F%27+%7C%7C+location_public_id+as+url%0D%0Afrom%0D%0A++concordances%0D%0Aorder+by%0D%0A++location_public_id

simonw commented 3 years ago

Created a datasette-placekey plugin to make it easier to visualize these: https://concordance-identifier-to-add.vercel.app/concordance-identifier-to-add?sql=select%0D%0A++authority%2C%0D%0A++identifier%2C%0D%0A++%27https%3A%2F%2Fvial.calltheshots.us%2Flocation%2F%27+||+location_public_id+as+url%2C%0D%0A++placekey_to_geo_latitude(identifier)+as+latitude%2C%0D%0A++placekey_to_geo_longitude(identifier)+as+longitude%0D%0Afrom%0D%0A++concordances%0D%0Awhere+authority+%3D+%27placekey%27%0D%0Aorder+by%0D%0A++location_public_id%0D%0Alimit+100

simonw commented 3 years ago

They look good to me. I'm going to import all 148,029 using a Python script. If this turns out to be a mistake we can revert it - we have the above record of what we added, plus we have the API logs.

simonw commented 3 years ago

Here's the script I'm using:

import httpx
api_key = "25:..."

def fetch_all():
    next_url = "https://concordance-identifier-to-add.vercel.app/concordance-identifier-to-add/concordances.json?_size=max"
    while next_url:
        data = httpx.get(next_url).json()
        yield from (dict(zip(data["columns"], row)) for row in data["rows"])
        next_url = data.get("next_url")

def batch(items, n):      
    for i in range(0, len(items), n): 
        yield items[i:i + n]

rows = list(fetch_all())

batches = list(batch(rows, 200))

def update_batch(batch):
     return httpx.post("https://vial.calltheshots.us/api/updateLocationConcordances", json={
        "update": {
            r["location_public_id"]: {
                "add": [
                    r["authority"] + ":" + r["identifier"]
                ]
            }
            for r in batch
        }
    }, headers={"Authorization": "Bearer {}".format(api_key)}, timeout=20)

from tqdm.notebook import tqdm

for b in tqdm(batches[1:]):
    update_batch(b)

Running now. Should take about 40 minutes.

simonw commented 3 years ago

It's now 36% done and a count at https://vial.calltheshots.us/dashboard/?sql=select+count(*)+from+concordance_location%3AfcdtXiCoqfYQ3DSXzbOg2PsnRwZBkK81GGSH12taqac shows 341,685 records in the concordance_location table.

simonw commented 3 years ago

https://vial.calltheshots.us/dashboard/?sql=select+%2A+from+api_log+where+api_key_id+%3D+25%3ABTpfamvr05SAenrmJjoab-XhrO0tNmCXssBmPiE6qqg shows the API calls made by the dedicated API key I set up for this job.

simonw commented 3 years ago

Backfill finished: 434,651 in concordance_location.

simonw commented 3 years ago

This query shows concordance identifiers attached to locations (not source locations) counted by authority: https://vial.calltheshots.us/dashboard/?sql=select%20%22authority%22%2C%20count%28%2A%29%20as%20n%20from%20%28select%20concordance_identifier.%2A%20from%20concordance_identifier%20join%20concordance_location%20on%20concordance_location.concordanceidentifier_id%20%3D%20concordance_identifier.id%29%20as%20results%20group%20by%20%22authority%22%20order%20by%20n%20desc%3ArQvLweAl_9hf5Jok0xi1LtCktPq9uGrZX2Wh_qzitkE

authority   n
placekey    88848
vaccinefinder_org   55574
us_carbon_health    46043
_tag_provider   42468
vaccinespotter_org  38908
getmyvax_org    27731
us_giscorps_vaccine_providers   22349
cvs 17332
walgreens   9061
google_places   8990
il_sfsites  7844
al_arcgis   7121
walmart 4774
health_mart 4285
tx_arcgis   4277
vtrcks  3685
kroger  3245
rite_aid    3132
wa_state    2488
fl_state    2313
sc_arcgis   2226
publix  2183
nc_myspot_gov   2025
md_arcgis   1711
safeway 1678
mo_arcgis   1656
wi_arcgis_map   1209
az_arcgis   1161
arcgis  1135
in_arcgis   1110
vaccinate_nj    1085
pa_arcgis   926
mn_gov  729
vaxfinder_gov   707
nyc_arcgis  702
la_tableau  690
ca_los_angeles_gov  626
costco  603
sams    582
ok_vaccinate_gov    545
ky_govstatus    541
ia_state    493
hyvee   474
ct_gov  423
ct_covidvaccinefinder_gov   409
tn_vaccinate_gov    386
winn_dixie  377
little_clinic   367
sav_on  333
heb 316
osco    288
vaccinefinder   264
meijer  256
stop_and_shop   253
harris_teeter   229
vons    226
prepmod 220
giant_eagle 213
il_juvare   186
weis    185
smiths  182
ri_arcgis   163
immunizenevada_org  162
hannaford   160
giant_food  154
ga_dph  153
shop_rite   150
king_soopers    147
fred_meyer  146
ak_clinic_list  144
hart    137
ct  129
albertson   126
wa_prepmod  123
frys_food_and_drug  123
tom_thumb   118
giant   118
sf_gov  114
kaiser_permanente   114
food_city   108
ma_immunizations    97
ak_arcgis   91
thrifty 89
ingles  85
price_chopper   84
wegmans 84
co_colorado_gov 81
cub_pharmacy    80
ralphs  72
acme    68
ct_state    62
covidvaccinefinder_gov  60
pick_n_save 59
brookshire  58
dillons 54
genoa_healthcare    53
marianos    41
raleys  41
united_supermarket  36
market_street   35
qfc 34
al_jefferson    33
randalls    32
food_lion   32
pavilions   31
ny_northwell_health 30
duane_reade 26
haggen  26
harps   24
ilvaccine_org   24
az_pinal_ph_vaccinelocations_gov    23
city_market 22
us_physicians_immediate 22
homeland    21
pharmaca    21
hartig  21
ca_metrolink    21
kaiser_health_plan  18
jewel_osco  18
harmons 18
medicap 18
thrifty_white   17
harveys 14
metro_market    10
dc_district 9
albertsons_market   9
bakers  9
wyo_appt_portal 8
carrs   8
pay_less    7
big_y   6
az_ph_pinal_clinics_gov 6
gerbes  5
shaws   5
tx_memorialhermann  4
fresco_y_mas    4
star_market 4
market_32   3
jayc    3
amigos  3
drugco  2
kta_super_stores    1