simonw opened this issue 3 years ago
The fields in question, added in #494: https://github.com/CAVaccineInventory/vial/blob/14c9098af1ad6b74aefdce84dda3e5a7719e070b/vaccinate/core/models.py#L241-L252
These were editable in the VIAL interface for a while - they are no longer editable (or at least they are hidden by default). The number of locations with these fields populated is:
| label | count |
| --- | --- |
| vaccines_offered | 109 |
| accepts_walkins | 133 |
| accepts_appointments | 158 |
| public_notes | 45 |
This is from the short period of time when these were editable. I'm going to export that data and otherwise pretend it didn't exist.
select
id, public_id, name, full_address,
vaccines_offered, accepts_walkins, accepts_appointments, public_notes
from
location
where (
vaccines_offered is not null
or
accepts_appointments is not null
or
accepts_walkins is not null
or
(public_notes is not null and public_notes != '')
)
Exported data is here: https://gist.github.com/simonw/d7644d1f444bc4221b3b284f73468360
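For the export itself, a rough ORM equivalent of that query looks like this - a sketch only, assuming the fields shown in the model snippet above and that the models live in core.models:

import json

from django.db.models import Q

from core.models import Location  # assumed import path

# Keep a location if any of the four old editable fields is populated,
# mirroring the SQL above.
rows = Location.objects.filter(
    Q(vaccines_offered__isnull=False)
    | Q(accepts_walkins__isnull=False)
    | Q(accepts_appointments__isnull=False)
    | (Q(public_notes__isnull=False) & ~Q(public_notes=""))
).values(
    "id", "public_id", "name", "full_address",
    "vaccines_offered", "accepts_walkins", "accepts_appointments", "public_notes",
)

with open("old-editable-fields.json", "w") as fp:
    json.dump(list(rows), fp, indent=2, default=str)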
There are two steps here: backfill the existing data, and ensure that when new reports or source locations are ingested the data is updated to reflect our best available versions.
This query is really useful:
with source_location_info as (
select
id,
source_uid,
json_extract_path(import_json::json, 'availability') as availability,
json_extract_path(import_json::json, 'inventory') as inventory
from
source_location
)
select
*
from
source_location_info
where
availability is not null
or inventory is not null
limit
100
with source_location_info as (
select
id,
source_uid,
json_extract_path(import_json::json, 'availability') as availability,
json_extract_path(import_json::json, 'inventory') as inventory
from
source_location
)
select
count(*)
from
source_location_info
where
availability is not null
or inventory is not null
Too long to run through the dashboard, so I used an unlimited local connection - it returned 155,304. select count(*) from source_location returns 238,815.
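For completeness, a rough ORM version of those counts - a sketch that assumes SourceLocation.import_json is a Django JSONField (so key lookups like import_json__availability work):

from django.db.models import Q

from core.models import SourceLocation  # assumed import path

# Source locations whose import_json has an availability or inventory key.
with_either = SourceLocation.objects.filter(
    Q(import_json__availability__isnull=False)
    | Q(import_json__inventory__isnull=False)
)
print(with_either.count())             # 155,304 when the SQL above was run
print(SourceLocation.objects.count())  # 238,815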
with source_location_info as (
select
id,
source_uid,
matched_location_id,
json_extract_path(import_json::json, 'availability') as availability,
json_extract_path(import_json::json, 'inventory') as inventory
from
source_location
)
select
count(distinct matched_location_id)
from
source_location_info
where
availability is not null
or inventory is not null
Takes 17.8s and returns 63,730 - and since select count(*) from location where soft_deleted = false returns 75,478, the VAST majority of our locations have useful availability or inventory information in their matched source locations.
I changed that where to:
where
availability is not null
and inventory is not null
And it returned 50,228 locations that have a matched source location with BOTH of those fields.
As for reports...
with skip_reports as (
select report_id from call_report_availability_tag where availabilitytag_id = (
select id from availability_tag where "group" = 'skip'
)
)
select count(distinct location_id) from report where id not in (select report_id from skip_reports) and vaccines_offered is not null
Returns 15,760 - there are 15,760 locations for which we have at least one non-skip report that has populated vaccines_offered data.
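The same count can be expressed in the ORM - a sketch, assuming the Report-to-AvailabilityTag many-to-many is exposed as availability_tags and the tag's group field is called group:

from core.models import Report  # assumed import path

# Reports that populated vaccines_offered and do not carry a "skip" tag.
non_skip_with_vaccines = Report.objects.filter(
    vaccines_offered__isnull=False
).exclude(availability_tags__group="skip")

# Distinct locations covered by at least one such report (15,760 above).
print(non_skip_with_vaccines.values("location_id").distinct().count())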
One way to look at this is that we have a sequence of opinions about which vaccines are offered - from imported source locations and from reports.
One slight hitch: since we over-write source locations when we import them, we don't have the full history of those opinions stored in our PostgreSQL database - though we likely have them in a git history somewhere.
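To make that "sequence of opinions" framing concrete, here's a small self-contained sketch (plain Python, not vial code) of merging timestamped opinions and taking the newest value per field:

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Opinion:
    source: str           # "report" or a scraper's source_name
    seen_at: datetime     # when the opinion was recorded
    vaccines_offered: Optional[List[str]] = None
    accepts_walkins: Optional[bool] = None
    accepts_appointments: Optional[bool] = None


def latest_value(opinions, field):
    "Return the value from the most recent opinion that actually has one."
    for opinion in sorted(opinions, key=lambda o: o.seen_at, reverse=True):
        value = getattr(opinion, field)
        if value is not None:
            return value
    return None


opinions = [
    Opinion("vaccinefinder_org", datetime(2021, 6, 1), accepts_walkins=False),
    Opinion("report", datetime(2021, 6, 10), vaccines_offered=["Pfizer"]),
    Opinion("getmyvax_org", datetime(2021, 6, 12), accepts_appointments=True),
]
print(latest_value(opinions, "vaccines_offered"))  # ['Pfizer']
print(latest_value(opinions, "accepts_walkins"))   # False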
Also interesting: which of our locations have the most matched source locations?
lpptz 63
lxfpc 50
lkqqg 43
ltgbp 33
lxfqk 31
ltfrw 26
lytcz 26
ltchz 25
ltbwh 24
ldmgq 24
lzhfr 23
ldmpt 23
ldmbg 22
lxdxp 22
lxtrg 21
ldkyw 21
lrcqm 21
ldktw 21
lxdxm 21
ldqyp 20
lrbxc 20
ltghm 20
ldwhh 20
lhydw 20
ldwzy 20
ldmfh 20
lcdfch 20
lcdffy 20
ldkyz 20
ldqxr 20
ldkwx 20
ldkyr 20
ldkzr 20
ldmzt 20
ldkwy 20
ldkwh 20
ldkyh 20
ldxqy 20
lyhch 20
ldkzz 20
ldkym 19
https://vial-staging.calltheshots.us/location/lpptz is the top one - that's Walgreens Co. #19134 in CT - because of the CT scrapers: https://vial.calltheshots.us/dashboard/?sql=select+source_uid%2C+source_name%2C+name+from+source_location+where+matched_location_id+%3D+35007%3AncVcHDs5Axbk6K_g7YTGp8BW-m7RT3HWkoABzBgzpgE
Actually that looks bad - I think a bunch of different CT Walgreens may have been incorrectly matched:
| source_uid | source_name | name |
| --- | --- | --- |
| ct_covidvaccinefinder_gov:605a3748494866ed91e08065 | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Glastonbury) |
| ct_covidvaccinefinder_gov:605a374a494866ed91e0807c | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Hartford, Albany Ave) |
| ct_covidvaccinefinder_gov:605a374a494866ed91e0807f | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Bristol, Main St) |
| ct_covidvaccinefinder_gov:605a374c494866ed91e0808a | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Terryville) |
| ct_covidvaccinefinder_gov:605a374d494866ed91e08095 | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Thomaston) |
| ct_covidvaccinefinder_gov:605a3750494866ed91e080ba | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Litchfield) |
I'm going to exclude ct_covidvaccinefinder_gov from this project for the moment.
Sample of the values of availability and inventory, taken by sampling the most recent 10,000 source locations that have them:
So those don't always have supply levels.
Here's the scraper code that sets drop_in: https://vaccinateca-ripgrep.datasette.app/-/ripgrep?pattern=drop_in&glob=
I ran this (took 20s so not through Django SQL Dashboard) to see which sources have both drop_in and appointment records:
select "source_name", count(*) as n from (with source_location_info as (
select
json_extract_path(import_json::json, 'availability', 'drop_in') as drop_in,
json_extract_path(import_json::json, 'availability', 'appointments') as appointments,
import_json,
id, source_uid, source_name, name, created_at, matched_location_id, last_imported_at
from
source_location
)
select
*
from
source_location_info
where
drop_in is not null and appointments is not null) as results group by "source_name" order by n desc
| source_name | n |
| --- | --- |
| vaccinate_nj | 1088 |
| nyc_arcgis | 590 |
| sf_gov | 74 |
| mn_gov | 22 |
| dc_district | 9 |
| ct_state | 5 |
Looks like vaccinefinder_org scrapes only ever have the drop_in true or false key, never the appointments key.
select json_extract_path(import_json::json, 'availability')::text as availability, count(*)
from source_location where source_name = 'vaccinefinder_org'
group by json_extract_path(import_json::json, 'availability')::text
{"drop_in": false} 41252
{"drop_in": true} 8259
4584
I'm adding the following Location columns, in order to better understand the source of data that we show to our users:
+ vaccines_offered_provenance_report = models.ForeignKey(
+ "Report",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The report that last populated vaccines_offered",
+ on_delete=models.PROTECT,
+ )
+ vaccines_offered_provenance_source_location = models.ForeignKey(
+ "SourceLocation",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The source location that last populated vaccines_offered",
+ on_delete=models.PROTECT,
+ )
+ vaccines_offered_last_updated_at = models.DateTimeField(
+ help_text="When vaccines_offered was last updated",
+ blank=True,
+ null=True,
+ )
+
+ appointments_walkins_provenance_report = models.ForeignKey(
+ "Report",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The report that last populated accepts_walkins and accepts_appointments",
+ on_delete=models.PROTECT,
+ )
+ appointments_walkins_provenance_source_location = models.ForeignKey(
+ "SourceLocation",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The source location that last populated accepts_walkins and accepts_appointments",
+ on_delete=models.PROTECT,
+ )
+ appointments_walkins_last_updated_at = models.DateTimeField(
+ help_text="When accepts_walkins and accepts_appointments were last updated",
+ blank=True,
+ null=True,
+ )
My code doesn't (yet) populate those new columns - that's what the save=True parameter is going to do.
I'm pushing this live to staging and then I'll track down a bunch of interesting examples - locations with multiple reports and source locations - that I can use to demonstrate what the derive_availability_and_inventory() method is currently doing.
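For context, this is the general shape I have in mind for derive_availability_and_inventory() - a simplified sketch only, with made-up helper names for gathering the candidate source locations and reports:

from django.utils import timezone


def derive_availability_and_inventory(self, save=False):
    # Hypothetical helpers: the newest trusted source location with inventory
    # data, and the newest non-skip report that populated vaccines_offered.
    source_location = self.newest_inventory_source_location()
    report = self.newest_vaccines_offered_report()

    candidates = [c for c in (source_location, report) if c is not None]
    if not candidates:
        return None

    # Prefer whichever opinion is more recent.
    winner = max(candidates, key=lambda c: c.created_at)
    vaccines = winner.derived_vaccines_offered()  # also hypothetical

    if save:
        self.vaccines_offered = vaccines
        self.vaccines_offered_last_updated_at = timezone.now()
        if winner is report:
            self.vaccines_offered_provenance_report = report
        else:
            self.vaccines_offered_provenance_source_location = source_location
        self.save()
    return vaccines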
Once I'm comfortable with the behaviour of that method, I'll do the following:

- Implement the save=True parameter to populate those new database columns

I want to find good example locations for this - locations that have both source locations AND reports against them which cover vaccines offered and availability.
Problem: we don't seem to have any on staging. Here's a query:
with last_1000_vaccine_source_locations as (
select * from source_location where json_extract_path(import_json::json, 'inventory') is not null
order by id desc limit 1000
),
all_reports_with_vaccines as (
select * from report where vaccines_offered is not null
)
select * from location where id in (
select matched_location_id from last_1000_vaccine_source_locations
) and id in (
select location_id from all_reports_with_vaccines
) limit 100
Since the code I've written so far is completely safe - it shows things on a debug page but doesn't update any database records - I'm going to ship it to production in order to see more examples.
My biggest question here is around the trustworthiness of our scrapers. We don't want to discard information from a highly trustworthy source because some other scraper submitted a more recent source location containing less useful data.
I think the fix for that is going to be allow-listing the scrapers - maybe even start with only vaccinefinder_org scrapers contributing to the result here. That's a small code change that can happen here:
I can use the new debug information on /location/x to confirm if this is a good idea or not, by first identifying good examples of locations that have different source location source names with different opinions on vaccines and availability.
select
  location.public_id,
  count(*),
  array_agg(source_location.source_name),
  count(distinct source_location.source_name) as source_name_count
from
  source_location
  join location on source_location.matched_location_id = location.id
group by
  location.public_id
having count(*) > 1
order by source_name_count desc
Here's a top result from that: https://vial.calltheshots.us/location/lqwzd
This variant of that query returns only locations that also have at least one non-skip report:
select
location.public_id,
count(*) as num_source_locations,
array_agg(distinct source_location.source_name),
count(distinct source_location.source_name) as num_distinct_source_names
from
source_location
join location on source_location.matched_location_id = location.id
where
-- Only locations that have at least one non-skip report
location.id in (
select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
)
group by
location.public_id
having count(*) > 1
order by num_distinct_source_names desc
https://vial.calltheshots.us/location/lykhz is an interesting example:
https://vial.calltheshots.us/dashboard/?sql=select%20%22json_extract_path%22%2C%20count%28%2A%29%20as%20n%20from%20%28select%20json_extract_path%28import_json%3A%3Ajson%2C%20%27availability%27%29%3A%3Atext%20from%20source_location%20where%20source_name%20%3D%20%27getmyvax_org%27%29%20as%20results%20group%20by%20%22json_extract_path%22%20order%20by%20n%20desc%3APE6S3PfuMROXLq15I658NslW4tlc2PMasbC3HeyDqBI confirms that getmyvax_org has NEVER returned a drop_in key, it only ever populates appointments.
Running this against the DB:
select "json_extract_path", count(*) as n from (select json_extract_path(import_json::json, 'availability')::text from source_location ) as results group by "json_extract_path" order by n desc
Returns this:
| availability | count |
| --- | --- |
| NULL | 94952 |
| {"appointments": true} | 75839 |
| {"drop_in": false} | 41698 |
| {"appointments": false} | 12170 |
| {"drop_in": true} | 8684 |
| {} | 5143 |
| {"drop_in": false, "appointments": true} | 1288 |
| {"drop_in": false, "appointments": false} | 384 |
| {"drop_in": true, "appointments": true} | 102 |
| {"drop_in": true, "appointments": false} | 16 |
So the drop_in key is set to true for 8,000+ locations.
Maybe if the most recent source location has no explicit opinion on drop-ins we should fall back to the most recent report, if one exists?
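A tiny sketch of that fallback rule (plain illustration - the first argument stands in for the newest source location's availability JSON, the second for the newest report's walk-in answer):

def derive_accepts_walkins(latest_availability, latest_report_walkins):
    # Use the scraper's explicit drop_in value if it has one,
    # otherwise fall back to the most recent report (which may be None).
    if latest_availability and "drop_in" in latest_availability:
        return latest_availability["drop_in"]
    return latest_report_walkins


print(derive_accepts_walkins({"appointments": True}, True))  # True, from the report
print(derive_accepts_walkins({"drop_in": False}, True))      # False, scraper wins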
I'm going to upgrade the display of that derived data on the /location page to make it much easier to interpret.
Examples I need to find:
And some report examples, using the query from https://github.com/CAVaccineInventory/vial/issues/650#issuecomment-863518479:
select
location.public_id,
count(*) as num_source_locations,
array_agg(distinct source_location.source_name),
count(distinct source_location.source_name) as num_distinct_source_names
from
source_location
join location on source_location.matched_location_id = location.id
where
-- Only locations that have at least one non-skip report
location.id in (
select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
)
group by
location.public_id
having count(*) > 1
and 'vaccinefinder_org' = any(array_agg(distinct source_location.source_name))
order by num_distinct_source_names desc
The and 'vaccinefinder_org' = any(array_agg(distinct source_location.source_name)) bit in the HAVING clause returns only locations that have at least one vaccinefinder_org source location.
I'm going to wrap up this work by:

- Allow-listing vaccinefinder_org and vaccinespotter_org and getmyvax_org, though maybe I should use us_carbon_health too?
- Implementing the save=True option on the location.derive_availability_and_inventory() method
- Running derive_availability_and_inventory(save=True) on a bunch of locations

It looks like us_carbon_health doesn't populate either inventory or availability, see https://vial.calltheshots.us/dashboard/?sql=select%0D%0A++last_imported_at%2C%0D%0A++json_extract_path%28import_json%3A%3Ajson%2C+%27availability%27%29+as+availability%2C%0D%0A++json_extract_path%28import_json%3A%3Ajson%2C+%27inventory%27%29+as+inventory%0D%0Afrom+source_location%0D%0Awhere+source_name+%3D+%27us_carbon_health%27%0D%0Aorder+by+id+desc%0D%0Alimit+1000%3A2YDNN-RMQeRDRCIpqP4YznKqND7SV9_sYDCNqaqlVRQ - so I'll skip it for the moment.
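A minimal sketch of that allow-list - the constant name, where it lives, and the reverse accessor matched_source_locations are assumptions; the source names are the ones discussed in this issue:

# Scrapers whose availability/inventory we trust when deriving location
# fields; us_carbon_health is left out for now.
TRUSTED_SOURCE_NAMES = {
    "vaccinefinder_org",
    "vaccinespotter_org",
    "getmyvax_org",
}


def trusted_source_locations(location):
    # Assumed reverse accessor from Location to its matched SourceLocations.
    return location.matched_source_locations.filter(
        source_name__in=TRUSTED_SOURCE_NAMES
    )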
Both vaccinefinder_org and getmyvax_org populate both inventory and availability.
The fact that vaccinespotter_org doesn't populate inventory does not mean we need to exclude it - this code here only considers source locations that DID populate inventory: https://github.com/CAVaccineInventory/vial/blob/9b7cc301c059be7c87cf3dbbe720c6d65a41ca05/vaccinate/core/models.py#L541-L546
Just realized I need to exclude reports with is_pending_review=True as well. (Already excluding soft_deleted.)
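So the candidate reports end up filtered roughly like this - a sketch, with the reverse accessor reports and the created_at ordering assumed:

def candidate_reports(location):
    return (
        location.reports.filter(soft_deleted=False, is_pending_review=False)
        .exclude(availability_tags__group="skip")  # same skip exclusion as before
        .order_by("-created_at")
    )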
I need to do another in-depth review of places that might add/remove/edit source locations and reports to make sure they all update the derived data correctly.
Here's a progress report on how population is going based on imported source locations and reports:
select count(*) from location where vaccines_offered_last_updated_at is not null says 13,387
select count(*) from location where appointments_walkins_last_updated_at is not null says 13,759
That's out of 77,067 not-soft-deleted locations.
I still need to run a back-fill mechanism for locations that haven't had a report or a source location import in the past week.
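The back-fill can then be a straightforward loop over the not-yet-populated locations - a sketch (assumed import path; batching, logging and error handling omitted):

from core.models import Location  # assumed import path

to_backfill = Location.objects.filter(
    soft_deleted=False,
    vaccines_offered_last_updated_at__isnull=True,
)

for location in to_backfill.iterator():
    location.derive_availability_and_inventory(save=True)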
On staging https://vial-staging.calltheshots.us/dashboard/?sql=select+count%28%2A%29+from+location+where+vaccines_offered_last_updated_at+is+not+null%3Ax05v1QiDVM1fIntikQuIJSI7Umqolhu3YXvrdJsdPiU&sql=select+count%28%2A%29+from+location+where+appointments_walkins_last_updated_at+is+not+null%3AZPXVjIp8wjdVr8xjePGpvlDD4GyWUDSgnB8rbZbuPoQ both return around 9,000 records, presumably due to test source location imports run against staging.
Replaces #504 - needed by #649.
From https://docs.google.com/document/d/17svyCVXcloArj1wbUgu7QwEb6xeIDa-p7C3U0_yLgKg/edit