simonw opened this issue 3 years ago
The fields in question, added in #494: https://github.com/CAVaccineInventory/vial/blob/14c9098af1ad6b74aefdce84dda3e5a7719e070b/vaccinate/core/models.py#L241-L252
These were editable in the VIAL interface for a while - they are no longer editable (or at least they are hidden by default). The number of locations with these fields populated is:
| label | count |
| --- | --- |
| vaccines_offered | 109 |
| accepts_walkins | 133 |
| accepts_appointments | 158 |
| public_notes | 45 |
This is from the short period of time when these were editable. I'm going to export that data and otherwise pretend it didn't exist.
select
id, public_id, name, full_address,
vaccines_offered, accepts_walkins, accepts_appointments, public_notes
from
location
where (
vaccines_offered is not null
or
accepts_appointments is not null
or
accepts_walkins is not null
or
(public_notes is not null and public_notes != '')
)
Exported data is here: https://gist.github.com/simonw/d7644d1f444bc4221b3b284f73468360
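For the export itself, a rough ORM equivalent of that query looks like this - a sketch only, assuming the fields shown in the model snippet above and that the models live in core.models:

import json

from django.db.models import Q

from core.models import Location  # assumed import path

# Keep a location if any of the four old editable fields is populated,
# mirroring the SQL above.
rows = Location.objects.filter(
    Q(vaccines_offered__isnull=False)
    | Q(accepts_walkins__isnull=False)
    | Q(accepts_appointments__isnull=False)
    | (Q(public_notes__isnull=False) & ~Q(public_notes=""))
).values(
    "id", "public_id", "name", "full_address",
    "vaccines_offered", "accepts_walkins", "accepts_appointments", "public_notes",
)

with open("old-editable-fields.json", "w") as fp:
    json.dump(list(rows), fp, indent=2, default=str)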
There are two steps here: backfill the existing data, and ensure that when new reports or source locations are ingested the data is updated to reflect our best available versions.
This query is really useful:
with source_location_info as (
select
id,
source_uid,
json_extract_path(import_json::json, 'availability') as availability,
json_extract_path(import_json::json, 'inventory') as inventory
from
source_location
)
select
*
from
source_location_info
where
availability is not null
or inventory is not null
limit
100
with source_location_info as (
select
id,
source_uid,
json_extract_path(import_json::json, 'availability') as availability,
json_extract_path(import_json::json, 'inventory') as inventory
from
source_location
)
select
count(*)
from
source_location_info
where
availability is not null
or inventory is not null
Too long to run through the dashboard, so I used an unlimited local connection - it returned 155,304. select count(*) from source_location returns 238,815.
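For completeness, a rough ORM version of those counts - a sketch that assumes SourceLocation.import_json is a Django JSONField (so key lookups like import_json__availability work):

from django.db.models import Q

from core.models import SourceLocation  # assumed import path

# Source locations whose import_json has an availability or inventory key.
with_either = SourceLocation.objects.filter(
    Q(import_json__availability__isnull=False)
    | Q(import_json__inventory__isnull=False)
)
print(with_either.count())             # 155,304 when the SQL above was run
print(SourceLocation.objects.count())  # 238,815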
with source_location_info as (
select
id,
source_uid,
matched_location_id,
json_extract_path(import_json::json, 'availability') as availability,
json_extract_path(import_json::json, 'inventory') as inventory
from
source_location
)
select
count(distinct matched_location_id)
from
source_location_info
where
availability is not null
or inventory is not null
Takes 17.8s and returns 63,730 - and since select count(*) from location where soft_deleted = false returns 75,478, the VAST majority of our locations have useful availability or inventory information in their matched source locations.
I changed that where to:
where
availability is not null
and inventory is not null
And it returned 50,228 locations that have a matched source location with BOTH of those fields.
As for reports...
with skip_reports as (
select report_id from call_report_availability_tag where availabilitytag_id = (
select id from availability_tag where "group" = 'skip'
)
)
select count(distinct location_id) from report where id not in (select report_id from skip_reports) and vaccines_offered is not null
Returns 15,760 - there are 15,760 locations for which we have at least one non-skip report that has populated vaccines_offered data.
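The same count can be expressed in the ORM - a sketch, assuming the Report-to-AvailabilityTag many-to-many is exposed as availability_tags and the tag's group field is called group:

from core.models import Report  # assumed import path

# Reports that populated vaccines_offered and do not carry a "skip" tag.
non_skip_with_vaccines = Report.objects.filter(
    vaccines_offered__isnull=False
).exclude(availability_tags__group="skip")

# Distinct locations covered by at least one such report (15,760 above).
print(non_skip_with_vaccines.values("location_id").distinct().count())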
One way to look at this is that we have a sequence of opinions about which vaccines are offered - from imported source locations and from reports.
One slight hitch: since we over-write source locations when we import them, we don't have the full history of those opinions stored in our PostgreSQL database - though we likely have them in a git history somewhere.
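To make that "sequence of opinions" framing concrete, here's a small self-contained sketch (plain Python, not vial code) of merging timestamped opinions and taking the newest value per field:

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Opinion:
    source: str           # "report" or a scraper's source_name
    seen_at: datetime     # when the opinion was recorded
    vaccines_offered: Optional[List[str]] = None
    accepts_walkins: Optional[bool] = None
    accepts_appointments: Optional[bool] = None


def latest_value(opinions, field):
    "Return the value from the most recent opinion that actually has one."
    for opinion in sorted(opinions, key=lambda o: o.seen_at, reverse=True):
        value = getattr(opinion, field)
        if value is not None:
            return value
    return None


opinions = [
    Opinion("vaccinefinder_org", datetime(2021, 6, 1), accepts_walkins=False),
    Opinion("report", datetime(2021, 6, 10), vaccines_offered=["Pfizer"]),
    Opinion("getmyvax_org", datetime(2021, 6, 12), accepts_appointments=True),
]
print(latest_value(opinions, "vaccines_offered"))  # ['Pfizer']
print(latest_value(opinions, "accepts_walkins"))   # False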
Also interesting: which of our locations have the most matched source locations?
lpptz 63
lxfpc 50
lkqqg 43
ltgbp 33
lxfqk 31
ltfrw 26
lytcz 26
ltchz 25
ltbwh 24
ldmgq 24
lzhfr 23
ldmpt 23
ldmbg 22
lxdxp 22
lxtrg 21
ldkyw 21
lrcqm 21
ldktw 21
lxdxm 21
ldqyp 20
lrbxc 20
ltghm 20
ldwhh 20
lhydw 20
ldwzy 20
ldmfh 20
lcdfch 20
lcdffy 20
ldkyz 20
ldqxr 20
ldkwx 20
ldkyr 20
ldkzr 20
ldmzt 20
ldkwy 20
ldkwh 20
ldkyh 20
ldxqy 20
lyhch 20
ldkzz 20
ldkym 19
https://vial-staging.calltheshots.us/location/lpptz is the top one - that's Walgreens Co. #19134 in CT - because of the CT scrapers: https://vial.calltheshots.us/dashboard/?sql=select+source_uid%2C+source_name%2C+name+from+source_location+where+matched_location_id+%3D+35007%3AncVcHDs5Axbk6K_g7YTGp8BW-m7RT3HWkoABzBgzpgE
Actually that looks bad - I think a bunch of different CT Walgreens may have been incorrectly matched:
| source_uid | source_name | name |
| --- | --- | --- |
| ct_covidvaccinefinder_gov:605a3748494866ed91e08065 | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Glastonbury) |
| ct_covidvaccinefinder_gov:605a374a494866ed91e0807c | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Hartford, Albany Ave) |
| ct_covidvaccinefinder_gov:605a374a494866ed91e0807f | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Bristol, Main St) |
| ct_covidvaccinefinder_gov:605a374c494866ed91e0808a | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Terryville) |
| ct_covidvaccinefinder_gov:605a374d494866ed91e08095 | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Thomaston) |
| ct_covidvaccinefinder_gov:605a3750494866ed91e080ba | ct_covidvaccinefinder_gov | Walgreens Pharmacy (Litchfield) |
I'm going to exclude ct_covidvaccinefinder_gov from this project for the moment.
Sample of the values of availability and inventory, taken by sampling the most recent 10,000 source locations that have them:
So those don't always have supply levels.
Here's the scraper code that sets drop_in: https://vaccinateca-ripgrep.datasette.app/-/ripgrep?pattern=drop_in&glob=
I ran this (took 20s so not through Django SQL Dashboard) to see which sources have both drop_in and appointment records:
select "source_name", count(*) as n from (with source_location_info as (
select
json_extract_path(import_json::json, 'availability', 'drop_in') as drop_in,
json_extract_path(import_json::json, 'availability', 'appointments') as appointments,
import_json,
id, source_uid, source_name, name, created_at, matched_location_id, last_imported_at
from
source_location
)
select
*
from
source_location_info
where
drop_in is not null and appointments is not null) as results group by "source_name" order by n desc
| source_name | n |
| --- | --- |
| vaccinate_nj | 1088 |
| nyc_arcgis | 590 |
| sf_gov | 74 |
| mn_gov | 22 |
| dc_district | 9 |
| ct_state | 5 |
Looks like vaccinefinder_org scrapes only ever have the drop_in true or false key, never the appointments key.
select json_extract_path(import_json::json, 'availability')::text as availability, count(*)
from source_location where source_name = 'vaccinefinder_org'
group by json_extract_path(import_json::json, 'availability')::text
{"drop_in": false} 41252
{"drop_in": true} 8259
4584
I'm adding the following Location columns, in order to better understand the source of data that we show to our users:
+ vaccines_offered_provenance_report = models.ForeignKey(
+ "Report",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The report that last populated vaccines_offered",
+ on_delete=models.PROTECT,
+ )
+ vaccines_offered_provenance_source_location = models.ForeignKey(
+ "SourceLocation",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The source location that last populated vaccines_offered",
+ on_delete=models.PROTECT,
+ )
+ vaccines_offered_last_updated_at = models.DateTimeField(
+ help_text="When vaccines_offered was last updated",
+ blank=True,
+ null=True,
+ )
+
+ appointments_walkins_provenance_report = models.ForeignKey(
+ "Report",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The report that last populated accepts_walkins and accepts_appointments",
+ on_delete=models.PROTECT,
+ )
+ appointments_walkins_provenance_source_location = models.ForeignKey(
+ "SourceLocation",
+ null=True,
+ blank=True,
+ related_name="+",
+ help_text="The source location that last populated accepts_walkins and accepts_appointments",
+ on_delete=models.PROTECT,
+ )
+ appointments_walkins_last_updated_at = models.DateTimeField(
+ help_text="When accepts_walkins and accepts_appointments were last updated",
+ blank=True,
+ null=True,
+ )
My code doesn't (yet) populate those new columns - that's what the save=True parameter is going to do.
I'm pushing this live to staging and then I'll track down a bunch of interesting examples - locations with multiple reports and source locations - that I can use to demonstrate what the derive_availability_and_inventory() method is currently doing.
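For context, this is the general shape I have in mind for derive_availability_and_inventory() - a simplified sketch only, with made-up helper names for gathering the candidate source locations and reports:

from django.utils import timezone


def derive_availability_and_inventory(self, save=False):
    # Hypothetical helpers: the newest trusted source location with inventory
    # data, and the newest non-skip report that populated vaccines_offered.
    source_location = self.newest_inventory_source_location()
    report = self.newest_vaccines_offered_report()

    candidates = [c for c in (source_location, report) if c is not None]
    if not candidates:
        return None

    # Prefer whichever opinion is more recent.
    winner = max(candidates, key=lambda c: c.created_at)
    vaccines = winner.derived_vaccines_offered()  # also hypothetical

    if save:
        self.vaccines_offered = vaccines
        self.vaccines_offered_last_updated_at = timezone.now()
        if winner is report:
            self.vaccines_offered_provenance_report = report
        else:
            self.vaccines_offered_provenance_source_location = source_location
        self.save()
    return vaccines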
Once I'm comfortable with the behaviour of that method, I'll do the following:

- Implement the save=True parameter to populate those new database columns

I want to find good example locations for this - locations that have both source locations AND reports against them which cover vaccines offered and availability.
Problem: we don't seem to have any on staging. Here's a query:
with last_1000_vaccine_source_locations as (
select * from source_location where json_extract_path(import_json::json, 'inventory') is not null
order by id desc limit 1000
),
all_reports_with_vaccines as (
select * from report where vaccines_offered is not null
)
select * from location where id in (
select matched_location_id from last_1000_vaccine_source_locations
) and id in (
select location_id from all_reports_with_vaccines
) limit 100
Since the code I've written so far is completely safe - it shows things on a debug page but doesn't update any database records - I'm going to ship it to production in order to see more examples.
My biggest question here is around the trustworthiness of our scrapers. We don't want to discard information from a highly trustworthy source because some other scraper submitted a more recent source location containing less useful data.
I think the fix for that is going to be allow-listing the scrapers - maybe even start with only vaccinefinder_org scrapers contributing to the result here. That's a small code change that can happen here:
I can use the new debug information on /location/x to confirm if this is a good idea or not, by first identifying good examples of locations that have different source location source names with different opinions on vaccines and availability.
select
  location.public_id,
  count(*),
  array_agg(source_location.source_name),
  count(distinct source_location.source_name) as source_name_count
from
  source_location
  join location on source_location.matched_location_id = location.id
group by
  location.public_id
having count(*) > 1
order by source_name_count desc
Here's a top result from that: https://vial.calltheshots.us/location/lqwzd
This variant of that query returns only locations that also have at least one non-skip report:
select
location.public_id,
count(*) as num_source_locations,
array_agg(distinct source_location.source_name),
count(distinct source_location.source_name) as num_distinct_source_names
from
source_location
join location on source_location.matched_location_id = location.id
where
-- Only locations that have at least one non-skip report
location.id in (
select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
)
group by
location.public_id
having count(*) > 1
order by num_distinct_source_names desc
https://vial.calltheshots.us/location/lykhz is an interesting example:
https://vial.calltheshots.us/dashboard/?sql=select%20%22json_extract_path%22%2C%20count%28%2A%29%20as%20n%20from%20%28select%20json_extract_path%28import_json%3A%3Ajson%2C%20%27availability%27%29%3A%3Atext%20from%20source_location%20where%20source_name%20%3D%20%27getmyvax_org%27%29%20as%20results%20group%20by%20%22json_extract_path%22%20order%20by%20n%20desc%3APE6S3PfuMROXLq15I658NslW4tlc2PMasbC3HeyDqBI confirms that getmyvax_org has NEVER returned a drop_in key, it only ever populates appointments.
Running this against the DB:
select "json_extract_path", count(*) as n from (select json_extract_path(import_json::json, 'availability')::text from source_location ) as results group by "json_extract_path" order by n desc
Returns this:
| availability | count |
| --- | --- |
| NULL | 94952 |
| {"appointments": true} | 75839 |
| {"drop_in": false} | 41698 |
| {"appointments": false} | 12170 |
| {"drop_in": true} | 8684 |
| {} | 5143 |
| {"drop_in": false, "appointments": true} | 1288 |
| {"drop_in": false, "appointments": false} | 384 |
| {"drop_in": true, "appointments": true} | 102 |
| {"drop_in": true, "appointments": false} | 16 |
So the drop_in key is set to true for 8,000+ locations.
Maybe if the most recent source location has no explicit opinion on drop-ins we should fall back to the most recent report, if one exists?
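A tiny sketch of that fallback rule (plain illustration - the first argument stands in for the newest source location's availability JSON, the second for the newest report's walk-in answer):

def derive_accepts_walkins(latest_availability, latest_report_walkins):
    # Use the scraper's explicit drop_in value if it has one,
    # otherwise fall back to the most recent report (which may be None).
    if latest_availability and "drop_in" in latest_availability:
        return latest_availability["drop_in"]
    return latest_report_walkins


print(derive_accepts_walkins({"appointments": True}, True))  # True, from the report
print(derive_accepts_walkins({"drop_in": False}, True))      # False, scraper wins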
I'm going to upgrade the display of that derived data on the /location page to make it much easier to interpret.
Examples I need to find:
And some report examples, using the query from https://github.com/CAVaccineInventory/vial/issues/650#issuecomment-863518479:
select
location.public_id,
count(*) as num_source_locations,
array_agg(distinct source_location.source_name),
count(distinct source_location.source_name) as num_distinct_source_names
from
source_location
join location on source_location.matched_location_id = location.id
where
-- Only locations that have at least one non-skip report
location.id in (
select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
)
group by
location.public_id
having count(*) > 1
and 'vaccinefinder_org' = any(array_agg(distinct source_location.source_name))
order by num_distinct_source_names desc
The and 'vaccinefinder_org' = any(array_agg(distinct source_location.source_name)) bit in the HAVING clause returns only locations that have at least one vaccinefinder_org source location.
I'm going to wrap up this work by:

- Allow-listing vaccinefinder_org and vaccinespotter_org and getmyvax_org, though maybe I should use us_carbon_health too?
- Implementing the save=True option on the location.derive_availability_and_inventory() method
- Running derive_availability_and_inventory(save=True) on a bunch of locations

It looks like us_carbon_health doesn't populate either inventory or availability, see https://vial.calltheshots.us/dashboard/?sql=select%0D%0A++last_imported_at%2C%0D%0A++json_extract_path%28import_json%3A%3Ajson%2C+%27availability%27%29+as+availability%2C%0D%0A++json_extract_path%28import_json%3A%3Ajson%2C+%27inventory%27%29+as+inventory%0D%0Afrom+source_location%0D%0Awhere+source_name+%3D+%27us_carbon_health%27%0D%0Aorder+by+id+desc%0D%0Alimit+1000%3A2YDNN-RMQeRDRCIpqP4YznKqND7SV9_sYDCNqaqlVRQ - so I'll skip it for the moment.
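A minimal sketch of that allow-list - the constant name, where it lives, and the reverse accessor matched_source_locations are assumptions; the source names are the ones discussed in this issue:

# Scrapers whose availability/inventory we trust when deriving location
# fields; us_carbon_health is left out for now.
TRUSTED_SOURCE_NAMES = {
    "vaccinefinder_org",
    "vaccinespotter_org",
    "getmyvax_org",
}


def trusted_source_locations(location):
    # Assumed reverse accessor from Location to its matched SourceLocations.
    return location.matched_source_locations.filter(
        source_name__in=TRUSTED_SOURCE_NAMES
    )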
Both vaccinefinder_org and getmyvax_org populate both inventory and availability.
The fact that vaccinespotter_org doesn't populate inventory does not mean we need to exclude it - this code here only considers source locations that DID populate inventory: https://github.com/CAVaccineInventory/vial/blob/9b7cc301c059be7c87cf3dbbe720c6d65a41ca05/vaccinate/core/models.py#L541-L546
Just realized I need to exclude reports with is_pending_review=True as well. (Already excluding soft_deleted.)
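So the candidate reports end up filtered roughly like this - a sketch, with the reverse accessor reports and the created_at ordering assumed:

def candidate_reports(location):
    return (
        location.reports.filter(soft_deleted=False, is_pending_review=False)
        .exclude(availability_tags__group="skip")  # same skip exclusion as before
        .order_by("-created_at")
    )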
I need to do another in-depth review of places that might add/remove/edit source locations and reports to make sure they all update the derived data correctly.
Here's a progress report on how population is going based on imported source locations and reports:
select count(*) from location where vaccines_offered_last_updated_at is not null says 13,387
select count(*) from location where appointments_walkins_last_updated_at is not null says 13,759
That's out of 77,067 not-soft-deleted locations.
I still need to run a back-fill mechanism for locations that haven't had a report or a source location import in the past week.
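The back-fill can then be a straightforward loop over the not-yet-populated locations - a sketch (assumed import path; batching, logging and error handling omitted):

from core.models import Location  # assumed import path

to_backfill = Location.objects.filter(
    soft_deleted=False,
    vaccines_offered_last_updated_at__isnull=True,
)

for location in to_backfill.iterator():
    location.derive_availability_and_inventory(save=True)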
On staging https://vial-staging.calltheshots.us/dashboard/?sql=select+count%28%2A%29+from+location+where+vaccines_offered_last_updated_at+is+not+null%3Ax05v1QiDVM1fIntikQuIJSI7Umqolhu3YXvrdJsdPiU&sql=select+count%28%2A%29+from+location+where+appointments_walkins_last_updated_at+is+not+null%3AZPXVjIp8wjdVr8xjePGpvlDD4GyWUDSgnB8rbZbuPoQ both return around 9,000 records, presumably due to test source location imports run against staging.
Replaces #504 - needed by #649.
From https://docs.google.com/document/d/17svyCVXcloArj1wbUgu7QwEb6xeIDa-p7C3U0_yLgKg/edit