CAVaccineInventory / vial

The Django application powering calltheshots.us
https://vial.calltheshots.us
MIT License

Identify locations that existed in a previous scraper run but do not exist any more #712

Open simonw opened 3 years ago

simonw commented 3 years ago

For #704 it's likely that one of the strongest signals we can get for if a location has shut down is if it no longer appears in our source location scraped data.

Can we use SQL to notice locations that appear to no longer be picked up by our scrapers?

simonw commented 3 years ago

Here's an interesting query: https://vial.calltheshots.us/dashboard/?sql=with+most_recent_import+as+(%0D%0A++select+source_name%2C+max(last_imported_at)+as+most_recent_for_source_location%0D%0A++from+source_location%0D%0A++group+by+source_name%0D%0A)%0D%0Aselect%0D%0A++source_location.source_name%2C+source_uid%2C+last_imported_at%2C+most_recent_for_source_location%0D%0Afrom%0D%0A++source_location+join+most_recent_import+on+source_location.source_name+%3D+most_recent_import.source_name%0D%0Awhere+--+it%27s+more+than+24+hours+away+from+the+most+recent+import%0D%0A++most_recent_for_source_location+-+INTERVAL+%272+DAYS%27+%3E+last_imported_at%3AGExHMWMOpNFEHojcnv-TDZOmruJEIOAcUFgvBye2yHw

with most_recent_import as (
  select source_name, max(last_imported_at) as most_recent_for_source_location
  from source_location
  group by source_name
)
select
  source_location.source_name, source_uid, last_imported_at, most_recent_for_source_location
from
  source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where -- it's more than 2 days away from the most recent import
  most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at

For each of our source_name values (each of which should correspond to a scraper) we find the most recent last_imported_at date.

Then we look for source locations with that source name which have a last_imported_at date more than 2 days prior to that most recent date.

simonw commented 3 years ago

A count(*) against that query returns 206,254 - that's out of a total of 241,631 records in source_location. That seems suspect - do we really have the vast majority of our source location records that far out of date?

simonw commented 3 years ago

Here's a version of that query which only looks at source locations for sources that have been scraped within the past 7 days (based on their most_recent_for_source_location value):

with most_recent_import as (
  select source_name, max(last_imported_at) as most_recent_for_source_location
  from source_location
  group by source_name
)
select
  count(*)
from
  source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where -- it's more than 2 days away from the most recent import
  most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at
  and now() - INTERVAL '7 DAYS' < most_recent_for_source_location

https://vial.calltheshots.us/dashboard/?sql=with+most_recent_import+as+(%0D%0A++select+source_name%2C+max(last_imported_at)+as+most_recent_for_source_location%0D%0A++from+source_location%0D%0A++group+by+source_name%0D%0A)%0D%0Aselect%0D%0A++source_location.source_name%2C+source_uid%2C+last_imported_at%2C+most_recent_for_source_location%0D%0Afrom%0D%0A++source_location+join+most_recent_import+on+source_location.source_name+%3D+most_recent_import.source_name%0D%0Awhere+--+it%27s+more+than+24+hours+away+from+the+most+recent+import%0D%0A++most_recent_for_source_location+-+INTERVAL+%272+DAYS%27+%3E+last_imported_at%0D%0A++and+now()+-+INTERVAL+%277+DAYS%27+%3C+most_recent_for_source_location%3AdZ49ezANY9TzSMYoQ6CZMrwtnz1OMkLpQsoBG-qi-Oo

Still returns 116,395 records.

I'm pausing this research to work on other things. I'm not at all convinced I've been running the right queries, so don't take anything in this issue thread up to this point as factual - it may well be the result of one or more mistakes.

simonw commented 3 years ago

Here's a parameterized query that takes a number of days and a source_name, and adds a column showing whether or not each location is out of date based on that number of days: https://vial.calltheshots.us/dashboard/?sql=with+most_recent_import+as+%28%0D%0A++select+source_name%2C+max%28last_imported_at%29+as+most_recent_for_source_location%0D%0A++from+source_location%0D%0A++group+by+source_name%0D%0A%29%0D%0Aselect%0D%0A++source_location.source_name%2C+source_uid%2C+last_imported_at%2C+most_recent_for_source_location%2C+most_recent_for_source_location+-+INTERVAL+%271+DAY%27+%2A+%25%28days%29s+%3E+last_imported_at+as+out_of_date%0D%0Afrom%0D%0A++source_location+join+most_recent_import+on+source_location.source_name+%3D+most_recent_import.source_name%0D%0Awhere%0D%0A++source_location.source_name+%3D+%25%28source_name%29s%3A3K0sZYmF0NHG9bmGn6uLlfz5GeNv8MMJsnxpkQImk8s&days=13&source_name=vaccinefinder_org
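Decoded from that URL, the query looks like this (the %(days)s and %(source_name)s placeholders are filled in by the dashboard):

```sql
with most_recent_import as (
  select source_name, max(last_imported_at) as most_recent_for_source_location
  from source_location
  group by source_name
)
select
  source_location.source_name, source_uid, last_imported_at, most_recent_for_source_location,
  most_recent_for_source_location - INTERVAL '1 DAY' * %(days)s > last_imported_at as out_of_date
from
  source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where
  source_location.source_name = %(source_name)s
```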

simonw commented 3 years ago

Here's why the results look odd: the scrapers have an optimization where, if a record hasn't changed, they don't send it to VIAL at all.

Discussed here: https://discord.com/channels/799147121357881364/813861006718926848/859865627191410700

One possible solution: teach the scrapers to send a special minimal document that tells VIAL "this location is still in the feed but has not changed since last time"

Maybe a special shape of document that gets sent to /api/importSourceLocations - {"source_name": "blah", "source_uid": "blah:123", "unchanged": true} - could work.

This could be used to populate a last_seen_at column.
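A rough sketch of what that could look like on the VIAL side - assuming a new last_seen_at column on source_location (hypothetical, not yet implemented):

```sql
-- Hypothetical: add a column recording when a scraper last saw each location
alter table source_location add column last_seen_at timestamp with time zone;

-- When an {"unchanged": true} document arrives for a location, bump
-- last_seen_at instead of re-importing the full record
update source_location
set last_seen_at = now()
where source_name = 'blah' and source_uid = 'blah:123';
```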

simonw commented 3 years ago

I feel pretty good about this as a way for us to start reliably removing locations that are no longer active - we can back it up with a dashboard showing "recently deactivated locations", plus some kind of manual override if we want to keep a location live even when a feed has stopped returning it.

Maybe we have an allow-list of scrapers that we trust to cause locations to be automatically deactivated - vaccinefinder_org for example.

And a human review queue for locations that go missing from other less trusted scrapers.
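That split could look something like this in SQL - assuming the last_seen_at column sketched above plus a hypothetical trusted_sources table holding the allow-list:

```sql
-- Locations from trusted sources that have gone missing: candidates for
-- automatic deactivation (trusted_sources is a hypothetical allow-list table)
select source_uid
from source_location
where source_name in (select source_name from trusted_sources)
  and last_seen_at < now() - INTERVAL '7 DAYS';

-- Locations missing from other, less trusted sources: send to a human
-- review queue instead of deactivating automatically
select source_uid
from source_location
where source_name not in (select source_name from trusted_sources)
  and last_seen_at < now() - INTERVAL '7 DAYS';
```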