FoveaCentral / vaccinesignup

A Twitter bot that notifies users about available vaccine appointments
https://twitter.com/vaccinesignup
Creative Commons Zero v1.0 Universal

How to handle limitations in LA County's data? #10

Closed ivanoblomov closed 3 years ago

ivanoblomov commented 3 years ago

Data issues

Constraints

  1. Because LA County publishes their appointment-location data as a monolithic JavaScript file, there is no good way to detect updates without:
    1. persisting the data
    2. comparing individual fields
  2. Worse, two-thirds of the locations have no primary keys, which again requires field comparisons when checking for updates.
  3. What constitutes a change worth notifying users about?
    1. Changes to a previously tweeted location?
    2. Addition of a new location?
    3. Deletion of a previously tweeted location?
    4. Note that detecting any of the above changes will effectively require recording state for every DM sent:
      1. Per user
      2. Per location
      3. And potentially per field

Current solution

LocationSyncer:

  1. reads appointment-locations from pod-data.js and saves them as discrete Locations in Postgres.
  2. uses primary keys to determine identity when available or the street address when no key exists.
  3. runs periodically to keep Locations in sync with any changes to the file.
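
The identity rule in step 2 could be sketched roughly as follows. This is a minimal illustration, not the actual LocationSyncer: the field names "id" and "addr1", the helper names, and the use of an in-memory hash instead of Postgres are all assumptions.

```ruby
# Hypothetical sketch of the identity rule: prefer LA County's key,
# fall back to the street address when no key exists.
# The field names "id" and "addr1" are assumptions, not confirmed from pod-data.js.
def identity_key(record)
  id = record["id"].to_s.strip
  return "id:#{id}" unless id.empty?

  # Normalize the address so trivial punctuation and case differences
  # do not create a second identity for the same site.
  address = record["addr1"].to_s.downcase.gsub(/[^a-z0-9]+/, " ").strip
  "addr:#{address}"
end

# Upsert records into a store keyed by identity (a Hash stands in for the
# Locations table here). Returns a new store; the input is not mutated.
def sync(existing, records)
  records.each_with_object(existing.dup) do |record, store|
    store[identity_key(record)] = record
  end
end
```

Normalizing the address only guards against cosmetic differences. A real address change would still produce a new identity.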

ivanoblomov commented 3 years ago

For Constraint 1, our data source is a static JS file on a .NET (IIS) server:

16:23:39 krovat'~/Documents/code/vaccine-notifier (main)$ curl -I http://publichealth.lacounty.gov/acd/ncorona2019/js/pod-data.js
HTTP/1.1 200 OK
Content-Length: 341932
Content-Type: application/x-javascript
Last-Modified: Thu, 04 Mar 2021 02:56:37 GMT
Accept-Ranges: bytes
ETag: "2ea859fea110d71:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Thu, 04 Mar 2021 21:26:40 GMT
Set-Cookie: ISD_cookie-encryption=!MROd5eoxP5tVUuj1OY+qJpYgVg4UfjmvE62YaUd1Bj/UGtJFNlhpD20hgdnQZE0q/7TBSPPZLKmMVac=; path=/; Httponly
Set-Cookie: visid_incap_2174300=MI78mBGESV+EgXjDg2uFsZJQQWAAAAAAQUIPAAAAAAAn0Z3Rw0EX9TLLPSqOKqJE; expires=Fri, 04 Mar 2022 08:40:34 GMT; HttpOnly; path=/; Domain=.lacounty.gov
Set-Cookie: incap_ses_1170_2174300=fyg6R/KmGgvzLZOXF608EJJQQWAAAAAAVfVvLP744uUBv5J8nYoG9Q==; path=/; Domain=.lacounty.gov
Set-Cookie: ___utmvmsFBuDPNZZ=yZqNiUJdWbZ; path=/; Max-Age=900
Set-Cookie: ___utmvasFBuDPNZZ=wkzPquY; path=/; Max-Age=900
Set-Cookie: ___utmvbsFBuDPNZZ=RZO
    XdvOlalP: UtM; path=/; Max-Age=900
X-CDN: Imperva
X-Iinfo: 13-27532160-27532161 NNNN CT(75 -1 0) RT(1614893202085 0) q(0 0 1 1) r(2 2) U6
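
Since the response above includes ETag and Last-Modified validators, a conditional GET might be a cheap way to tell whether the file changed at all. Here is a minimal Ruby sketch, assuming the server (and the Imperva CDN in front of it) honors If-None-Match and If-Modified-Since, which has not been verified here. The helper names conditional_request and fetch_if_changed are hypothetical.

```ruby
require "net/http"
require "uri"

POD_DATA_URL = "http://publichealth.lacounty.gov/acd/ncorona2019/js/pod-data.js"

# Build a GET request that asks the server to skip the body if the file
# still matches the validators saved from the previous fetch.
def conditional_request(url, etag: nil, last_modified: nil)
  request = Net::HTTP::Get.new(URI(url))
  request["If-None-Match"] = etag if etag
  request["If-Modified-Since"] = last_modified if last_modified
  request
end

# Returns the new body, or nil when the server answers 304 Not Modified.
# Assumption: the server actually supports conditional requests.
def fetch_if_changed(url, etag: nil, last_modified: nil)
  uri = URI(url)
  request = conditional_request(url, etag: etag, last_modified: last_modified)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPNotModified) ? nil : response.body
end
```

At best this would say that the file changed, not which locations did, so key or field comparisons would still be needed after a change is detected.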

The source for http://publichealth.lacounty.gov/acd/ncorona2019/vaccine/hcwsignup/ (the page that loads pod-data.js above) shows the JS file reference. Oddly, it's commented as a "development version". Note that the URL busts the cache by appending a random number as a query parameter.

<!-- <script src="../../js/pod-data.js?version=4"></script>-->
<!-- Development version: -->
<script>
  document.write('<script src="../../js/pod-data.js?dev=' + Math.floor(Math.random() * 1000) + '"\><\/script>');
</script>

wraasch commented 3 years ago

I think the most relevant "change" would be new clinics that are added (or suddenly have a link). If we store the results of the last query in a database, can we compare the next query against those records and identify new clinics? We can then retrieve the zip code of the new clinics and DM only users who requested that zip code.

ivanoblomov commented 3 years ago

I like this idea. Simpler than the other possibilities while still being pretty useful. But note that if we want only the new ones, then we are assuming the primary keys/street addresses are stable. Haven't analyzed the data over time to confirm that. (I'm ruling out doing field comparisons here because the scaling problems it would pose seem prohibitive to me.)

However, if we limit it to new clinics only but DM all the clinics for that zip, then we'd only need to persist the number of clinics per zip code and DM them all if that number increases. I'd favor this route not only because it's simpler to implement, but also because it's probably a better UX anyway, i.e., users don't have to look back in their history to see what the "old" ones were.
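
The count-per-zip idea could look something like the sketch below. It is only an illustration under assumptions: the "zip" field name, the helper names, and the plain hash of saved counts are hypothetical, not the project's actual schema.

```ruby
# Hypothetical sketch: count clinics per zip code.
# The "zip" field name is an assumption for illustration.
def counts_by_zip(locations)
  locations.each_with_object(Hash.new(0)) do |location, counts|
    zip = location["zip"]
    counts[zip] += 1 if zip
  end
end

# Zip codes whose clinic count grew since the last run.
# A zip that was never seen before counts as growing from 0.
def zips_to_notify(previous_counts, current_counts)
  current_counts.select { |zip, count| count > previous_counts.fetch(zip, 0) }.keys
end
```

For example, with saved counts of { "90012" => 2 } and three clinics now listed in 90012, zips_to_notify would return ["90012"]. One tradeoff: if a zip gains one clinic and loses another between runs, the count is unchanged and no DM would go out.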

wraasch commented 3 years ago

I really like the approach of counting clinics per zip as a proxy for changes worth DMing about. I compared a local scrape that I cached on 3/1 with today's data, 16 days later. Of 321 total clinics, 133 new clinics were added between the dates, 3 were removed, and only 1 site changed addresses.

The changed site was Pomona Fairplex, from 2370 East Arrow Highway, (Gate 15) to 2352 Arrow Hwy (Gate 15). I assume that the address change is trivial, given that Pomona Fairplex is a stationary site.
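
For anyone wanting to repeat this kind of audit, a snapshot comparison could be sketched as below. This is not necessarily how the numbers above were produced. It assumes each snapshot is an array of hashes, that sites are matched by a "name" field, and that the address lives in "addr1"; all of those names are hypothetical.

```ruby
# Hypothetical sketch: compare two cached snapshots of the location data.
# Matching sites by "name" and reading the address from "addr1" are
# assumptions for illustration.
def diff_snapshots(old_snapshot, new_snapshot)
  old_by_name = old_snapshot.to_h { |location| [location["name"], location] }
  new_by_name = new_snapshot.to_h { |location| [location["name"], location] }

  {
    added: new_by_name.keys - old_by_name.keys,
    removed: old_by_name.keys - new_by_name.keys,
    # Sites present in both snapshots whose address string differs.
    address_changed: (old_by_name.keys & new_by_name.keys).select do |name|
      old_by_name[name]["addr1"] != new_by_name[name]["addr1"]
    end
  }
end
```

Matching by name means a renamed site would show up as one removal plus one addition, so the counts are only as reliable as the names are stable.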

ivanoblomov commented 3 years ago

Thanks for doing the audit. That's great news! Was gonna set up some batch jobs to take snapshots of the data, so you saved me a lot of work. Could you also review your snapshots to see whether what few primary keys there are stay consistent? I'd have no reason to think otherwise, but I also don't want to make any assumptions considering the ramifications if we were wrong.

ivanoblomov commented 3 years ago

@wraasch assigning this to you since you're looking into how LA County's primary keys change.

wraasch commented 3 years ago

Just reviewed primary keys between today, 3/19, and 3/7 (the cached data from my earlier analysis). Overall, only 1 key changed, and it was actually the parent key (xParent) for the following set of records, which changed from 2 to 15: Crenshaw Clinic, San Fernando Clinic, Lincoln Park Clinic, Hansen Dam Recreational Center, and Dodger Stadium.

ivanoblomov commented 3 years ago

That’s great! Was going to ask whether you meant the “id” field, but I’m thinking we should probably persist that as la_id anyway to guarantee there are no collisions (since Rails’ default primary keys are auto-incrementing integers and would therefore overlap with LA County’s keys).

ivanoblomov commented 3 years ago

Closing since we've made the necessary changes. Feel free to reopen, or open another ticket, if there's something we still need to add.