Closed ivanoblomov closed 3 years ago
For Constraint 1, our data source is a static JS file on a .NET (IIS) server:
16:23:39 krovat'~/Documents/code/vaccine-notifier (main)$ curl -I http://publichealth.lacounty.gov/acd/ncorona2019/js/pod-data.js
HTTP/1.1 200 OK
Content-Length: 341932
Content-Type: application/x-javascript
Last-Modified: Thu, 04 Mar 2021 02:56:37 GMT
Accept-Ranges: bytes
ETag: "2ea859fea110d71:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Thu, 04 Mar 2021 21:26:40 GMT
Set-Cookie: ISD_cookie-encryption=!MROd5eoxP5tVUuj1OY+qJpYgVg4UfjmvE62YaUd1Bj/UGtJFNlhpD20hgdnQZE0q/7TBSPPZLKmMVac=; path=/; Httponly
Set-Cookie: visid_incap_2174300=MI78mBGESV+EgXjDg2uFsZJQQWAAAAAAQUIPAAAAAAAn0Z3Rw0EX9TLLPSqOKqJE; expires=Fri, 04 Mar 2022 08:40:34 GMT; HttpOnly; path=/; Domain=.lacounty.gov
Set-Cookie: incap_ses_1170_2174300=fyg6R/KmGgvzLZOXF608EJJQQWAAAAAAVfVvLP744uUBv5J8nYoG9Q==; path=/; Domain=.lacounty.gov
Set-Cookie: ___utmvmsFBuDPNZZ=yZqNiUJdWbZ; path=/; Max-Age=900
Set-Cookie: ___utmvasFBuDPNZZ=wkzPquY; path=/; Max-Age=900
Set-Cookie: ___utmvbsFBuDPNZZ=RZO
XdvOlalP: UtM; path=/; Max-Age=900
X-CDN: Imperva
X-Iinfo: 13-27532160-27532161 NNNN CT(75 -1 0) RT(1614893202085 0) q(0 0 1 1) r(2 2) U6
The source for http://publichealth.lacounty.gov/acd/ncorona2019/vaccine/hcwsignup/ (which references pod-data.js above) shows the JS file reference. Oddly, it's commented as a "development version". Note that the URL busts the cache with a random query parameter (an integer from 0 to 999).
<!-- <script src="../../js/pod-data.js?version=4"></script>-->
<!-- Development version: -->
<script>
document.write('<script src="../../js/pod-data.js?dev=' + Math.floor(Math.random() * 1000) + '"\><\/script>');
</script>
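If our fetcher needs to bypass the same cache, it could mimic that query parameter. A minimal sketch, assuming a hypothetical helper named cache_busted_url (not part of the codebase) and the `dev` parameter seen in the page source:

```ruby
require "uri"

# URL taken from the curl command above.
POD_DATA_URL = "http://publichealth.lacounty.gov/acd/ncorona2019/js/pod-data.js"

# Hypothetical helper: appends a random "dev" value in 0..999, mirroring the
# page's Math.floor(Math.random() * 1000).
def cache_busted_url(base = POD_DATA_URL, seed: rand(1000))
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(dev: seed)
  uri.to_s
end

puts cache_busted_url(seed: 42)
```

Passing the seed as a keyword argument keeps the helper deterministic in tests.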
I think the most relevant "change" would be new clinics that are added (or suddenly have a link). If we store the results of the last query in a database, can we compare the next query against those records and identify new clinics? We can then retrieve the zip code of the new clinics and DM only users who requested that zip code.
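A rough sketch of that comparison, assuming each clinic record is a Hash with hypothetical :id and :zip fields (the real pod-data.js field names may differ) and that subscriptions map a zip code to the users who requested it:

```ruby
require "set"

# Returns the clinics in `current` whose ids were not in the stored snapshot.
def new_clinics(previous, current)
  known_ids = previous.map { |clinic| clinic[:id] }.to_set
  current.reject { |clinic| known_ids.include?(clinic[:id]) }
end

# Returns the users subscribed to any zip code that gained a new clinic.
def recipients_for(clinics, subscriptions)
  zips = clinics.map { |clinic| clinic[:zip] }.uniq
  zips.flat_map { |zip| subscriptions.fetch(zip, []) }.uniq
end

# Illustrative data only.
previous = [{ id: 1, zip: "90001" }]
current  = [{ id: 1, zip: "90001" }, { id: 2, zip: "90002" }]
subscriptions = { "90001" => ["alice"], "90002" => ["bob"] }

added = new_clinics(previous, current)
puts recipients_for(added, subscriptions).inspect
```

This only works if the ids are stable between scrapes, which is the assumption questioned in the next comment.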
I like this idea. Simpler than the other possibilities while still being pretty useful. But note that if we want only the new ones, then we are assuming the primary keys/street addresses are stable. Haven't analyzed the data over time to confirm that. (I'm ruling out doing field comparisons here because the scaling problems it would pose seem prohibitive to me.)
However, if we trigger only on new clinics but DM all the clinics for that zip, then we'd only need to persist the number of clinics per zip code and DM them all if that number increases. I'd favor this route not only because it's simpler to implement, but because it's probably a better UX anyway, i.e., users don't have to look back in their history to see what the "old" ones were.
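The count-per-zip idea could look something like this. It is a sketch only: the :zip field name and the shape of the stored counts (a Hash of zip to integer) are assumptions.

```ruby
# Counts clinics per zip code in the latest scrape.
def counts_by_zip(clinics)
  clinics.each_with_object(Hash.new(0)) { |clinic, counts| counts[clinic[:zip]] += 1 }
end

# Returns the zip codes whose clinic count grew since the stored counts.
def zips_with_more_clinics(stored_counts, clinics)
  counts_by_zip(clinics)
    .select { |zip, count| count > stored_counts.fetch(zip, 0) }
    .keys
end

# Illustrative data only.
stored = { "90001" => 1 }
latest = [{ zip: "90001" }, { zip: "90001" }, { zip: "90002" }]
puts zips_with_more_clinics(stored, latest).inspect
```

One caveat with a pure count: if a zip loses one clinic and gains another in the same interval, the count is unchanged and nobody gets a DM.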
I really like the approach of counting clinics per zip as a proxy for changes worth DMing about. I compared a local scrape that I cached on 3/1 against today's data, 16 days later. Of 321 total clinics, 133 new clinics were added between the dates, 3 were removed, and only 1 site changed addresses.
The changed site was Pomona Fairplex from 2370 East Arrow Highway, (Gate 15) to 2352 Arrow Hwy (Gate 15). I assume that the address change is trivial, given that Pomona Fairplex is a stationary site.
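For anyone who wants to repeat this audit, here is a sketch of the snapshot comparison. The :id and :address field names and the id value are assumptions; only the two Pomona Fairplex addresses come from the data above.

```ruby
# Compares two snapshots and reports which ids were added or removed, and
# which ids kept their key but changed address.
def audit(old_snapshot, new_snapshot)
  old_by_id = old_snapshot.to_h { |clinic| [clinic[:id], clinic] }
  new_by_id = new_snapshot.to_h { |clinic| [clinic[:id], clinic] }
  shared_ids = old_by_id.keys & new_by_id.keys

  {
    added: new_by_id.keys - old_by_id.keys,
    removed: old_by_id.keys - new_by_id.keys,
    address_changed: shared_ids.select do |id|
      old_by_id[id][:address] != new_by_id[id][:address]
    end
  }
end

# The id 7 is made up for illustration.
march_1 = [{ id: 7, address: "2370 East Arrow Highway, (Gate 15)" }]
today   = [{ id: 7, address: "2352 Arrow Hwy (Gate 15)" }]
puts audit(march_1, today).inspect
```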
Thanks for doing the audit. That's great news! Was gonna set up some batch jobs to take snapshots of the data, so you saved me a lot of work. Could you also review your snapshots to see whether what few primary keys there are stay consistent? I'd have no reason to think otherwise, but I also don't want to make any assumptions considering the ramifications if we were wrong.
@wraasch assigning this to you since you're looking into how LA County's primary keys change.
Just reviewed primary keys between today 3/19 and 3/7 (the cached data from my earlier analysis). Overall, only 1 key changed, and it was actually the parent key (xParent) for the following set of records, which changed from 2 to 15: Crenshaw Clinic, San Fernando Clinic, Lincoln Park Clinic, Hansen Dam Recreational Center, and Dodger Stadium.
That’s great! Was going to ask whether you meant the “id” field, but I’m thinking we should probably persist that as la_id anyway to guarantee there are no collisions (since Rails’ default primary keys are auto-incrementing integers and would therefore overlap with LA County’s keys).
On Mar 19, 2021, at 6:09 PM, wraasch wrote:
> Just reviewed primary keys between today 3/19 and 3/7 (the cached data from my earlier analysis). Overall, only 1 key changed, and it was actually the parent key (xParent) for the following set of records, which changed from 2 to 15:
> Crenshaw Clinic
> San Fernando Clinic
> Lincoln Park Clinic
> Hansen Dam Recreational Center
> Dodger Stadium
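To illustrate the la_id suggestion, here is a plain-Ruby stand-in for the model (not the actual Rails code). LocationStore and its methods are hypothetical; the point is that local ids are assigned by auto-increment while LA County's key is stored separately and used for lookups.

```ruby
# In-memory stand-in for a table with a local auto-incrementing id and a
# separate la_id column holding LA County's key.
class LocationStore
  Location = Struct.new(:id, :la_id, :name)

  def initialize
    @by_la_id = {}
    @next_id = 1
  end

  # Updates the record with this la_id if it exists, otherwise creates one.
  def upsert(la_id:, name:)
    location = @by_la_id[la_id]
    if location
      location.name = name
    else
      location = Location.new(@next_id, la_id, name)
      @by_la_id[la_id] = location
      @next_id += 1
    end
    location
  end
end

store = LocationStore.new
first = store.upsert(la_id: 15, name: "Dodger Stadium")
puts "local id #{first.id}, la_id #{first.la_id}"
```

In the real app this would presumably be a unique index on la_id plus a find-or-create keyed on it, but that part is an assumption about the schema.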
Closing since we've made the necessary changes. Feel free to reopen, or open another ticket, if there's something we still need to add.
Data issues
Constraints
Current solution
LocationSyncer: reads pod-data.js and saves its entries as discrete Locations in Postgres, keeping Locations in sync with any changes to the file.