covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Add `priority: 1` to state scrapers to avoid over-reliance on `us-covidtracking` source as state data source #318

Open jzohrab opened 4 years ago

jzohrab commented 4 years ago

NOTE: we need to implement https://github.com/covidatlas/li/issues/196 as a prerequisite for this, so that the DynamoDB records get updated.

While solving issue #313 ("tested starting on 2020-07-09"), I saw that many states use us-covidtracking almost exclusively as their data source. I pulled down the file https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/beta/latest/timeseries-byLocation.json, and got the following results:


# Note that "X..Y: us-covidtracking" means that
# us-covidtracking supplied all data, from date X to date Y.

...
  {
    locid: iso1:us#iso2:us-wv,
    sources: {
      2020-01-24..2020-03-05: {},
      2020-03-06..2020-07-11: us-covidtracking,
      2020-07-12: us-wv,
      2020-07-13: jhu-usa
    }
  },
  {
    locid: iso1:us#iso2:us-wy,
    sources: {
      2020-01-24..2020-03-06: {},
      2020-03-07..2020-07-11: us-covidtracking,
      2020-07-12: {jhu-usa:[deaths],us-wy:[cases]},
      2020-07-13: jhu-usa
    }
  }

by using this script:

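// Load the downloaded report.
const j = require('./timeseries-byLocation.json')

// Keep US locations only.
const us = j.filter(n => n.countryID.toLowerCase() === 'iso1:us')

// Keep state-level records (locationID has exactly two '#'-separated parts)
// and reduce each one to its ID and its per-date sources.
const ss = us
  .filter(n => n.locationID.split('#').length === 2)
  .map(n => ({
    locid: n.locationID,
    // timeseries: n.timeseries,
    sources: n.timeseriesSources
  }))

// Print each date (or date-range) entry on one line by pre-stringifying its value.
function replacer (key, value) {
  if (/\d{4}-\d{2}-\d{2}/.test(key)) return JSON.stringify(value)
  return value
}

// Strip the quotes, and the backslashes left behind by the nested stringify.
console.log(JSON.stringify(ss, replacer, 2).replace(/"/g, '').replace(/\\/g, ''))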

Many other states are similar.
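
To put a number on "almost exclusively", the output can be reduced to a day count per source. This is only a sketch: it assumes the keys of `sources` are either a single date or an inclusive `start..end` range, and that each value is a source name, an object keyed by source name, or `{}` for no data, as in the sample output above. The helper names (`daysInKey`, `sourceNames`, `daysPerSource`) are made up for illustration and are not part of Li.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Number of days covered by a key such as '2020-07-12' or
// '2020-03-07..2020-07-11' (ranges are treated as inclusive).
function daysInKey (key) {
  const [start, end = start] = key.split('..')
  return Math.round((Date.parse(end) - Date.parse(start)) / MS_PER_DAY) + 1
}

// A value is either a source name, or an object keyed by source name
// ({} when no source supplied data).
function sourceNames (value) {
  return typeof value === 'string' ? [value] : Object.keys(value)
}

// Total days on which each source supplied at least one field.
function daysPerSource (sources) {
  const totals = {}
  for (const [key, value] of Object.entries(sources)) {
    for (const name of sourceNames(value)) {
      totals[name] = (totals[name] || 0) + daysInKey(key)
    }
  }
  return totals
}

// Wyoming's entry, copied from the output above.
const wy = {
  '2020-01-24..2020-03-06': {},
  '2020-03-07..2020-07-11': 'us-covidtracking',
  '2020-07-12': { 'jhu-usa': ['deaths'], 'us-wy': ['cases'] },
  '2020-07-13': 'jhu-usa'
}

console.log(daysPerSource(wy))
// { 'us-covidtracking': 127, 'jhu-usa': 2, 'us-wy': 1 }
```

For Wyoming this gives 127 days from us-covidtracking against a single day from us-wy.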

This means that us-covidtracking is currently overriding our state-specific scrapers. For example, for Wyoming the us-wy scraper (us/wy) is the source for only a single day (2020-07-12), and then only for cases!

us-covidtracking has priority 0.5, but many state scrapers have no priority set (after updating master, run npm run list-sources to get the following):


> li@1.0.6 list-sources /Users/jeff/Documents/Projects/li
> node src/shared/sources/_lib/list-sources.js

New entry /Users/jeff/Documents/Projects/li/src/shared/sources/at/index.js for key path
New entry at/index.js for key shortPath
New entry au-act for key key
New entry /Users/jeff/Documents/Projects/li/src/shared/sources/au/act/index.js for key path
New entry au/act/index.js for key shortPath
New entry qgolsteyn, camjc for key maintainers
New entry herbcaudill, camjc for key maintainers
New entry qgolsteyn, camjc, jzohrab for key maintainers
New entry jhu-usa for key key
New entry us-ca-butte-county for key key
New entry /Users/jeff/Documents/Projects/li/src/shared/sources/us/ca/butte-county.js for key path
New entry us/ca/butte-county.js for key shortPath
New entry us-ca-colusa-county for key key
New entry /Users/jeff/Documents/Projects/li/src/shared/sources/us/ca/colusa-county.js for key path
New entry us/ca/colusa-county.js for key shortPath
New entry us-ca-contra-costa-county for key key
New entry /Users/jeff/Documents/Projects/li/src/shared/sources/us/ca/contra-costa-county.js for key path
New entry us/ca/contra-costa-county.js for key shortPath
New entry /Users/jeff/Documents/Projects/li/src/shared/sources/us/ca/los-angeles-county/index.js for key path
New entry us/ca/los-angeles-county/index.js for key shortPath
New entry us-ca-san-francisco-county for key key
New entry us-ca-san-luis-obispo-county for key key
Source ID                     priority  shared/sources                     
---------                     --------  --------------                     
at                            1         at/index.js                        
au-act                        2         au/act/index.js                    
au                            1         au/index.js                        
au-nsw                        2         au/nsw/index.js                    
au-nt                         2         au/nt/index.js                     
au-qld                                  au/qld/index.js                    
au-sa                                   au/sa/index.js                     
au-tas                        2         au/tas/index.js                    
au-vic                                  au/vic/index.js                    
au-wa                         2         au/wa/index.js                     
be                            1         be/index.js                        
br                            1         br/index.js                        
ca                            1         ca/index.js                        
ca-ns                                   ca/ns/index.js                     
ch                            1         ch/index.js                        
cn                            1         cn/index.js                        
cy                            1         cy/index.js                        
cz                            1         cz/index.js                        
de                                      de/index.js                        
ee                                      ee/index.js                        
es                            1         es/index.js                        
fr                            1         fr/index.js                        
gb-eng                                  gb/eng/index.js                    
gb-sct                                  gb/sct/index.js                    
hk                            1         hk/index.js                        
id                                      id/index.js                        
ie                                      ie/index.js                        
in                                      in/index.js                        
it                                      it/index.js                        
jhu-usa                       -1        jhu-usa.js                         
jhu                           -1        jhu.js                             
jp                            1         jp/index.js                        
kr                            1         kr/index.js                        
lc                                      lc/index.js                        
lt                            1         lt/index.js                        
lv                            1         lv/index.js                        
mm                            1         mm/index.js                        
my                            1         my/index.js                        
ng                            1         ng/index.js                        
nl                            1         nl/index.js                        
nyt                           -1        nyt/index.js                       
nz                                      nz/index.js                        
pl                                      pl/index.js                        
pr                                      pr/index.js                        
ru                            1         ru/index.js                        
sa                            1         sa/index.js                        
se                            1         se/index.js                        
si                            1         si/index.js                        
th                            1         th/index.js                        
tw                            1         tw/index.js                        
ua                            1         ua/index.js                        
us-al                                   us/al/index.js                     
us-ar                                   us/ar/index.js                     
us-az                                   us/az/index.js                     
us-ca-butte-county                      us/ca/butte-county.js              
us-ca-colusa-county                     us/ca/colusa-county.js             
us-ca-contra-costa-county               us/ca/contra-costa-county.js       
us-ca-fresno-county                     us/ca/fresno-county.js             
us-ca-kings-county            2         us/ca/kings-county/index.js        
us-ca-los-angeles-county      2         us/ca/los-angeles-county/index.js  
us-ca-mercury-news            1         us/ca/mercury-news.js              
us-ca-mono-county                       us/ca/mono-county.js               
us-ca-monterey-county                   us/ca/monterey-county.js           
us-ca-orange-county                     us/ca/orange-county.js             
us-ca-placer-county                     us/ca/placer-county.js             
us-ca-san-benito-county                 us/ca/san-benito-county.js         
us-ca-san-diego-county                  us/ca/san-diego-county.js          
us-ca-san-francisco-county              us/ca/san-francisco-county.js      
us-ca-san-joaquin-county                us/ca/san-joaquin-county.js        
us-ca-san-luis-obispo-county            us/ca/san-luis-obispo-county.js    
us-ca-san-mateo-county                  us/ca/san-mateo-county.js          
us-ca-shasta-county                     us/ca/shasta-county.js             
us-ca-solano-county                     us/ca/solano-county.js             
us-ca-sonoma-county           2         us/ca/sonoma-county/index.js       
us-ca-stanislaus-county                 us/ca/stanislaus-county.js         
us-ca-ventura-county                    us/ca/ventura-county.js            
us-co                         1         us/co/index.js                     
us-covidtracking              0.5       us/covidtracking.js                
us-ct                                   us/ct/index.js                     
us-de                                   us/de/index.js                     
us-fl                         1         us/fl/index.js                     
us-ga                                   us/ga/index.js                     
us-hi                         1         us/hi/index.js                     
us-ia                                   us/ia/index.js                     
us-il                         1         us/il/index.js                     
us-in                         1         us/in/index.js                     
us-me                                   us/me/index.js                     
us-mi                                   us/mi/index.js                     
us-nc                         1         us/nc/index.js                     
us-nd                         1         us/nd/index.js                     
us-nh                                   us/nh/index.js                     
us-nj                                   us/nj/index.js                     
us-nm                         1         us/nm/index.js                     
us-nv-carson-city                       us/nv/carson-city.js               
us-nv-clark-county                      us/nv/clark-county.js              
us-nv-nye-county                        us/nv/nye-county.js                
us-nv-washoe-county                     us/nv/washoe-county.js             
us-ny                                   us/ny/index.js                     
us-oh                                   us/oh/index.js                     
us-ok                         1         us/ok/index.js                     
us-or                         2         us/or/index.js                     
us-pa                                   us/pa/index.js                     
us-ri                         1         us/ri/index.js                     
us-sc                                   us/sc/index.js                     
us-sd                                   us/sd/index.js                     
us-tn                                   us/tn/index.js                     
us-tx                                   us/tx/index.js                     
us-ut                                   us/ut/index.js                     
us-va                                   us/va/index.js                     
us-vt                                   us/vt/index.js                     
us-wa                                   us/wa/index.js                     
us-wi                                   us/wi/index.js                     
us-wv                                   us/wv/index.js                     
us-wy                                   us/wy/index.js                     
vi                                      vi/index.js                        
vn                            1         vn/index.js                        
za                            1         za/index.js                        
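
The effect of the blank priorities in the table can be sketched as follows. This is not Li's actual selection code: it assumes that, for a given location and date, the source with the highest priority wins, and that a source with no `priority` counts as 0. The sample output above is consistent with that (us-wy beats jhu-usa at -1 but loses to us-covidtracking at 0.5), but it remains an assumption. `priorityOf` and `pickSource` are hypothetical names.

```javascript
// ASSUMPTION: a source with no priority is treated as priority 0.
function priorityOf (candidate) {
  return candidate.priority === undefined ? 0 : candidate.priority
}

// ASSUMPTION: the candidate with the highest priority supplies the data.
function pickSource (candidates) {
  return candidates.reduce((best, c) => (priorityOf(c) > priorityOf(best) ? c : best))
}

// Priorities as listed in the table above.
const current = [
  { source: 'us-wy' }, // blank in the table
  { source: 'us-covidtracking', priority: 0.5 },
  { source: 'jhu-usa', priority: -1 }
]
console.log(pickSource(current).source) // us-covidtracking

// The same candidates after adding `priority: 1` to the state scraper.
const proposed = current.map(c => (c.source === 'us-wy' ? { ...c, priority: 1 } : c))
console.log(pickSource(proposed).source) // us-wy
```

Under these assumptions, `priority: 1` is enough for a state scraper to take precedence over us-covidtracking whenever both have data for the same date.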

jzohrab commented 4 years ago

When this change is made, we'll need to force a full scrape of all of the changed scrapers, because the priority is recorded in the underlying DynamoDB record. Issue https://github.com/covidatlas/li/issues/196 is required for this.