covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Missing per-county tested data #356

Open TomGoBravo opened 4 years ago

TomGoBravo commented 4 years ago

Additional context

CovidActNow has been regularly fetching this file for months and making a copy at https://github.com/covid-projections/covid-data-public/commits/master/data/cases-cds/timeseries.csv

With the change to Project Li, I noticed that many counties that used to have values in the tested column now have no data. The problem seems to be particularly bad in Pennsylvania.

TomGoBravo commented 4 years ago

Actually, looking at

    git show `git rev-list -n 1 --first-parent --before="2020-07-24" master`:data/cases-cds/timeseries.csv | git lfs smudge | csvgrep -c state -m Pennsylvania | csvgrep -c tested -r . | csvcut -c name,level,date | csvsort -c date | csvlook | less

I see that for Pennsylvania counties the newest date with data is 2020-06-07.
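For readers without csvkit, the same check can be sketched in Python. This is only an illustration: the column names (`state`, `tested`, `date`) are taken from the command above, the sample rows are made up, and `newest_date_with_value` is a hypothetical helper, not part of any project mentioned here.

```python
import csv
import io


def newest_date_with_value(rows, state, column="tested"):
    """Return the latest ISO date for `state` where `column` is non-empty.

    Mirrors `csvgrep -c state -m STATE | csvgrep -c tested -r .`:
    a substring match on state, then any non-empty value in the column.
    """
    dates = [
        row["date"]
        for row in rows
        if state in row["state"] and row[column] != ""
    ]
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    return max(dates) if dates else None


# Made-up sample rows, for illustration only.
SAMPLE_CSV = """name,level,state,date,tested
"Example County A, Pennsylvania, US",county,Pennsylvania,2020-06-06,100
"Example County A, Pennsylvania, US",county,Pennsylvania,2020-06-07,120
"Example County A, Pennsylvania, US",county,Pennsylvania,2020-06-08,
"Example County B, Texas, US",county,Texas,2020-07-01,50
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(newest_date_with_value(rows, "Pennsylvania"))  # prints 2020-06-07
```

On the real file, `rows` would come from `csv.DictReader(open(path, newline=""))` instead of the in-memory sample.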

TomGoBravo commented 4 years ago

But there are many counties where Corona Data Scraper had cases data for July but Project Li has none. I'm looking at rows of data we fetched last week

(this creates a file with the names of every county in the US with cases in 2020-07 in data merged to https://github.com/covid-projections/covid-data-public last Friday)

    git show `git rev-list -n 1 --first-parent --before="2020-07-24" master`:data/cases-cds/timeseries.csv | git lfs smudge | csvgrep -c level -m county | csvgrep -c country -m "United States" | csvgrep -c date -r '2020-07-..' | csvgrep -c cases -r . | csvcut -c name | perl -pe 's/United States/US/' | sort | uniq > data-20200724/cases-cds/timeseries-counties-cases-uniq

and comparing them to a similar file created from data fetched from https://coronadatascraper.com/timeseries.csv.zip today:

    cat cds/timeseries.csv | csvgrep -c level -m county | csvgrep -c country -m "United States" | csvgrep -c date -r '2020-07-..' | csvgrep -c cases -r . | csvcut -c name | sort | uniq > cds/timeseries-counties-cases-uniq

It looks like there are 851 counties that lost cases data and 21 that gained it.
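The lost/gained counts are a set difference between the two name files. A minimal sketch of that comparison, assuming one county name per line in each file; `load_names` and `compare_counties` are hypothetical helpers, and the demo uses a tiny in-memory example rather than the real files:

```python
from pathlib import Path


def load_names(path):
    """Read one county name per line into a set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_counties(old_names, new_names):
    """Return (lost, gained): names only in the old set, names only in the new set."""
    return old_names - new_names, new_names - old_names


# Tiny illustrative sets; the real inputs would be
#   load_names("data-20200724/cases-cds/timeseries-counties-cases-uniq")
#   load_names("cds/timeseries-counties-cases-uniq")
old = {"Brown County, Texas, US", "Brown County, Illinois, US"}
new = {"Brown County, Illinois, US"}

lost, gained = compare_counties(old, new)
print(len(lost), "lost;", len(gained), "gained")
```

Both files have to use the same naming for this to work, which is why the first pipeline rewrites "United States" to "US" before sorting.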

TomGoBravo commented 4 years ago

Here are 4 examples:

    diff data-20200724/cases-cds/timeseries-counties-cases-uniq cds/timeseries-counties-cases-uniq | grep Brown
    < "Brown County, Kansas, US"
    < "Brown County, Minnesota, US"
    < "Brown County, South Dakota, US"
    < "Brown County, Texas, US"

Looking at

    csvgrep -c name -m 'Brown County, Texas' data-20200724/cases-cds/timeseries.csv | csvcut -c name,date,cases

it seems like the time series of cases was legit. It goes up to 303 on 2020-07-23 and agrees with https://ktxs.com/news/local/brown-county-12-new-cases-of-covid-19-2-deaths (99 cases on 2020-07-04).

jzohrab commented 4 years ago

Hi @TomGoBravo, thx for the great notes. It looks like this is a result of a few things:

BE/index.js
CA/NS/index.js
CH/index.js
FR/index.js
PA/index.js
US/AZ/index.js
US/CA/mercury-news.js
US/CA/san-francisco-county.js
US/CA/santa-clara-county.js
US/DC/index.js
US/KS/index.js   <<<
US/LA/index.js
US/MO/index.js
US/NV/washoe-county/index.js
US/TX/harris-county.js

Some of the ported sources are currently failing in live: ref https://api.covidatlas.com/status?format=html.

us-pa

I'll try fixing PA first, and see where that takes us.

jzohrab commented 4 years ago

Re "I see that for Pennsylvania counties the newest date with data is from 2020-06-07" - checking code comments and issues - we had an issue for that, https://github.com/covidatlas/coronadatascraper/issues/1055. PA changed their reporting to now use PDFs.

Code in src/shared/sources/us/pa/index.js has a comment:

    // TODO (scrapers) us-pa stopped working 2020-06-08
    // ref https://github.com/covidatlas/coronadatascraper/issues/1055
    // Now data is present in PDFs at links on
    // https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx

I'll switch from PA to KS first (one of the Brown County items you listed) to see if I can get that working.

jzohrab commented 4 years ago

Blarg, running into issues with getting KS to work. Similar to PA, KS switched to reporting stuff via PDFs, and for some reason the PDF code is not working -- have hacked around and can't grok it just yet. Will raise another issue for it.

martynwong commented 4 years ago

I've also found a similar issue - lots of county-level data in the central US appears to be missing.

jzohrab commented 4 years ago

Hi all, I believe I've found the reason for the missing data, though I'm not sure of the underlying cause.

Our reports are built up by location, stored in the Locations table. I checked the production table, and we don't have brown-county-texas-us (locationID iso1:us#iso2:us-tx#fips:48049), but we do have brown-county-illinois-us (iso1:us#iso2:us-il#fips:17009).

I'm not sure why that's the case -- the location data should be populated when data is scraped. We do have data for the brown-county-texas-us location:

    "locationID (S)","dateSource (S)","cases (N)","country (S)","county (S)","date (S)","deaths (N)","priority (N)","source (S)","state (S)","updated (S)"
    "iso1:us#iso2:us-tx#fips:48049","2020-07-01#jhu-usa","77","iso1:US","fips:48049","2020-07-01","10","-1","jhu-usa","iso2:US-TX","2020-08-02T10:08:53.548Z"

I'll look into a manual load of location data ... I don't know why we're loading locations during data scrape anyway, as we already have all of the location data.
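To make the check concrete: the locationIDs quoted above look like `#`-separated `key:value` segments, and the problem is an ID that appears in scraped data but not in the Locations table. A rough sketch of that comparison follows. The ID format is inferred only from the two IDs quoted in this thread, and both functions are hypothetical, not Li's code.

```python
def parse_location_id(location_id):
    """Split an ID such as 'iso1:us#iso2:us-tx#fips:48049' into a dict.

    Assumes '#' separates segments and the first ':' in each segment
    separates key from value (inferred from the IDs in this thread).
    """
    parts = {}
    for segment in location_id.split("#"):
        key, _, value = segment.partition(":")
        parts[key] = value
    return parts


def missing_locations(scraped_ids, known_ids):
    """Return IDs that have scraped data but no entry in the locations table."""
    return sorted(set(scraped_ids) - set(known_ids))


# The two IDs mentioned in this thread.
scraped = ["iso1:us#iso2:us-tx#fips:48049", "iso1:us#iso2:us-il#fips:17009"]
known = ["iso1:us#iso2:us-il#fips:17009"]

for location_id in missing_locations(scraped, known):
    print(location_id, parse_location_id(location_id))
```

Run against the full set of scraped IDs, a check like this would list every county in the same state as Brown County, Texas.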

ps - I haven't bothered looking into the other missing counties -- thanks for the list above -- but it seems highly likely this is the problem.

jzohrab commented 4 years ago

The "locations" lambda (which updates locations) appears to have been timing out. For most sources it's ok, but for something like jhu-usa, which updates thousands of locations, it fails. Local logging:

    updating 153 of 3277: iso1:us#iso2:us-ak#fips:02050
    updating 154 of 3277: iso1:us#iso2:us-ak#fips:02060

and it stops. I see errors in the lambda log, and am assuming they are timeouts.

I bumped up the timeout for the lambda. Updating all locations takes about 1.5 mins for jhu-usa locally. Simplifying the code slightly now.
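The fix described here is a longer timeout. For illustration only, a different way to keep a long update loop from being cut off silently is to give it an explicit time budget and report how far it got, so the remainder can be retried. This is a generic sketch, not what the locations lambda does; `update_with_budget`, `update_one`, and the injectable `clock` are all hypothetical.

```python
import time


def update_with_budget(location_ids, update_one, budget_seconds, clock=time.monotonic):
    """Update locations in order until the time budget is used up.

    Returns the number of locations updated, so a caller can resume
    from location_ids[count:] instead of losing the remainder.
    """
    deadline = clock() + budget_seconds
    count = 0
    for location_id in location_ids:
        if clock() >= deadline:
            break  # stop cleanly rather than being killed mid-update
        update_one(location_id)
        count += 1
    return count


# Example with a no-op updater and a generous budget.
ids = ["iso1:us#iso2:us-ak#fips:02050", "iso1:us#iso2:us-ak#fips:02060"]
done = update_with_budget(ids, update_one=lambda _id: None, budget_seconds=60)
print(f"updated {done} of {len(ids)}")
```

Whether this is worth the extra complexity depends on how often a single source updates thousands of locations. For a one-off case like jhu-usa, raising the timeout is the simpler choice.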

jzohrab commented 4 years ago

I believe this will be addressed by https://github.com/covidatlas/li/pull/367. I'll launch that to production soon (< 15 mins). We'll need to wait for a jhu-usa scrape to update all of the locations.

jzohrab commented 4 years ago

Launched to prod ... let's see how things shake out.

jzohrab commented 4 years ago

Also assigning @TomGoBravo and @martynwong: if you see that the data has filled in before I do, please close the issue. Cheers! jz

martynwong commented 4 years ago

Hurrah! The data is working for me. Thanks!