Open TomGoBravo opened 4 years ago
Actually, looking at
git show `git rev-list -n 1 --first-parent --before="2020-07-24" master`:data/cases-cds/timeseries.csv | git lfs smudge | csvgrep -c state -m Pennsylvania | csvgrep -c tested -r . | csvcut -c name,level,date | csvsort -c date | csvlook |less
I see that for Pennsylvania counties the newest date with data is from 2020-06-07
But there are many counties where Corona Data Scraper had cases data for July but Project Li has none. I'm looking at rows of data we fetched last week (this creates a file with the names of every county in the US with cases in 2020-07 in data merged to https://github.com/covid-projections/covid-data-public last Friday):
git show `git rev-list -n 1 --first-parent --before="2020-07-24" master`:data/cases-cds/timeseries.csv | git lfs smudge | csvgrep -c level -m county | csvgrep -c country -m "United States" | csvgrep -c date -r '2020-07-..' | csvgrep -c cases -r . | csvcut -c name | perl -pe 's/United States/US/' | sort | uniq > data-20200724/cases-cds/timeseries-counties-cases-uniq
and comparing them to a similar file created from data fetched from https://coronadatascraper.com/timeseries.csv.zip today:
cat cds/timeseries.csv | csvgrep -c level -m county | csvgrep -c country -m "United States" | csvgrep -c date -r '2020-07-..' | csvgrep -c cases -r . | csvcut -c name | sort | uniq > cds/timeseries-counties-cases-uniq
It looks like 851 counties lost their cases data and 21 gained it.
Here are 4 examples:
diff data-20200724/cases-cds/timeseries-counties-cases-uniq cds/timeseries-counties-cases-uniq | grep Brown
< "Brown County, Kansas, US"
< "Brown County, Minnesota, US"
< "Brown County, South Dakota, US"
< "Brown County, Texas, US"
Looking at csvgrep -c name -m 'Brown County, Texas' data-20200724/cases-cds/timeseries.csv |csvcut -c name,date,cases
it seems like the timeseries of cases was legit. It goes up to 303 on 2020-07-23 and agrees with https://ktxs.com/news/local/brown-county-12-new-cases-of-covid-19-2-deaths (99 cases on 2020-07-04).
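The lost and gained counts can also be reproduced from the two name files without reading the full diff. This is a minimal sketch, assuming both files hold one county name per line as produced by the pipelines above; `count_only_in` is a hypothetical helper name, not part of csvkit or either project.

```shell
# count_only_in A B: count lines present in file A but absent from file B.
# -F fixed strings, -x whole-line match, -v invert, -c count, -f patterns from file.
count_only_in() {
  grep -Fxvc -f "$2" "$1" || true   # grep exits 1 when the count is 0
}

# Paths are the ones used in the commands above; skipped if the files are absent.
old=data-20200724/cases-cds/timeseries-counties-cases-uniq
new=cds/timeseries-counties-cases-uniq
if [ -f "$old" ] && [ -f "$new" ]; then
  echo "lost:   $(count_only_in "$old" "$new")"
  echo "gained: $(count_only_in "$new" "$old")"
fi
```

Whole-line fixed-string matching is used so that the quotes and commas in names such as "Brown County, Texas, US" are compared literally rather than as regular expressions.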
Hi @TomGoBravo , thx for the great notes. It looks like this is a result of a few things:
BE/index.js
CA/NS/index.js
CH/index.js
FR/index.js
PA/index.js
US/AZ/index.js
US/CA/mercury-news.js
US/CA/san-francisco-county.js
US/CA/santa-clara-county.js
US/DC/index.js
US/KS/index.js <<<
US/LA/index.js
US/MO/index.js
US/NV/washoe-county/index.js
US/TX/harris-county.js
Some of the ported sources are currently failing in live: ref https://api.covidatlas.com/status?format=html.
us-pa
I'll try fixing PA first, and see where that takes us.
Re "I see that for Pennsylvania counties the newest date with data is from 2020-06-07" - checking code comments and issues - we had an issue for that, https://github.com/covidatlas/coronadatascraper/issues/1055. PA changed their reporting to now use PDFs.
Code in src/shared/sources/us/pa/index.js has a comment:
// TODO (scrapers) us-pa stopped working 2020-06-08
// ref https://github.com/covidatlas/coronadatascraper/issues/1055
// Now data is present in PDFs at links on
// https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx
I'll switch from PA to KS first (one of the Brown County items you listed) to see if I can get that working.
Blarg, running into issues with getting KS to work. Similar to PA, KS switched to reporting stuff via PDFs, and for some reason the PDF code is not working -- have hacked around and can't grok it just yet. Will raise another issue for it.
I've also found a similar issue - lots of county-level data in the central US appears to be missing.
Hi all, I believe I've found the reason for the missing data, though I'm not sure what caused it.
Our reports are built up by location, stored in the Locations table. I checked the production table, and we don't have brown-county-texas-us (locationID iso1:us#iso2:us-tx#fips:48049), but we do have brown-county-illinois-us (iso1:us#iso2:us-il#fips:17009).
I'm not sure why that's the case -- the location data should be populated when data is scraped. We do have data for the brown-county-texas-us location:
"locationID (S)","dateSource (S)","cases (N)","country (S)","county (S)","date (S)","deaths (N)","priority (N)","source (S)","state (S)","updated (S)"
"iso1:us#iso2:us-tx#fips:48049","2020-07-01#jhu-usa","77","iso1:US","fips:48049","2020-07-01","10","-1","jhu-usa","iso2:US-TX","2020-08-02T10:08:53.548Z"
I'll look into a manual load of location data ... I don't know why we're loading locations during data scrape anyway, as we already have all of the location data.
ps - I haven't bothered looking into the other missing counties -- thanks for the list above -- but it seems highly likely this is the problem.
The "locations" lambda (which updates locations) appears to have been timing out. For most sources it's ok, but for something like jhu-usa, which updates thousands of locations, it fails. Local logging:
updating 153 of 3277: iso1:us#iso2:us-ak#fips:02050
updating 154 of 3277: iso1:us#iso2:us-ak#fips:02060
and it stops. I see errors in the lambda log, and am assuming it's that.
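To see how far a run got before it stopped, the last progress line can be pulled from a saved copy of the log. This is a rough sketch, assuming the log lines have the "updating N of M: locationID" shape shown above and have been written to a file; `last_progress` is a hypothetical helper name.

```shell
# last_progress LOGFILE: print "N M" taken from the final "updating N of M: ..." line.
last_progress() {
  grep -E '^updating [0-9]+ of [0-9]+:' "$1" | tail -n 1 |
    awk '{ sub(/:$/, "", $4); print $2, $4 }'   # strip the trailing colon from M
}
```

For the two log lines above this prints `154 3277`, meaning the run stopped after roughly 5% of the locations.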
I bumped up the timeout for the lambda. Updating all locations takes about 1.5 mins for jhu-usa locally. Simplifying the code slightly now.
I believe this will be addressed by https://github.com/covidatlas/li/pull/367. I'll launch that to production soon (< 15 mins). We'll need to wait for a jhu-usa scrape to update all of the locations.
Launched to prod ... let's see how things shake out.
Also assigning @TomGoBravo and @martynwong , if you see the data has filled in before I do, please close the issue. Cheers! jz
Hurrah! The data is working for me. Thanks!
tested data
Additional context
CovidActNow has been regularly fetching this file for months and making a copy at https://github.com/covid-projections/covid-data-public/commits/master/data/cases-cds/timeseries.csv
With the change to Project Li I noticed that many counties that used to have values in the tested column have no data now. The problem seems to be particularly bad in Pennsylvania.