Open jingjtang opened 4 years ago
Hi there @jingjtang, thanks for the issue! We recently converted to a new report, and this is a new bug. I'm not sure yet where it comes from, but I'm going to try to solve it now, as a few people have noted it. Thank you! jz
Hm, I just downloaded timeseries-byLocation.json from https://covidatlas.com/data (which links to a file on s3, https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/v1/latest/timeseries-byLocation.json), and its last date appears to be 2020-07-13:
"2020-07-13": {
"cases": 317167,
"deaths": 7017,
"tested": 3920501,
"hospitalized": 3591,
"recovered": 66625,
"icu": 532,
"growthFactor": 1
}
timeseries.csv linked on that same page does have the date you mentioned though:
MacBook-Air:Downloads jeff$ grep iso1:us#iso2:us-ca, timeseries.csv | tail -n 2
iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-30
iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-31
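For anyone consuming timeseries.csv, a check along these lines can flag rows dated after a chosen reference day. This is only a sketch: it assumes the date is the last column, as in the two rows above (the header row is not shown in this thread), and `future_rows` is a hypothetical helper, not part of the project.

```python
import csv
import io
from datetime import date

# Two rows copied from the grep output above. The date is assumed to be
# the last column, since the header row is not shown in this thread.
SAMPLE = '''iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-30
iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-31
'''

def future_rows(csv_text, today):
    """Return (location id, date) pairs for rows dated after `today`."""
    flagged = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        try:
            row_date = date.fromisoformat(row[-1])
        except ValueError:
            continue  # header row or an unparseable date: skip it
        if row_date > today:
            flagged.append((row[0], row_date.isoformat()))
    return flagged

# The issue was reported on 07-30, so use that as the reference day.
print(future_rows(SAMPLE, date(2020, 7, 30)))
```

With 07-30 as the reference day, only the 2020-07-31 row is flagged.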
There are a few things to diagnose here; checking.
baseData.json, which is the base data source for all reports, has future data:
"2020-07-31": {
"cases": 485300,
"deaths": 8901,
"tested": 5257026,
"hospitalized": 4048,
"recovered": 117835,
"icu": 632,
"growthFactor": 1
}
The dateSources in that report show:
"2020-07-30..2020-07-31": "us-ca-mercury-news"
Checking that source to see what's up.
merc news has the following source: https://docs.google.com/spreadsheets/d/1CwZA4RPNf_hUrwzNLyGGNHRlh1cwl8vDHwIoae51Hac/gviz/tq?tqx=out:csv&sheet=timeseries
But the latest date in that source is currently 07-29:
"County","Date","URL","Created","Last updated","Submitted By","Cases Total","Cases New Reported","Cases New Calculated","Cases Percent","Deaths Total","Deaths New","Tests Total","Tests New","Testing Turnaround","Pending Tests Current","Negative Tests Total","Negative Tests New","Positive Tests Total","Positive Tests New","Inconclusive Tests Total","Inconclusive Tests New","Recovered Total","Recovered New","Hospital Confirmed Total","Hospital Confirmed New","Hospital Confirmed Current","ICU Total","ICU New","ICU Current","Ventilator Total","Ventilator New","Ventilator Current","Symptomatic No Hospital Total","Symptomatic No Hospital New","Asymptomatic Total","Asymptomatic New","Hospital Suspected Total","Hospital Suspected New","Hospital Suspected Current","7-day new case average","7-day new death average","county population","14-day new cases","14-day new deaths","new daily testing (total or pos+neg)","total tests (total or calculated pos+neg)","7-day positivity rate","test rate 7-day average"
"Alameda","2020-07-29","http://www.acphd.org/2019-ncov.aspx","7/29/2020 12:44:12","7/29/2020 16:13:12","HR","10,773","","140","","181","0","","","","","","","","","","","","","","","","","","","","","","","","","","","","","161","1.428571429","1,685,886","13.6130201","27","0","0","","0.0000000"
"Alpine","2020-07-29","http://alpinecountyca.gov/Index.aspx?NID=516","7/29/2020 12:44:12","7/22/2020 18:15:35","","2","","0","","0","0","","","","","","","","","","","2","0","","","","","","","","","","","","","","","","","0","0","1,117","8.952551477","0","0","0","","0.0000000"
"Amador","2020-07-29","https://www.amadorgov.org/services/covid-19/-fsiteid-1","7/29/2020 12:44:12","7/28/2020 18:12:22","","89","","0","2.26%","0","0","3932","0","","","","","","","","","60","0","13","","4","","","","","","","","","","","","","","3","0","38,531","10.64078275","0","0","0","8.63%","1.0307100"
"Butte","2020-07-29","https://infogram.com/1pe66wmyjnmvkrhm66x9362kp3al60r57ex","7/29/2020 12:44:12","7/29/2020 16:46:04","EW","883","","17","4.97%","7","0","17755","0","","","16889","0","883","17","","","709","17","","","5","","","","","","","","","","","","","","29","0.2857142857","217,769","20.48041732","3","0","0","13.05%","0.9938447"
"Calaveras","2020-07-29","https://covid19.calaverasgov.us/","7/29/2020 12:44:12","7/28/2020 18:33:55","","108","","0","","1","0","","","","","","","","","","","80","0","","","1","","","","","","","","","","","","","","2","0","44,289","7.451060083","1","0","0","-0.34%","1.6740693"
"Colusa","2020-07-29","http://www.countyofcolusa.org/771/COVID19","7/29/2020 12:44:12","7/29/2020 16:51:13","EW","304","","12","","3","1","","","","","1768","4","304","12","","","214","5","","","4","","","","","","","","","","","","","","9","0.1428571429","22,593","72.58885496","3","0.7081839508","304","24.60%","1.5934139"
"Contra Costa","2020-07-29","https://cchealth.org/coronavirus/","7/29/2020 12:44:12","7/29/2020 12:45:36","HR","7,714","","410","5.74%","109","1","134,411","4,031","","","","","","","","","","","","","105","","","","","","","","","","","","","","216","1","1,160,099","22.18776156","17","3.474703452","4031","9.10%","2.0471159"
"Del Norte","2020-07-29","https://dnco.maps.arcgis.com/apps/opsdashboard/index.html#/3dd5de4df5194963853f7f40e38a3a01","7/29/2020 12:44:12","7/29/2020 18:09:50","EW","88","","0","2.54%","0","0","3466","-525","","","","","","","","","","","2","","0","","","","","","","","","","","","","","1","0","27,558","9.797517962","0","-19.05072937","-525","11.25%","0.4147098"
"El Dorado","2020-07-29","https://www.edcgov.us/Government/hhsa/Pages/EDCCOVID-19-Cases.aspx","7/29/2020 12:44:12","7/29/2020 18:09:31","EW","589","24","10","3.34%","1","0","17644","131","","","17055","121","589","10","","","386","16","","","1","","","1","","","","","","","","","","","15","0","193,098","11.70390165","1","0.6784119981","131","8.85%","0.8611467"
These return no records: grep 2020-07-30 data.csv, grep 2020-07-31 data.csv.
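The same check can be done by reading the sheet's "Date" column directly rather than grepping for each day. A minimal sketch, assuming the CSV has been downloaded as text; `latest_date` is a hypothetical helper, and the sample keeps only the first three columns of two rows from the sheet above.

```python
import csv
import io

# First three columns of two rows from the sheet above; the remaining
# columns are omitted here for brevity.
SHEET = '''"County","Date","URL"
"Alameda","2020-07-29","http://www.acphd.org/2019-ncov.aspx"
"Alpine","2020-07-29","http://alpinecountyca.gov/Index.aspx?NID=516"
'''

def latest_date(csv_text, column="Date"):
    """Return the greatest value in `column`.

    ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    """
    return max(row[column] for row in csv.DictReader(io.StringIO(csv_text)))

print(latest_date(SHEET))
```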
I recently updated merc news, so will check the old implementation to see if it messes up the dates.
The old code mishandled the dates. For example, running the scrape against the current data from the site logs the following: 2020-07-30 is newer than last sample 2020-07-29. Using last sample anyway.
So the merc news data is being recorded under 2020-07-30, even though the source only goes up to 2020-07-29.
Still doesn't explain 2020-07-31 showing up, still looking.
The output of running npm run gen-reports only contains dates up to 07-30; nothing is being forward-dated.
It currently is Friday July 31 in a few areas of the world -- Tokyo, for example -- but honestly I'd be surprised if our main running timezone was ahead of us that much! Will check prod log.
Checking prod data first. eg below is check for 2020-07-29:
The table has data for 2020-07-29, 2020-07-30, and 2020-07-31 as well. Not good. I re-checked the logs but couldn't see anything obvious.
The 2020-07-31 data was updated at 2020-07-30T12:38:47.347Z in DynamoDB. That is still 07-30 though, so I can't see why another date would be recorded.
I'm not sure what is happening in the code that is causing this, which doesn't fill me with confidence! The only thought I have here is that the lambda doing the scraping is running in a different timezone, and so assigning a different date. I can't see how it's in such an advanced timezone. Unfortunately our logging is inadequate at the moment, so I can't see how this was set to the future date.
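To make the timezone hypothesis concrete, the sketch below converts the update time of the 2020-07-31 record to a few fixed UTC offsets and prints the calendar date at each. It says nothing about how the lambda is actually configured. `local_date` is a hypothetical helper, and fixed offsets stand in for real timezone rules.

```python
from datetime import datetime, timedelta, timezone

# Update time reported for the 2020-07-31 record.
updated = datetime(2020, 7, 30, 12, 38, 47, 347000, tzinfo=timezone.utc)

def local_date(moment, utc_offset_hours):
    """Calendar date of `moment` for a clock at the given fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return moment.astimezone(tz).date().isoformat()

for label, offset in [
    ("UTC", 0),
    ("UTC-7 (US Pacific in July)", -7),
    ("UTC+9 (Tokyo)", 9),
    ("UTC+12", 12),
]:
    print(label, local_date(updated, offset))
```

Of these offsets, only UTC+12 puts that instant on 2020-07-31. A clock would need to be more than about 11 hours 21 minutes ahead of UTC for the date to roll over, which is well past Tokyo.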
Regardless, a fix that I implemented recently should result in the data having the actual date specified in the data files. I'll keep this issue open until we see the change in effect.
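The idea of labeling data with the date found in the source, rather than the date the scrape was requested for, can be sketched as follows. This is an illustration with hypothetical names (`resolve_sample_date`, `available`), not the project's actual fix.

```python
from datetime import date

def resolve_sample_date(requested, available):
    """Pick the date to record a sample under.

    Returns the latest available date on or before `requested`, or None
    if the source has nothing that old. The record should be labeled
    with the returned date, not with `requested`.
    """
    candidates = [d for d in available if d <= requested]
    return max(candidates) if candidates else None

# The source only goes up to 07-29, but the scrape asks for 07-30.
available = [date(2020, 7, 28), date(2020, 7, 29)]
print(resolve_sample_date(date(2020, 7, 30), available))
```

Here the sample is recorded under 2020-07-29, so falling back to the last sample no longer creates a 2020-07-30 entry.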
@jingjtang - I'll assign this to you as well to do the check in a couple of days. I'll check too if I can, though I'm spread thin these days. I'll try clearing out the 07-31 data points for mercury-news, though that's a slow operation. :-)
Thanks again @jingjtang for the issue.
I've also pushed #365 to staging and prod, which had the same forward-dating bug. I believe that this will fix the issue. It may take a couple of days for us to know.
Dear friends in the Corona Data Scraper group, thank you so much for providing such a source. I am using your data (mostly timeseries.zip) for COVID-19-related research. I found that the file contains future data for California, which confuses me. For example, today is 07-30, but there are case numbers for California for 07-31. Is there a mismatch between the cases/deaths/tested values and the dates?