covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Future data provided for California in timeseries.csv #360

Open jingjtang opened 4 years ago

jingjtang commented 4 years ago

Dear friends in the Corona Data Scraper group, thank you so much for providing such a source. I am using your data (mostly timeseries.zip) for COVID-19 related research. I found that the file contains future data for California, which confuses me. For example, today is 07-30, but there are case numbers for California dated 07-31. Is there a mismatch between the cases/deaths/tested values and the dates?

jzohrab commented 4 years ago

Hi there @jingjtang , thanks for the issue! We have recently converted to a new report and this is a new bug. I'm not sure yet where it comes from, but I'm going to try to solve it now as a few people have noted it. Thank you! jz

jzohrab commented 4 years ago

Hm, I just downloaded timeseries-byLocation.json from https://covidatlas.com/data (which links to a file on s3, https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/v1/latest/timeseries-byLocation.json), and its last date appears to be

      "2020-07-13": {
        "cases": 317167,
        "deaths": 7017,
        "tested": 3920501,
        "hospitalized": 3591,
        "recovered": 66625,
        "icu": 532,
        "growthFactor": 1
      }

timeseries.csv linked on that same page does have the date you mentioned though:

 MacBook-Air:Downloads jeff$ grep iso1:us#iso2:us-ca, timeseries.csv | tail -n 2
iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-30
iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-31

There are a few things to diagnose here; checking.
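For anyone who wants to repeat this check without grep, here is a minimal Python sketch that flags rows dated after a given day. It assumes the date is the final column, as in the two rows above; the function name `future_rows` is made up for illustration, and the sample rows are abbreviated versions of the real ones.

```python
import csv
import io
from datetime import date

def future_rows(csv_text, today):
    """Return (location_id, date) pairs whose final column is a date after `today`."""
    hits = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        try:
            row_date = date.fromisoformat(row[-1])
        except ValueError:
            continue  # header row, or a final column that is not a date
        if row_date > today:
            hits.append((row[0], row_date.isoformat()))
    return hits

# Abbreviated rows (most columns dropped); the real file has many more fields.
sample = (
    'iso1:us#iso2:us-ca,california-us,"California, US",state,485300,8901,2020-07-30\n'
    'iso1:us#iso2:us-ca,california-us,"California, US",state,485300,8901,2020-07-31\n'
)
print(future_rows(sample, date(2020, 7, 30)))
```

With 2020-07-30 as "today", only the 07-31 row is reported.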

jzohrab commented 4 years ago

baseData.json, which is the base data source for all reports, has future data:

      "2020-07-31": {
        "cases": 485300,
        "deaths": 8901,
        "tested": 5257026,
        "hospitalized": 4048,
        "recovered": 117835,
        "icu": 632,
        "growthFactor": 1
      }

The dateSources in that report show

      "2020-07-30..2020-07-31": "us-ca-mercury-news"

Checking that source to see what's up.
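A rough sketch of how the dateSources entry can be used to pin a future date on a source. It assumes keys are either a single date or a "start..end" range like the one above (the single-date form is my assumption); `expand_range` and `future_sources` are hypothetical helpers, not part of the project.

```python
from datetime import date, timedelta

def expand_range(key):
    """Expand 'YYYY-MM-DD..YYYY-MM-DD' (or a single date) into a list of dates."""
    start_text, _, end_text = key.partition("..")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text) if end_text else start
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]

def future_sources(date_sources, today):
    """Map each date after `today` to the source that supplied it."""
    return {
        d.isoformat(): source
        for key, source in date_sources.items()
        for d in expand_range(key)
        if d > today
    }

date_sources = {"2020-07-30..2020-07-31": "us-ca-mercury-news"}
print(future_sources(date_sources, date(2020, 7, 30)))
```

Run against the entry above with 07-30 as the cutoff, this points at us-ca-mercury-news for 2020-07-31.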

jzohrab commented 4 years ago

merc news has the following source: https://docs.google.com/spreadsheets/d/1CwZA4RPNf_hUrwzNLyGGNHRlh1cwl8vDHwIoae51Hac/gviz/tq?tqx=out:csv&sheet=timeseries

But the latest date in that source is currently 07-29:

"County","Date","URL","Created","Last updated","Submitted By","Cases Total","Cases New Reported","Cases New Calculated","Cases Percent","Deaths Total","Deaths New","Tests Total","Tests New","Testing Turnaround","Pending Tests Current","Negative Tests Total","Negative Tests New","Positive Tests Total","Positive Tests New","Inconclusive Tests Total","Inconclusive Tests New","Recovered Total","Recovered New","Hospital Confirmed Total","Hospital Confirmed New","Hospital Confirmed Current","ICU Total","ICU New","ICU Current","Ventilator Total","Ventilator New","Ventilator Current","Symptomatic No Hospital Total","Symptomatic No Hospital New","Asymptomatic Total","Asymptomatic New","Hospital Suspected Total","Hospital Suspected New","Hospital Suspected Current","7-day new case average","7-day new death average","county population","14-day new cases","14-day new deaths","new daily testing (total or pos+neg)","total tests (total or calculated pos+neg)","7-day positivity rate","test rate 7-day average"
"Alameda","2020-07-29","http://www.acphd.org/2019-ncov.aspx","7/29/2020 12:44:12","7/29/2020 16:13:12","HR","10,773","","140","","181","0","","","","","","","","","","","","","","","","","","","","","","","","","","","","","161","1.428571429","1,685,886","13.6130201","27","0","0","","0.0000000"
"Alpine","2020-07-29","http://alpinecountyca.gov/Index.aspx?NID=516","7/29/2020 12:44:12","7/22/2020 18:15:35","","2","","0","","0","0","","","","","","","","","","","2","0","","","","","","","","","","","","","","","","","0","0","1,117","8.952551477","0","0","0","","0.0000000"
"Amador","2020-07-29","https://www.amadorgov.org/services/covid-19/-fsiteid-1","7/29/2020 12:44:12","7/28/2020 18:12:22","","89","","0","2.26%","0","0","3932","0","","","","","","","","","60","0","13","","4","","","","","","","","","","","","","","3","0","38,531","10.64078275","0","0","0","8.63%","1.0307100"
"Butte","2020-07-29","https://infogram.com/1pe66wmyjnmvkrhm66x9362kp3al60r57ex","7/29/2020 12:44:12","7/29/2020 16:46:04","EW","883","","17","4.97%","7","0","17755","0","","","16889","0","883","17","","","709","17","","","5","","","","","","","","","","","","","","29","0.2857142857","217,769","20.48041732","3","0","0","13.05%","0.9938447"
"Calaveras","2020-07-29","https://covid19.calaverasgov.us/","7/29/2020 12:44:12","7/28/2020 18:33:55","","108","","0","","1","0","","","","","","","","","","","80","0","","","1","","","","","","","","","","","","","","2","0","44,289","7.451060083","1","0","0","-0.34%","1.6740693"
"Colusa","2020-07-29","http://www.countyofcolusa.org/771/COVID19","7/29/2020 12:44:12","7/29/2020 16:51:13","EW","304","","12","","3","1","","","","","1768","4","304","12","","","214","5","","","4","","","","","","","","","","","","","","9","0.1428571429","22,593","72.58885496","3","0.7081839508","304","24.60%","1.5934139"
"Contra Costa","2020-07-29","https://cchealth.org/coronavirus/","7/29/2020 12:44:12","7/29/2020 12:45:36","HR","7,714","","410","5.74%","109","1","134,411","4,031","","","","","","","","","","","","","105","","","","","","","","","","","","","","216","1","1,160,099","22.18776156","17","3.474703452","4031","9.10%","2.0471159"
"Del Norte","2020-07-29","https://dnco.maps.arcgis.com/apps/opsdashboard/index.html#/3dd5de4df5194963853f7f40e38a3a01","7/29/2020 12:44:12","7/29/2020 18:09:50","EW","88","","0","2.54%","0","0","3466","-525","","","","","","","","","","","2","","0","","","","","","","","","","","","","","1","0","27,558","9.797517962","0","-19.05072937","-525","11.25%","0.4147098"
"El Dorado","2020-07-29","https://www.edcgov.us/Government/hhsa/Pages/EDCCOVID-19-Cases.aspx","7/29/2020 12:44:12","7/29/2020 18:09:31","EW","589","24","10","3.34%","1","0","17644","131","","","17055","121","589","10","","","386","16","","","1","","","1","","","","","","","","","","","15","0","193,098","11.70390165","1","0.6784119981","131","8.85%","0.8611467"

These return no records: grep 2020-07-30 data.csv, grep 2020-07-31 data.csv.
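The same check can be done by reading the "Date" column directly instead of grepping for specific dates. A small sketch, assuming the CSV has been saved locally and that the dates are ISO formatted (so plain string comparison orders them correctly); `latest_date` is a hypothetical helper and the sample keeps only three of the columns shown above.

```python
import csv
import io

def latest_date(csv_text, column="Date"):
    """Return the largest value in the date column, or None if there is none."""
    reader = csv.DictReader(io.StringIO(csv_text))
    dates = [row[column] for row in reader if row.get(column)]
    return max(dates) if dates else None

# Three columns only; the real sheet has many more.
sample = (
    '"County","Date","Cases Total"\n'
    '"Alameda","2020-07-29","10,773"\n'
    '"Alpine","2020-07-29","2"\n'
)
print(latest_date(sample))
```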

I recently updated merc news, so will check the old implementation to see if it messes up the dates.

jzohrab commented 4 years ago

The old code handled the dates badly. For example, running scrape against the current data from the site logs the following: "2020-07-30 is newer than last sample 2020-07-29. Using last sample anyway." So merc news data is being recorded under the date 2020-07-30, even though the source only goes up to 2020-07-29.
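To make the failure mode concrete, here is a simplified reconstruction based only on the log message above. It is not the project's actual code, and both function names are invented. The first function falls back to the last sample but keeps the requested date; the second keeps the date the sample actually carries.

```python
def latest_sample(samples):
    """Return (date, data) for the most recent sample; keys are ISO date strings."""
    last = max(samples)
    return last, samples[last]

def scrape_forward_dated(samples, requested):
    """Fallback that labels the last sample with the requested date (the bug)."""
    if requested in samples:
        return requested, samples[requested]
    _, data = latest_sample(samples)
    return requested, data

def scrape_source_dated(samples, requested):
    """Fallback that keeps the date found in the source data."""
    if requested in samples:
        return requested, samples[requested]
    return latest_sample(samples)

samples = {"2020-07-29": {"cases": 10773}}  # Alameda total from the sheet above
print(scrape_forward_dated(samples, "2020-07-30"))
print(scrape_source_dated(samples, "2020-07-30"))
```

The first call reports the 07-29 numbers under 2020-07-30; the second reports them under 2020-07-29.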

Still doesn't explain 2020-07-31 showing up, still looking.

jzohrab commented 4 years ago

The output of npm run gen-reports only contains dates up to 07-30; nothing is being forward-dated.

It currently is Friday July 31 in a few areas of the world -- Tokyo, for example -- but honestly I'd be surprised if our main running timezone was ahead of us that much! Will check prod log.
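As an illustration of how the same instant lands on different calendar dates, here is a short sketch using fixed UTC offsets. The instant is a made-up example, and the offsets are hard-coded (UTC-7 for Los Angeles in July, UTC+9 for Tokyo) rather than looked up from a timezone database.

```python
from datetime import datetime, timedelta, timezone

def local_date(instant_utc, offset_hours):
    """Calendar date of a UTC instant as seen at a fixed UTC offset."""
    zone = timezone(timedelta(hours=offset_hours))
    return instant_utc.astimezone(zone).date().isoformat()

instant = datetime(2020, 7, 30, 16, 0, tzinfo=timezone.utc)  # hypothetical moment
for name, offset in [("Los Angeles (PDT)", -7), ("UTC", 0), ("Tokyo (JST)", 9)]:
    print(name, local_date(instant, offset))
```

At that instant it is still 2020-07-30 in Los Angeles and UTC, but already 2020-07-31 in Tokyo.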

jzohrab commented 4 years ago

Checking prod data first. E.g. below is the check for 2020-07-29:

image

The table has data for 2020-07-29, 2020-07-30, and 2020-07-31 as well. Not good. I re-checked the logs but couldn't see anything obvious.

jzohrab commented 4 years ago

The 2020-07-31 data was updated at 2020-07-30T12:38:47.347Z in DynamoDB. That is still 07-30 though, so I can't see why another date would have been recorded.
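A quick back-of-the-envelope check on that timestamp: how far ahead of UTC would a clock need to be for that instant to already fall on 07-31? This is only arithmetic on the timestamp above; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def min_offset_for_next_day(instant_utc):
    """Smallest UTC offset at which `instant_utc` falls on the next calendar day."""
    next_midnight = (instant_utc + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return next_midnight - instant_utc

updated = datetime.strptime(
    "2020-07-30T12:38:47.347Z", "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=timezone.utc)

print(min_offset_for_next_day(updated))
```

The gap is a little over 11 hours 21 minutes, so even a clock on Tokyo time (UTC+9) would still have read 07-30 at that moment.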

jzohrab commented 4 years ago

I'm not sure what is happening in the code that is causing this, which doesn't fill me with confidence! The only thought I have here is that the lambda doing the scraping is running in a different timezone, and so assigning a different date. I can't see how it's in such an advanced timezone. Unfortunately our logging is inadequate at the moment, so I can't see how this was set to the future date.

Regardless, a fix that I implemented recently should result in the data having the actual date specified in the data files. I'll keep this issue open until we see the change in effect.

@jingjtang - I'll assign this to you as well to do the check in a couple of days. I'll check too if I can, though I'm spread thin these days. I'll try clearing out the 07-31 data points for mercury-news, though that's a slow operation. :-)

Thanks again @jingjtang for the issue.

jzohrab commented 4 years ago

I've also pushed #365 to staging and prod, which had the same forward-dating bug. I believe that this will fix the issue. It may take a couple of days for us to know.