covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Minnesota, Montana, Indiana numbers #483

Open BartJohnson opened 4 years ago

BartJohnson commented 4 years ago


Here is an example of the last 10 days of case numbers for a county in Minnesota. The numbers are constant, go up, go down, and then go up again. How can the numbers go down? Do you not update the numbers every day?

Minnesota fips:27163 Washington County, Minnesota, United States 266 266 266 266 266 266 553 266 266 576

Here's another example:

Minnesota fips:27145 Stearns County, Minnesota, United States 1512 1512 1512 1512 1512 1512 1959 1512 1512 1995

Did something different happen on 5/29/2020 (the last number) and on 5/26/2020 (the third-to-last)? I seem to be seeing that pattern in most of the data sets.

jzohrab commented 4 years ago

Hi @BartJohnson , thanks for the issue. Checking.

Confirmed for MN Washington county (https://coronadatascraper.com/#timeseries-byLocation.json):

[screenshot: Washington County, MN timeseries showing the jumps]

jzohrab commented 4 years ago

Reason for the data jumping:

Our primary data source for, say, Minnesota is the state data scraper (ref: source code). Per the code, that scraper hits an ArcGIS URL:

https://services1.arcgis.com/RQG3sksSXcoDoIfj/arcgis/rest/services/MN_COVID19_County_Tracking_Public_View/FeatureServer/0/query?f=json&where=1%3D1&returnGeometry=false&outFields=*

We hit that URL and cached it successfully on 2020-05-28, with this record:

{"attributes":{"OBJECTID":49,"COUNTYFIPS":27163,"CTY_NAME":"Washington","COVID19POS":266, ...

On 2020-05-29, we don't have a cached file for that URL, so either the site was down, there were network issues, or something went wrong during download or processing. As a result, we fell back to the next available scraper for that location, which happened to be our New York Times dataset. They publish their data in https://github.com/nytimes/covid-19-data, and the raw file at https://raw.githubusercontent.com/nytimes/covid-19-data had the following record:

2020-05-28,Washington,Minnesota,27163,576,30
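To make the fallback concrete, here's a rough sketch of the priority logic (illustrative only, with a stand-in file-based cache lookup rather than our real cache layer):

```js
// Illustrative sketch of the source-priority fallback (not the project's actual code).
const fs = require('fs')

// Preferred source: the cached ArcGIS response for the day, if that crawl succeeded.
function fromStateCache (cachePath, fips) {
  if (!fs.existsSync(cachePath)) return null          // no cache file for that day
  const { features } = JSON.parse(fs.readFileSync(cachePath, 'utf8'))
  const hit = features.find(f => f.attributes.COUNTYFIPS === fips)
  return hit && { cases: hit.attributes.COVID19POS, source: 'MN state dashboard' }
}

// Fallback source: a NYT county row (columns: date,county,state,fips,cases,deaths).
function fromNytRow (csvLine) {
  const [date, county, state, fips, cases] = csvLine.split(',')
  return { cases: Number(cases), source: 'nytimes/covid-19-data' }
}

// With no cached file for 2020-05-29, Washington County falls through to the NYT row:
const result =
  fromStateCache('cache/2020-05-29-mn.json', 27163) ||
  fromNytRow('2020-05-28,Washington,Minnesota,27163,576,30')

console.log(result)   // => { cases: 576, source: 'nytimes/covid-19-data' }
```

That's why the last value in the Washington County series above jumps to 576: it came from the NYT row rather than the state dashboard.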

This isn't ideal, of course! But I'm not sure exactly what we can do about it at the moment with this architecture.

In our new codebase (Li, in this GitHub org), we do crawls much more frequently, so source data should be more stable, and hopefully we won't have as many holes in our cache for the sources.

jzohrab commented 4 years ago

(fyi @lazd and @ryanblock on the above)