Upstream data changes break our regional code - Colombia, Cuba, India, United States

epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time-series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources as well as linelist data and links to intervention data sets.

https://epiforecasts.io/covidregionaldata/

Other

Upstream data changes break our regional code - Colombia, Cuba, India, United States #430

Closed RichardMN closed 2 years ago

RichardMN commented 3 years ago

Checking my regular graph generation run I see that some of my graphs are not being updated, which made me check which ones and why. (Check modification dates in https://github.com/RichardMN/covidregionaldatagraphs/tree/master/extra/output/images)

For some of the countries below, it looks as though the data stream has terminated; for others, we will need to look more closely at what has changed in the data and whether it is likely to resume. In two cases (India, United States) I think the data is still available and we just need to adjust how we download and process it.

[ ] Colombia - https://github.com/danielcs88/colombia_covid-19/ - last updated 31 July
[ ] Cuba - https://covid19cubadata.github.io/#cuba - appears to be a change in how deaths are being reported, this entry link says that data will be updated "in a few days" but the latest data accessible through our functions is from 4 July
[ ] India - https://data.covid19india.org - changes to the data available, with a break after 10 October; our code would need to adjust to download separate files and assemble them (roughly as I think we do with UK data)
[x] United States - https://github.com/nytimes/covid-19-data - check the README.md - The New York Times has changed how they present some data and this may also mean we need to download separate files and then assemble them
[ ] Netherlands - something is broken with Hospitalisation data but I don't know what yet
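
For the sources that may now be split across several files (India, United States), one option is to read each file and stack the results. A minimal sketch, assuming every file shares the same columns; `assemble_files()` is a hypothetical helper, not an existing covidregionaldata function, and the temporary files with toy values stand in for real downloads:

```r
# Hypothetical helper: read several CSV files that share the same columns
# and stack them into one data frame. `paths` may be local paths or URLs.
assemble_files <- function(paths, reader = utils::read.csv) {
  parts <- lapply(paths, reader)
  combined <- do.call(rbind, parts)
  rownames(combined) <- NULL
  combined
}

# Toy example: two temporary files standing in for per-period downloads.
tmp <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(date = "2021-10-10", region = "A", cases = 5),
          tmp[1], row.names = FALSE)
write.csv(data.frame(date = "2021-10-11", region = "A", cases = 7),
          tmp[2], row.names = FALSE)

combined <- assemble_files(tmp)
```

If the files differed in their columns, `rbind()` would fail, so a real implementation would probably need to harmonise column names first.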

RichardMN commented 3 years ago

I've done some looking at Colombia. We were relying on @danielcs88's code, which uses Python to grab a massive case list and aggregate it. This appears to have stopped working in late July.

Going further upstream, the Colombian government is making this case list available through a Socrata API. This claims that an API key is required but (thankfully) it appears that we can make simple queries to narrow the data returned.

The following reprex lets us get cases (by diagnosis date) down to level 2.

library(RSocrata)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

cases <- read.socrata("https://www.datos.gov.co/resource/gt2j-8ykr.json?$select=departamento_nom,ciudad_municipio_nom,fecha_diagnostico")

cases_aggregate <- cases %>%
  rename(department = departamento_nom,
         municipality = ciudad_municipio_nom,
         date = fecha_diagnostico) %>%
  mutate(date = as_date(dmy_hms(date))) %>%
  group_by(date, department, municipality) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(date)

Created on 2021-11-02 by the reprex package (v2.0.1)
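
The query above narrows the response with Socrata's `$select` parameter. The same approach should also allow filtering rows with `$where` or capping the result with `$limit`. A rough sketch of building such URLs; `build_soql_url()` is a hypothetical helper, the department value is only illustrative, and I have not checked how `read.socrata()` treats the percent-encoded form:

```r
# Hypothetical helper: assemble a Socrata (SoQL) query URL from its parts.
# Parameter values are percent-encoded; parameter names are left as-is.
build_soql_url <- function(base_url, select, where = NULL, limit = NULL) {
  params <- c(`$select` = paste(select, collapse = ","))
  if (!is.null(where)) params <- c(params, `$where` = where)
  if (!is.null(limit)) {
    params <- c(params, `$limit` = format(limit, scientific = FALSE))
  }
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base_url, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

colombia_base <- "https://www.datos.gov.co/resource/gt2j-8ykr.json"
url <- build_soql_url(
  colombia_base,
  select = c("departamento_nom", "ciudad_municipio_nom", "fecha_diagnostico"),
  where = "departamento_nom = 'ANTIOQUIA'",  # assumed value, for illustration
  limit = 1000
)
```

Restricting the query to one department at a time might keep each download small, at the cost of more requests.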

RichardMN commented 3 years ago

Making a checklist of countries to fix:

[x] USA - turns out not to be an issue
[ ] Colombia - fix in #433
[ ] Cuba
[x] India - we have not found a replacement source
[ ] Netherlands - fix in #446

RichardMN commented 3 years ago

Our data source for India stopped updating on 31 October. I have not found a replacement yet.

danielcs88 commented 3 years ago

Just seeing this now @RichardMN, been too busy with school. I stopped running it in July because the calculations for reproductive number would fail repeatedly. Since I didn't write the code for that, I couldn't find a way to fix it, but my code to source the data and shape it into the same format as the NY Times data seems to be still running fine. Just ran it right now, and although painfully slow (the API to download the data), it still updates and formats the data fine.

I didn't try to run the Rt code though, from what I remember it would take at least an hour to run.

RichardMN commented 3 years ago

Hi @danielcs88 - thanks for commenting.

This package for R is about doing data cleaning. @epiforecasts has another package called EpiNow2 which can do the R_t calculations. I am not familiar with the guts of these calculations but I use EpiNow2 to generate municipality-by-municipality calculations for 60 areas in Lithuania. On a five-year-old Mac mini that currently takes a few hours, using half the CPU and only going back about 6 weeks. If you have a machine you can run R on I'm happy to share my (rather clunky) workflow, which feeds into http://projects.martin-nielsen.ca/Graphs/COVID19-Lithuania-Municipalities.html (I should clean and share this workflow anyway...)

All to say that I think that if my PR is approved here we will stop leaning on your Python processing of the Colombia data.

danielcs88 commented 3 years ago

@RichardMN hopefully your PR gets approved! I wish I had proficiency in R, I understand it in a basic sense but nothing in a productive sense. If anything from what I have read and seen, R is better for what I use Python for, which is mostly data analysis.

kathsherratt commented 3 years ago

Thank you for flagging this and solving for Colombia @RichardMN !

I had a quick look at the India data source and it's definitely stopped updating altogether with no plans to return. I can't immediately find an obvious alternative either.

On that though I am quite keen to implement #406, to source subnational data from the Google API when we don't have a direct source (or as a backup if a direct source breaks). For our own use case, I think that maintaining continuity is really important, even if there's a bit less visibility on how Google source that data. So personally I would prioritise doing this first before re-instating direct sources for currently broken countries - although obviously both would be good! I will have a look at this today, help vastly appreciated as always!
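
The "backup if a direct source breaks" idea could look roughly like the sketch below. This is only an illustration of the pattern: `fetch_with_fallback()` and the two stand-in functions are hypothetical, and nothing here reflects how #406 is or will be implemented.

```r
# Hypothetical sketch: try a direct source first and fall back to a backup
# source if the direct download errors. Both arguments are functions that
# return a data frame or throw an error.
fetch_with_fallback <- function(primary, backup) {
  tryCatch(
    list(data = primary(), source = "primary"),
    error = function(e) {
      message("Primary source failed: ", conditionMessage(e))
      list(data = backup(), source = "backup")
    }
  )
}

# Stand-ins for real download functions (toy values only).
broken_direct <- function() stop("upstream file not found")
google_backup <- function() {
  data.frame(date = as.Date("2021-11-01"), cases_new = 10)
}

result <- fetch_with_fallback(broken_direct, google_backup)
```

Recording which source was used (`result$source`) keeps some visibility on where the data came from when the fallback kicks in.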

(Re Rt calculations: we publish updating subnational Rt estimates here, with accompanying data repo, in case it's helpful to see estimates / other R code.)

Kath

github-actions[bot] commented 2 years ago

This issue has been flagged as stale due to lack of activity

Bisaloo commented 2 years ago

Hi @RichardMN, thanks for your continued attention to these issues.

I just reviewed your PR #431 (clean diff against current master here) and it seems to change the data source from raw data to pre-processed data (rolling-average folder), which doesn't seem like something we want.

I visited the NYT repo and it actually doesn't look like the data source we're currently using is going away. It's still updating daily. I did read the announcement in their README and I think it's worded in a confusing way:

"UPDATE: The county-level data for cases and deaths that includes seven-day averages and per 100,000 counts is now available in year-based files here. The us-counties.csv file in that directory containing county data since the beginning of the pandemic has been archived and will not be updated."

The key part here is "in that directory" (i.e., rolling-averages). Since we're not pulling data from this folder, we are fine.
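
A quick way to confirm that a source like this is still updating is to compare its most recent date with today's date. A minimal sketch; `is_stale()` is a hypothetical helper and the seven-day threshold is an arbitrary assumption:

```r
# Hypothetical helper: flag a source as stale when its latest date is more
# than `max_age_days` before `today`.
is_stale <- function(dates, max_age_days = 7, today = Sys.Date()) {
  latest <- max(as.Date(dates), na.rm = TRUE)
  as.numeric(today - latest) > max_age_days
}

# Toy dates: a source updated two days ago, and one that stopped in July.
fresh <- is_stale(c("2021-10-30", "2021-10-31"), today = as.Date("2021-11-02"))
old <- is_stale(c("2021-07-30", "2021-07-31"), today = as.Date("2021-11-02"))
```

Here `fresh` is `FALSE` and `old` is `TRUE`. In practice `dates` would be the date column of whatever file we download.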

RichardMN commented 2 years ago

"The key part here is 'in that directory' (i.e., rolling-averages). Since we're not pulling data from this folder, we are fine."

Thanks. Honestly, it's been three months since I've looked at this and I would trust your judgement on this. We can close the PR without merging.

RichardMN commented 2 years ago

Noting here that #446 is closed, but there seems to be a (new?) data source at RIVM which looks as though it may give us hospitalization data separately: https://data.rivm.nl/covid-19/COVID-19_ziekenhuisopnames.csv

Metadata for that at https://data.rivm.nl/meta/srv/dut/catalog.search#/metadata/4f4ad069-8f24-4fe8-b2a7-533ef27a899f

I think that figuring out what that data is, and whether/how to integrate it back in with the other streams we have, can wait for 0.9.4.
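
As a starting point for that, here is a rough sketch of reading a semicolon-delimited file and summing admissions by date. The delimiter and the column names (`Date_of_statistics`, `Municipality_name`, `Hospital_admission`) are assumptions to be checked against the metadata linked above, and the inline rows are made up for illustration:

```r
# Made-up rows standing in for the RIVM file. The separator and column
# names are assumptions, not confirmed against the real data.
sample_csv <- "Date_of_statistics;Municipality_name;Hospital_admission
2021-11-01;A;2
2021-11-01;B;3
2021-11-02;A;1"

hosp <- read.csv(text = sample_csv, sep = ";")

# Sum admissions across municipalities for each date.
daily <- aggregate(Hospital_admission ~ Date_of_statistics,
                   data = hosp, FUN = sum)
daily$Date_of_statistics <- as.Date(daily$Date_of_statistics)
```

To try this on the real file, `text = sample_csv` would be swapped for the URL above, once the actual columns are confirmed.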