epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time-series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources as well as linelist data and links to intervention data sets.
https://epiforecasts.io/covidregionaldata/
Other
37 stars 18 forks source link

Add subnational data for Vietnam #413

Closed biocyberman closed 2 years ago

biocyberman commented 2 years ago

As the title says, I would like to add the subnational data for Vietnam. The covid situation their needs closer look into subnational level. This tool would be very helpful. Thanks

RichardMN commented 2 years ago

I’ll try to make some detailed comments later today or tomorrow but there are two things which I think will need to be looked at based on my very superficial first glance:

biocyberman commented 2 years ago

Hi @RichardMN Thanks for checking this out. And the detailed review so far. I will be happy to answer all of the during tomorrow. So just a quick answer below for now.

I’ll try to make some detailed comments later today or tomorrow but there are two things which I think will need to be looked at based on my very superficial first glance:

* Do we need to add  `jsonlite` as a dépends in `DESCRIPTION`? I suspect we’re already implicitly depending on it through something else. I’d looked at this when considering an alternate approach for German data [Switch to REST approach for downloading Germany date - possible fix for #408 #410](https://github.com/epiforecasts/covidregionaldata/pull/410) but then the German data problem resolved itself

The data source emits json data, so I used jsonlite. It is therefore a good idea to add it to DESCRIPTION.

* I think the download work should be done in an overclassed `download` method instead of `clean_common` (See that closed PR)

I will take a look at that. That would also explain why I got weird behavior of covidregionaldata::get_regional_data where it downloaded data from common_data_urls.

RichardMN commented 2 years ago

One more fiddly - add "Nam" into the WORDLIST file. This will avoid the spellchecker complaining about it being an odd word.

biocyberman commented 2 years ago

My force push may cause extra work for reviewing. Sorry for that.

biocyberman commented 2 years ago

FWIW, I created a simplified version of the front end here for focusing on Vietnam data: https://biocyberman.github.io/covid

biocyberman commented 2 years ago

Thank you for all the work to integrate Vietnam into covidregionaldata

It's my pleasure and out of necessity to be honest :)

When I look at the map of where we have sub-national data and coverage and where we don't there are lots of white spaces and it's important to fill them in, which can encourage people to use the data for all sorts of purposes.

Do you mean that there are subnational regions that do not have data? I am trying to push to have complete coverages But there are multiple administration layers we have to get through and get that done.

As I noted in another note, this is similar in some ways to work I did in #410 and I think that the package as a whole should be able to use JSON inputs. I think a lot of this code can probably just be moved into a download method and then it may play more nicely with the rest of the class.

Like I answered in other replies, I refactored the download part to download method.

The other thing to look at is running the tests and R-CMD check on the package, and looking at the outputs of the test builds done by github (on MacOS, linux and Windows). I only have access to one OS at home but sometimes the errors from other platforms have shown me glitches which my home system was protecting me from. I haven't written any test for this, and can you elaborate more on R-CMD command to use? Nonetheless, I have built the package and installed in on my local PC to use for other packages (e.g. covid-rt-estimate, covid (front-end))

RichardMN commented 2 years ago

When I look at the map of where we have sub-national data and coverage and where we don't there are lots of white spaces and it's important to fill them in, which can encourage people to use the data for all sorts of purposes.

Do you mean that there are subnational regions that do not have data? I am trying to push to have complete coverages But there are multiple administration layers we have to get through and get that done.

My remark was about our world map which we put into our logo - we have coverage for several European and American countries, but very few in Asia or Africa. Adding Vietnam is a step towards addressing this.

I haven't written any test for this, and can you elaborate more on R-CMD command to use? Nonetheless, I have built the package and installed in on my local PC to use for other packages (e.g. covid-rt-estimate, covid (front-end))

Try running devtools::check() on the package. This will run a number of tests (which GitHub runs on the PR automatically) and identify problems which probably need to be resolved before the PR can be merged. (For various reasons, including that the package will get booted from cran if the issues are present in the code, and it took some effort to get it back onto cran.)

seabbs commented 2 years ago

@biocyberman just a note to say thanks for this and great job porting our website and backend code over for Vietnam. Will look at integrating that other work tomorrow (hopefully).

RichardMN commented 2 years ago

So, a bit of exploratory work.

There is now a branch here called json-reader which has a very straightforward implementation of a drop-in download_JSON.

There is also a branch called json-reader-vietnam which uses that implementation and the bulk of @biocyberman's Vietnam code to start using that approach to pull in data.

This gets me to the point where I have four lists of lists (of lists?):

The first three are list of 63 lists (one for each province), where the items of the list are the date (as a name) and a value (as an integer):

library(covidregionaldata)
vn <- covidregionaldata::Vietnam$new()
vn$download()
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=case_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=death_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=recovered_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces
vn$data$raw$main[[1]][[1]][515:519]
#> $`13/9/2021`
#> [1] 4310
#> 
#> $`14/9/2021`
#> [1] 4331
#> 
#> $`15/9/2021`
#> [1] 4345
#> 
#> $`16/9/2021`
#> [1] 4360
#> 
#> $`17/9/2021`
#> [1] 4360

Created on 2021-09-16 by the reprex package (v2.0.1)

This is where someone else's R skills will outstrip mine to figure out how to convert each of these lists into a table of date and count (with appropriate column label), then apply the number index of that list as the province code (possibly prefixing it with VN-), and then bind the rows, and then full join the three resulting tables. I'm sure I could stare at the dplyr crib sheet for a while but I don't think I can do that tonight.

biocyberman commented 2 years ago

@biocyberman just a note to say thanks for this and great job porting our website and backend code over for Vietnam. Will look at integrating that other work tomorrow (hopefully).

Great to hear. Thank you for yours and the team's work :+1:

biocyberman commented 2 years ago

So, a bit of exploratory work.

There is now a branch here called json-reader which has a very straightforward implementation of a drop-in download_JSON.

There is also a branch called json-reader-vietnam which uses that implementation and the bulk of @biocyberman's Vietnam code to start using that approach to pull in data.

Great. I will take a look at this and incorporate into the code.

This gets me to the point where I have four lists of lists (of lists?):

* main (cases by date)

* death_by_time

* recovered_by_time

* provinces

The first three are list of 63 lists (one for each province), where the items of the list are the date (as a name) and a value (as an integer):

library(covidregionaldata)
vn <- covidregionaldata::Vietnam$new()
vn$download()
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=case_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=death_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=recovered_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces
vn$data$raw$main[[1]][[1]][515:519]
#> $`13/9/2021`
#> [1] 4310
#> 
#> $`14/9/2021`
#> [1] 4331
#> 
#> $`15/9/2021`
#> [1] 4345
#> 
#> $`16/9/2021`
#> [1] 4360
#> 
#> $`17/9/2021`
#> [1] 4360

Created on 2021-09-16 by the reprex package (v2.0.1)

This is where someone else's R skills will outstrip mine to figure out how to convert each of these lists into a table of date and count (with appropriate column label), then apply the number index of that list as the province code (possibly prefixing it with VN-), and then bind the rows, and then full join the three resulting tables. I'm sure I could stare at the dplyr crib sheet for a while but I don't think I can do that tonight.

Yes it was challening to get the final data.frame with dplyr. So for the time being, I used the rather whole school way todo it from #L74 - #L96 here: https://github.com/biocyberman/covidregionaldata/blob/vietnam/R/Vietnam.R#L74 Province code is stored in covidregionaldata::vietnam_codes.

RichardMN commented 2 years ago

Here's a version using purrr. It could still use lubridate to be applied to the date field.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(purrr)
vn <- covidregionaldata::Vietnam$new()
vn$download()
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=case_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=death_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces?filter_type=recovered_by_time
#> Downloading data from https://covid.ncsc.gov.vn/api/v3/covid/provinces
vn_data_inputs <- vn$data$raw[1:3]
vn_flat_all <- map(map(vn_data_inputs, function(x) as_tibble(unlist(x), rownames="date")),
                   function(x) {x %>% separate(date, sep="[.]+", into=c(NA, "province", "date"))})
full_join(
  full_join(
    vn_flat_all$case_by_time, vn_flat_all$death_by_time, by=c("province","date"),suffix=c(".cases", ".deaths"),copy=TRUE),
  vn_flat_all$recovered_by_time, by=c("province","date"), suffix=c("",".recovered"), copy=TRUE) %>%
  select(province, date, cases = value.cases, deaths=value.deaths, recovered = value)
#> # A tibble: 32,464 × 5
#>    province date      cases deaths recovered
#>    <chr>    <chr>     <int>  <int>     <int>
#>  1 1        18/3/2020     0      0         0
#>  2 1        19/3/2020     0      0         0
#>  3 1        20/3/2020     0      0         0
#>  4 1        21/3/2020     0      0         0
#>  5 1        22/3/2020     0      0         0
#>  6 1        23/3/2020     0      0         0
#>  7 1        24/3/2020     0      0         0
#>  8 1        25/3/2020     0      0         0
#>  9 1        26/3/2020     0      0         0
#> 10 1        27/3/2020     0      0         0
#> # … with 32,454 more rows

Created on 2021-09-17 by the reprex package (v2.0.1)

RichardMN commented 2 years ago

And the json-reader-vietnam branch now appears to have a fully "tidy" approach to cleaning. (Tidy, but not entirely pretty in some places.) Messy as it is, I still think it's easier to follow. There are likely some glitches of treatment of region codes.

biocyberman commented 2 years ago

Today I think I/we/@biocyberman can try to write a generic JSON downloader which works for Vietnam and acts like the CSV downloader we have. Making it more flexible can be a next step, with the condition that it doesn’t break the Vietnam code.

@RichardMN I looked at json_reader and json_read_vietnam branches. There would be a couple of errors in the new Vietnam.R and I can fix quickly.. However, the json_reader branch need to be merged to master first before I can update this PR to incorporate the json_reader feature. Otherwise, we can merge this PR and use json_reader later.

RichardMN commented 2 years ago

@RichardMN I looked at json_reader and json_read_vietnam branches. There would be a couple of errors in the new Vietnam.R and I can fix quickly.. However, the json_reader branch need to be merged to master first before I can update this PR to incorporate the json_reader feature. Otherwise, we can merge this PR and use json_reader later.

Later today I will bring across the functional parts of the json-reader branch into this PR, and then it can go into master from this PR. It makes sense to introduce it where we may use it. I will put the changes in and @biocyberman can review and check the sanity on the data. I think it better to have the cleaner approach in the master branch from the start, but also to be using the new download_json method when it's added.

If you'd prefer I can make a separate PR for the json-reader branch but I think the code's been pretty much reviewed and it can go from here.

biocyberman commented 2 years ago

@RichardMN I looked at json_reader and json_read_vietnam branches. There would be a couple of errors in the new Vietnam.R and I can fix quickly.. However, the json_reader branch need to be merged to master first before I can update this PR to incorporate the json_reader feature. Otherwise, we can merge this PR and use json_reader later.

Later today I will bring across the functional parts of the json-reader branch into this PR, and then it can go into master from this PR. It makes sense to introduce it where we may use it. I will put the changes in and @biocyberman can review and check the sanity on the data. I think it better to have the cleaner approach in the master branch from the start, but also to be using the new download_json method when it's added.

If you'd prefer I can make a separate PR for the json-reader branch but I think the code's been pretty much reviewed and it can go from here.

In that case I think we can do both json_reader and Vietnam.R in this PR, both to give context and to avoid re-review the json_reader PR.

biocyberman commented 2 years ago

So I've been through all the changes. I also ran covid-rt-estimates package on top of this new version of covidregionaldata. It works. So a green light from me :)

biocyberman commented 2 years ago

@RichardMN and @seabbs I think this PR got stuck because of the failure with codecov/patch. I am not familiar its working and therefore would not make it unstuck quickly. Can you give a push there? Thanks

seabbs commented 2 years ago

Merged in the changes from master, bumped version, updated news, added @biocyberman as a contributor, and ran tests.

biocyberman commented 2 years ago

Great to see this get merged. Thank you for your support @RichardMN, @seabbs @Bisaloo