epiforecasts / covidregionaldata

An interface to subnational and national level COVID-19 data. For all countries supported, this includes a daily time-series of cases. Wherever available we also provide data on deaths, hospitalisations, and tests. National level data is also supported using a range of data sources as well as linelist data and links to intervention data sets.
https://epiforecasts.io/covidregionaldata/

Italian COVID-19 Integrated Surveillance Data #463

Open InterdisciplinaryPhysicsTeam opened 2 years ago

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi all,

First of all, thank you very much for developing and maintaining this very useful package of national and sub-national COVID-19 incidence data.

We have read the Development section where you write:

We welcome contributions and new contributors! We particularly appreciate help adding new data sources for countries at sub-national level, [...]

then, exploring the Wiki, we have read the following recommendation:

If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed.

We have therefore opened this preliminary issue to ask whether you think it would be helpful to include the Italian COVID-19 integrated surveillance data, which we have recently obtained authorisation to publish, containing:

Contacts

Author            GitHub         Twitter
Pietro Monticone  @pitmonticone  @PietroMonticone
Claudio Moroni    @ClaudMor      @Claudio__Moroni

github-actions[bot] commented 2 years ago

Thanks for opening an issue! We'll try and get back to you shortly. If you've identified an issue and would like to fix it please see our contribution guidelines.

RichardMN commented 2 years ago

I'll hop in quickly with a question about how the data you've put together (which is impressive) compares with the aggregated data which the package currently draws from the Department of Civil Protection (https://github.com/pcm-dpc/COVID-19/blob/master/README_EN.md).

The level of disaggregation (gender, age cohort) that you have is more fine-grained than most of the data coming out of covidregionaldata but in some cases we are aggregating across gender and age cohort to get the regional/sub-regional data we have. (I think we do this in Lithuania, at least. I think Germany we are working from a line list.) So I'm not sure whether covidregionaldata has a framework to deal with the sub-population indices. But before we get to that there's the question of if we aggregate across these indices in your data, how does it compare with what we're getting from the Department of Civil Protection?

We recently moved from one Swiss data source to another. We have not [yet] put in a standard way to let users choose between two different datasets (though I think this is sort of possible within the UK data).

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN,

if we aggregate across these indices in your data, how does it compare with what we're getting from the Department of Civil Protection?

The main difference between the two is the reference date. The integrated surveillance data from the Italian National Institute of Health, which we update on a weekly basis, contains incidences organised by date of key event. The surveillance data from the Italian Department of Civil Protection, which they update on a daily basis, is organised by date of notification and is therefore affected by the typical problem of time-varying reporting delays.

For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.
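
To make the distinction concrete, here is a minimal sketch in base R. The line list, its column names (`onset`, `notified`) and all values are made up for illustration; neither data source is published in this form.

```r
# Hypothetical line list: one row per case, with the date of symptom onset
# (a "key event") and the date the case was notified. All values are made up.
linelist <- data.frame(
  onset    = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02",
                       "2020-03-02", "2020-03-03")),
  notified = as.Date(c("2020-03-03", "2020-03-06", "2020-03-04",
                       "2020-03-09", "2020-03-05"))
)

# Incidence by date of key event (the integrated surveillance convention).
by_event <- table(linelist$onset)

# Incidence by date of notification (the Civil Protection convention).
by_notification <- table(linelist$notified)

# Reporting delay per case, in days. When this delay varies over time, the
# notification curve is a shifted and smeared version of the event curve.
delay_days <- as.integer(linelist$notified - linelist$onset)
```

Both tables count the same five cases, but they assign them to different days, which is why the two series should not be expected to match day by day.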

CC: @ClaudMor, @pitmonticone

ClaudMor commented 2 years ago

Hello,

Would you have any update on this?

RichardMN commented 2 years ago

Things have been a bit more hectic for the past couple of weeks and I haven't yet set aside an evening to write this code. Sorting out how to switch between two data sources is going to be a bit fiddly (I suppose I'll probably look at what is done for the UK example), and this is probably why I've not written a drop-in replacement yet. I think that other contributors have also been focussed on other projects related to now- and forecasting.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN, thanks for your reply.

We're certainly willing to help you with the logistics if needed: if you tell us the proper format we could make an additional folder in our repository with the data in the requested format.

RichardMN commented 2 years ago

So here am I with a suggestion, having had a bit of a look at the data.

It would be a lot simpler if the data were in 'tidy' format.

Roughly, this might look like:

date        region   gender  age_cohort  indicator  count
2020-01-15  Abruzzo  M       10_19       deceased   15

[fictional data - I haven't checked what the real numbers would be]

Whether you prefer column names (and region names) in Italian, all lower case, or otherwise, that can all be worked around.

This will make for one very long (as opposed to wide) CSV, but much easier to filter and much easier for our code to aggregate. (And it means not writing code to download 20 x (4 or 5) different separate CSV files, then glue them together, then flatten them, ... which I can do but I'm not looking forward to.)

covidregionaldata is going to squash the age cohorts and the gender data - the package isn't set up to reveal that detail (which is available from some of our other sources). But if you present your data in this "tidy" form it may make your data more accessible for R-minded data scientists who want to try working through all of it.
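
As a rough illustration of the wide-to-long reshaping described above, here is a sketch using `tidyr::pivot_longer()`. The wide layout (one column per indicator) and all values are assumptions made for the example; the actual layout of the original CSV files may differ.

```r
library(tidyr)

# Hypothetical wide table: one column per indicator (values are made up).
wide <- data.frame(
  date      = as.Date(c("2020-03-01", "2020-03-02")),
  region    = c("Abruzzo", "Abruzzo"),
  confirmed = c(10, 12),
  deceased  = c(1, 0)
)

# Reshape to one row per (date, region, indicator): the "tidy" long layout.
long <- pivot_longer(
  wide,
  cols      = c(confirmed, deceased),
  names_to  = "indicator",
  values_to = "count"
)

# Filtering is now a single condition on the indicator column.
deceased_only <- long[long$indicator == "deceased", ]
```

The long table has more rows, but every downstream filter or aggregation works on the same four columns regardless of how many indicators exist.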


InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN, here is the tidy version of our dataset following your suggestion.

Could you tell us whether you think it looks fine? If so, we will notify you here when we merge it into the main branch.

RichardMN commented 2 years ago

Looks good. Below is a quick reprex for pulling it into R, aggregating it (as we will inside the package) and plotting it.

You have saved me at least an hour of painful url-hackery.

I've not started doing logical tests against it, but in terms of making something which is going to be straightforward to pull into covidregionaldata, thank you very much!

library(vroom)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

it_inphyt_data <- vroom::vroom("https://github.com/InPhyT/COVID19-Italy-Integrated-Surveillance-Data/raw/use_initial_conditions/epiforecasts_covidregionaldata/COVID19-Italy-Integrated-Surveillance-Data.csv")
#> Rows: 674503 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (4): region, gender, age_cohort, indicator
#> dbl  (1): count
#> date (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
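# Sum counts across gender and age cohort, keeping date, region and indicator.
it_agg_data <- it_inphyt_data %>%
  group_by(date, region, indicator) %>%
  summarise(across(where(is.double), sum), .groups = "drop")

# Plot confirmed cases over time, one line per region.
it_agg_data %>%
  filter(indicator == "confirmed") %>%
  ggplot(aes(x = date, y = count, colour = region)) +
  geom_line() +
  theme_minimal()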

Created on 2022-03-14 by the reprex package (v2.0.1)

RichardMN commented 2 years ago

Back with more questions, some of which may take a bit of digging.

What is the care indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both hospitalized and care?

What is your preferred count between symptomatic and confirmed? I think in most other series we're using confirmed but the delay between one and the other may be significant, and tracking asymptomatic but confirmed may be useful too. I can write code to choose between which definition is used (see what the Lithuania code offers for three different criteria for attributing death to COVID) but right now I'm trying to get something running.

Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.) It'll have to be amended to match what your region identifiers are but I'm not that familiar with our Italy code so don't quite know why we do this. I wonder if it may be that the two regions share an ISO-3166 code and so they get merged together because in many of our other usages we depend on the ISO-3166 being a unique identifier for regions.

        mutate(level_1_region = recode(.data$level_1_region,
          "P.A. Trento" = "Trentino-Alto Adige",
          "P.A. Bolzano" = "Trentino-Alto Adige"
        )) %>%

For now, #464 is a first write-through of an alternate implementation of the Italy code which uses the InPhyT data. I'll make a PR here and would welcome someone else poking it a bit. Later this week I may try putting in:

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN, thanks for your feedback and your questions.

What is the care indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both hospitalized and care?

We've renamed care with the more explicit ICU_admission and hospitalized with the more explicit ordinary_hospital_admission in our dataset (temporary branch). If a patient is admitted to hospital and goes immediately into ICU, they will not be counted both as ordinary_hospital_admission and ICU_admission, but exclusively as ICU_admission.

What is your preferred count between symptomatic and confirmed?

We have no preferred count between confirmed (confirmed cases by date of diagnosis) and symptomatic (symptomatic cases by date of symptoms onset). It crucially depends on your specific research goal. It might be useful to write some code to easily choose between the two options.
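
One possible shape for such a choice is sketched below in base R. The helper name `select_cases()` and its `case_definition` argument are hypothetical and not part of covidregionaldata; the toy rows are made up.

```r
# Hypothetical helper: pick which indicator is treated as "cases".
# The first value in the vector ("confirmed") acts as the default.
select_cases <- function(data,
                         case_definition = c("confirmed", "symptomatic")) {
  # match.arg() rejects anything outside the allowed definitions.
  case_definition <- match.arg(case_definition)
  data[data$indicator == case_definition, , drop = FALSE]
}

# Toy long-format rows (made-up values).
toy <- data.frame(
  date      = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  indicator = c("confirmed", "symptomatic", "confirmed"),
  count     = c(5, 3, 7)
)

confirmed_cases   <- select_cases(toy)                 # default definition
symptomatic_cases <- select_cases(toy, "symptomatic")  # by date of onset
```

Keeping the definition as an explicit argument means the choice is visible at the call site rather than buried in the download code.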

Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.)

        mutate(level_1_region = recode(.data$level_1_region,
          "P.A. Trento" = "Trentino-Alto Adige",
          "P.A. Bolzano" = "Trentino-Alto Adige"
        )) %>%

Yes, this aggregation makes perfect sense since Trentino-Alto Adige is the Italian region made up of the two self-governing provinces of Trento and Bolzano.
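
For illustration, the sketch below applies the recode step quoted above to a toy table and then sums the two provinces into one regional row. The summing step and the `cases_new` column are assumptions for the example; the quoted snippet only shows the relabelling, and the values are made up.

```r
library(dplyr)

# Toy data with the two provinces reported separately (made-up values).
toy <- data.frame(
  date           = as.Date(rep("2020-03-01", 3)),
  level_1_region = c("P.A. Trento", "P.A. Bolzano", "Abruzzo"),
  cases_new      = c(4, 6, 10)
)

merged <- toy %>%
  # Relabel both provinces with the name of the region they belong to.
  mutate(level_1_region = recode(level_1_region,
    "P.A. Trento"  = "Trentino-Alto Adige",
    "P.A. Bolzano" = "Trentino-Alto Adige"
  )) %>%
  # Assumed follow-up step: collapse the duplicated region into one row.
  group_by(date, level_1_region) %>%
  summarise(cases_new = sum(cases_new), .groups = "drop")
```

After the recode alone there would be two rows labelled Trentino-Alto Adige for the same date, so some aggregation of this kind is needed before the region name can act as a unique key.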

Please tell us if any further changes are needed.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN,

Today we've successfully updated our repository merging the new folder epiforecasts_covidregionaldata.

Please don't hesitate to let us know if any further changes are needed.

RichardMN commented 2 years ago

I've adjusted the download URL (twice - I got it wrong the first time). Checks appear to be failing in the GitHub workflow, but I think that may be because there's a problem with the French data right now.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hello @RichardMN,

Is there anything else we can do on our side to facilitate the transition?

Thanks.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN,

We've recently solved a few issues and added one age class so that now we provide the following age classification:

{0_5, 6_12, 13_19, 20_29, 30_39, 40_49, 50_59, 60_69, 70_79, 80_89, 90_+}
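
For anyone who needs to regroup or sanity-check these classes, here is a small base R sketch that splits the labels listed above into numeric bounds. Treating `+` as an open upper bound (stored as `NA`) is an assumption about the label's meaning.

```r
# Age class labels as listed above.
age_classes <- c("0_5", "6_12", "13_19", "20_29", "30_39", "40_49",
                 "50_59", "60_69", "70_79", "80_89", "90_+")

# Split each label at the underscore into a lower and an upper part.
parts     <- strsplit(age_classes, "_", fixed = TRUE)
lower     <- as.integer(vapply(parts, `[`, character(1), 1))
upper_raw <- vapply(parts, `[`, character(1), 2)

# "+" marks an open-ended class; store its upper bound as NA.
upper <- ifelse(upper_raw == "+", NA_integer_,
                suppressWarnings(as.integer(upper_raw)))

age_bounds <- data.frame(age_cohort = age_classes, lower = lower, upper = upper)

# Check that the classes are contiguous: each one starts the year after
# the previous one ends.
contiguous <- all(lower[-1] == upper[-length(upper)] + 1L)
```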

Here is the updated data.

Thanks.

RichardMN commented 2 years ago

Hi @InterdisciplinaryPhysicsTeam - thank you for the various updates.

There are two slightly interrelated issues. I am not a maintainer of this package and so I cannot apply changes.

1. The package appears to be moving towards senescence: many of the upstream sources have stopped updating or moved to frequencies which are no longer useful for the epidemiological work which people want to do with data from covidregionaldata. As a contributor I cannot be sure it's "worth" my time to develop and apply changes which might never be accepted, or which I may be the only person using.

2. On a slightly related point, France changed their data format three weeks ago (#469). This means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get it fixed and applied.

Returning to point 1, I need a sense from @seabbs or @kathsherratt or others whether we're going to try to modularize the package better (so that single country failures don't bork everything else) or just accept that it was very useful for a time but no longer appears to have utility or a market.

This is a bit of a bigger question than belongs in this issue but this appears to be where the conversation might take place.
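
As a very rough sketch of what isolating single-country failures could look like, the base R snippet below wraps each country's download in `tryCatch()`. The `fetchers` list and `fetch_safely()` are hypothetical stand-ins and do not reflect how covidregionaldata is actually structured.

```r
# Hypothetical per-country fetchers. The second one simulates an upstream
# format change that makes its processing code fail.
fetchers <- list(
  italy  = function() data.frame(region = "Abruzzo", cases_new = 10),
  france = function() stop("unexpected column layout")
)

# Run one fetcher and capture any error instead of letting it propagate.
fetch_safely <- function(fetch) {
  tryCatch(
    list(data = fetch(), error = NULL),
    error = function(e) list(data = NULL, error = conditionMessage(e))
  )
}

results <- lapply(fetchers, fetch_safely)

# Names of the countries whose fetch failed; the rest remain usable.
failed <- names(results)[!vapply(results, function(r) is.null(r$error),
                                 logical(1))]
```

With this pattern a broken source shows up in `failed` while the other countries' data are still returned.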

Bisaloo commented 2 years ago

Hi all, and thanks @RichardMN for bringing up this topic.

As mentioned in https://github.com/epiforecasts/covidregionaldata/discussions/459, we are unsure if this package is still used by / useful to anyone. Because of this, most of the contributors have moved on (except @RichardMN, whose heroic efforts to keep this package running need to be highlighted!).

I can help in getting outstanding PRs merged though, if someone feels that something needs updating / fixing.

Two comments:

If necessary, feel free to ping me. I cannot promise I'll always be responsive but I'll try.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN @Bisaloo @pitmonticone @ClaudMor,

Thank you @Bisaloo for your reply.

Regarding this specific issue / PR, I haven't moved just yet because it's still not clear to me if this data is unequivocally better than the previous one or if we should keep both data sources with a switch.

It very much depends on which variables you're interested in and would like to make use of.

The main differences between the integrated surveillance data from the Italian National Institute of Health that we update on a weekly basis here and the surveillance data from the Italian Department of Civil Protection that they update on a daily basis here are the following:

For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.

Bisaloo commented 2 years ago

Okay, I'm quite convinced we need to keep both data sources, with the ability for the user to switch from one to the other.

@RichardMN, are you interested in implementing this or would you like me to do it? No pressure either way.
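
A minimal sketch of such a switch in base R follows. The function `get_italy_data()`, the source labels `"dpc"` and `"iss"`, and the stand-in loaders are all hypothetical; they only illustrate the idea of a user-facing argument and are not the package's API.

```r
# Hypothetical switch between the two Italian data sources.
# "dpc" = Department of Civil Protection, "iss" = National Institute of Health.
get_italy_data <- function(source = c("dpc", "iss"), loaders) {
  source <- match.arg(source)  # the first value, "dpc", is the default
  loaders[[source]]()          # call the loader registered for that source
}

# Stand-in loaders returning made-up rows instead of downloading anything.
toy_loaders <- list(
  dpc = function() data.frame(source = "dpc", date_type = "notification"),
  iss = function() data.frame(source = "iss", date_type = "key event")
)

default_data <- get_italy_data(loaders = toy_loaders)
iss_data     <- get_italy_data("iss", loaders = toy_loaders)
```

Defaulting to the existing source would keep current users' results unchanged while making the integrated surveillance data opt-in.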