Italian COVID-19 Integrated Surveillance Data

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi all,

First of all, thank you very much for developing and maintaining this very useful package of global COVID-19 incidence data at national and sub-national level.

We have read the Development section where you write:

We welcome contributions and new contributors! We particularly appreciate help adding new data sources for countries at sub-national level, [...]

then, exploring the Wiki, we have read the following recommendation:

If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed.

Therefore we have opened this preliminary issue to ask whether you think it would be helpful to include the Italian COVID-19 integrated surveillance data that we have recently been authorised to publish. It contains:

Daily time series of confirmed cases by date of diagnosis stratified by sex and age at the regional level;
Daily time series of symptomatic cases by date of symptoms onset stratified by sex and age at the regional level;
Daily time series of ordinary hospital admissions by date of admission stratified by sex and age at the regional level;
Daily time series of intensive hospital admissions by date of admission stratified by sex and age at the regional level;
Daily time series of deceased cases by date of death stratified by sex and age at the regional level.

Contacts

Author	GitHub	Twitter
Pietro Monticone	@pitmonticone	@PietroMonticone
Claudio Moroni	@ClaudMor	@Claudio__Moroni

github-actions[bot] commented 2 years ago

Thanks for opening an issue! We'll try and get back to you shortly. If you've identified an issue and would like to fix it please see our contribution guidelines.

RichardMN commented 2 years ago

I'll hop in quickly with a question about how the data you've put together (which is impressive) compares with the aggregated data which the package currently draws from the Department of Civil Protection (https://github.com/pcm-dpc/COVID-19/blob/master/README_EN.md).

The level of disaggregation (gender, age cohort) that you have is more fine-grained than most of the data coming out of covidregionaldata but in some cases we are aggregating across gender and age cohort to get the regional/sub-regional data we have. (I think we do this in Lithuania, at least. I think Germany we are working from a line list.) So I'm not sure whether covidregionaldata has a framework to deal with the sub-population indices. But before we get to that there's the question of if we aggregate across these indices in your data, how does it compare with what we're getting from the Department of Civil Protection?

We recently moved from one Swiss data source to another. We have not [yet] put in a standard way to let users choose between two different datasets (though I think this is sort of possible within the UK data).

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN,

if we aggregate across these indices in your data, how does it compare with what we're getting from the Department of Civil Protection?

The main difference between the integrated surveillance data from the Italian National Institute of Health that we update on a weekly basis here and the surveillance data from the Italian Department of Civil Protection that they update on a daily basis here is that the former contains incidences organised by date of key event while the latter by date of notification (affected by the typical problem of time-varying reporting delays).

For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.

CC: @ClaudMor, @pitmonticone

ClaudMor commented 2 years ago

Hello,

Would you have any update on this?

RichardMN commented 2 years ago

Things have been a bit more hectic for the past couple of weeks and I haven't decided to spend an evening writing this code yet. It's going to be a bit picky sorting out how to switch between two data sources (I suppose I'll probably look at what is done for the UK example) and this is probably why I've not written a drop-in replacement yet. I think that other contributors have also been focussed on other projects related to now- and forecasting.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN, thanks for your reply.

We're certainly willing to help you with the logistics if needed: if you tell us the proper format we could make an additional folder in our repository with the data in the requested format.

RichardMN commented 2 years ago

So here I am with a suggestion, having had a bit of a look at the data.

It would be a lot simpler if the data were in 'tidy' format.

Roughly, this might look like:

date	region	gender	age_cohort	indicator	count
2020-01-15	Abruzzo	M	10_19	deceased	15

[fictional data - I haven't checked what the real numbers would be]

If you prefer to have column names (and region names) in Italian, or all lower case, or not, that can all be worked around.

This will make for one very long (as opposed to wide) CSV, but much easier to filter and much easier for our code to aggregate. (And it means not writing code to download 20 x (4 or 5) different separate CSV files, then glue them together, then flatten them, ... which I can do but I'm not looking forward to.)

covidregionaldata is going to squash the age cohorts and the gender data - the package isn't set up to reveal that detail (which is available from some of our other sources). But if you present your data in this "tidy" form it may make your data more accessible for R-minded data scientists who want to try working through all of it.

Edits:

I closed this issue by accident when making this comment, I didn't mean to.
If count is zero then there's no need to have a line for it, it would be implicitly zero. We trade some extra data for each non-zero datum against not storing 0 and field separators as place-holders.
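A minimal sketch of how a consumer could restore the implicit zeros before aggregating, assuming the column layout suggested above. The counts are fictional, and base R is used only so that the snippet runs without extra packages:

```r
# Fictional tidy data in the suggested layout; rows with a zero count are omitted.
tidy <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  region = c("Abruzzo", "Abruzzo", "Abruzzo"),
  gender = c("M", "F", "M"),
  age_cohort = c("10_19", "10_19", "10_19"),
  indicator = c("confirmed", "confirmed", "confirmed"),
  count = c(3, 2, 4),
  stringsAsFactors = FALSE
)

# Build every combination of the key values that appear in the data ...
keys <- c("date", "region", "gender", "age_cohort", "indicator")
grid <- expand.grid(lapply(tidy[keys], unique), stringsAsFactors = FALSE)

# ... then left-join the observed counts and fill the missing ones with zero.
full <- merge(grid, tidy, by = keys, all.x = TRUE)
full$count[is.na(full$count)] <- 0
```

Here the (2020-03-02, F) combination has no row in `tidy`, so it comes back in `full` with a count of zero. Note that only key values seen at least once are expanded; a date with no events at all would still be absent.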

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN, here is the tidy version of our dataset following your suggestion.

Could you tell us if you believe it might be fine? If so, we will notify you here when we merge it into the main branch.

RichardMN commented 2 years ago

Looks good. Below is a quick reprex for pulling it into R, aggregating it (as we will inside the package) and plotting it.

You have saved me at least an hour of painful url-hackery.

I've not started doing logical tests against it, but in terms of making something which is going to be straightforward to pull into covidregionaldata, thank you very much!

library(vroom)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

it_inphyt_data <- vroom::vroom("https://github.com/InPhyT/COVID19-Italy-Integrated-Surveillance-Data/raw/use_initial_conditions/epiforecasts_covidregionaldata/COVID19-Italy-Integrated-Surveillance-Data.csv")
#> Rows: 674503 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (4): region, gender, age_cohort, indicator
#> dbl  (1): count
#> date (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
it_agg_data <- it_inphyt_data %>%
  group_by(date, region, indicator) %>%
  summarise(across(where(is.double), sum), .groups = "drop")
it_agg_data %>%
  filter(indicator == "confirmed") %>%
  ggplot(aes(x = date, y = count, colour = region)) +
  geom_line() +
  theme_minimal()

Created on 2022-03-14 by the reprex package (v2.0.1)

RichardMN commented 2 years ago

Back with more questions, some of which may take a bit of digging.

What is the care indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both hospitalized and care?

What is your preferred count between symptomatic and confirmed? I think in most other series we're using confirmed but the delay between one and the other may be significant, and tracking asymptomatic but confirmed may be useful too. I can write code to choose between which definition is used (see what the Lithuania code offers for three different criteria for attributing death to COVID) but right now I'm trying to get something running.

Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.) It'll have to be amended to match what your region identifiers are but I'm not that familiar with our Italy code so don't quite know why we do this. I wonder if it may be that the two regions share an ISO-3166 code and so they get merged together because in many of our other usages we depend on the ISO-3166 being a unique identifier for regions.

        mutate(level_1_region = recode(.data$level_1_region,
          "P.A. Trento" = "Trentino-Alto Adige",
          "P.A. Bolzano" = "Trentino-Alto Adige"
        )) %>%

For now, #464 is a first write-through of an alternate implementation of the Italy code which uses the InPhyT data. I'll make a PR here and would welcome someone else poking it a bit. Later this week I may try putting in:

[ ] option to switch between Italy data sources
[ ] option to choose between symptomatic and confirmed

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN, thanks for your feedback and your questions.

What is the care indicator? Is this new patients going into ICU? If a patient is admitted to hospital and goes immediately into ICU, are they counted as both hospitalized and care?

We've renamed care with the more explicit ICU_admission and hospitalized with the more explicit ordinary_hospital_admission in our dataset (temporary branch). If a patient is admitted to hospital and goes immediately into ICU, they will not be counted both as ordinary_hospital_admission and ICU_admission, but exclusively as ICU_admission.

What is your preferred count between symptomatic and confirmed?

We have no preferred count between confirmed (confirmed cases by date of diagnosis) and symptomatic (symptomatic cases by date of symptoms onset). It crucially depends on your specific research goal. It might be useful to write some code to easily choose between the two options.
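As a rough sketch of what such a choice could look like, assuming the tidy column names discussed earlier in this thread: `select_cases()` and its `case_definition` argument are hypothetical names for illustration, not part of covidregionaldata, and the counts are fictional.

```r
# Hypothetical helper: pick one case definition, then sum over gender (and any
# other strata) to get one count per date and region.
select_cases <- function(tidy, case_definition = c("confirmed", "symptomatic")) {
  case_definition <- match.arg(case_definition)  # rejects anything else
  chosen <- tidy[tidy$indicator == case_definition, , drop = FALSE]
  aggregate(count ~ date + region, data = chosen, FUN = sum)
}

# Fictional data for illustration only.
toy <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-01")),
  region = "Abruzzo",
  gender = c("M", "F", "M"),
  indicator = c("confirmed", "confirmed", "symptomatic"),
  count = c(3, 2, 1),
  stringsAsFactors = FALSE
)

by_diagnosis <- select_cases(toy)                # default: "confirmed"
by_onset <- select_cases(toy, "symptomatic")
```

Keep in mind that the `date` column means something different in each case: date of diagnosis for `confirmed`, date of symptoms onset for `symptomatic`.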

Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.)

        mutate(level_1_region = recode(.data$level_1_region,
          "P.A. Trento" = "Trentino-Alto Adige",
          "P.A. Bolzano" = "Trentino-Alto Adige"
        )) %>%

Yes, this aggregation makes perfect sense since Trentino-Alto Adige is the Italian region made up of the two self-governing provinces of Trento and Bolzano.
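One thing to watch after such a recode is that the two provinces end up as separate rows with the same region name, so the counts still need to be summed. A minimal base R sketch with fictional numbers, reusing the `level_1_region` column name from the snippet above (the other names are made up for illustration):

```r
# Fictional counts; the two autonomous provinces are reported separately.
cases <- data.frame(
  date = as.Date(rep("2020-03-01", 3)),
  level_1_region = c("P.A. Trento", "P.A. Bolzano", "Abruzzo"),
  count = c(5, 7, 2),
  stringsAsFactors = FALSE
)

# Same mapping as in the existing code: both provinces go to their region.
province_to_region <- c(
  "P.A. Trento" = "Trentino-Alto Adige",
  "P.A. Bolzano" = "Trentino-Alto Adige"
)
is_province <- cases$level_1_region %in% names(province_to_region)
cases$level_1_region[is_province] <-
  unname(province_to_region[cases$level_1_region[is_province]])

# Two rows now share the same date and region, so sum them into one.
merged <- aggregate(count ~ date + level_1_region, data = cases, FUN = sum)
```

With these numbers `merged` has one Trentino-Alto Adige row holding the combined count of both provinces, plus the untouched Abruzzo row.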

Please tell us if any further changes are needed.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN,

Today we've successfully updated our repository merging the new folder epiforecasts_covidregionaldata.

Please don't hesitate to let us know if any further changes are needed.

RichardMN commented 2 years ago

I've adjusted the download url (twice - I got it wrong the first time). Checks appear to be failing in the github workflow but I think that may be because there's a problem with the French data right now.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hello @RichardMN,

Is there anything else we can do on our side to facilitate the transition?

Thanks.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN,

We've recently solved a few issues and added one age class so that now we provide the following age classification:

{0_5, 6_12, 13_19, 20_29, 30_39, 40_49, 50_59, 60_69, 70_79, 80_89, 90_+}
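For anyone who wants to line up other age-resolved data with these classes, the classification can be sketched as half-open intervals in base R. This assumes ages are given in completed years; `age_to_cohort()` is a hypothetical helper, not something from our repository or from covidregionaldata.

```r
# Lower bound of each class, plus Inf to close the open-ended last class.
age_breaks <- c(0, 6, 13, 20, 30, 40, 50, 60, 70, 80, 90, Inf)
age_labels <- c("0_5", "6_12", "13_19", "20_29", "30_39", "40_49",
                "50_59", "60_69", "70_79", "80_89", "90_+")

# Hypothetical helper: map ages in completed years to the class labels.
# right = FALSE makes each interval [lower, upper), so age 6 falls in "6_12".
age_to_cohort <- function(age) {
  as.character(cut(age, breaks = age_breaks, labels = age_labels, right = FALSE))
}

cohorts <- age_to_cohort(c(5, 6, 19, 90))
```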

Here is the updated data.

Thanks.

RichardMN commented 2 years ago

Hi @InterdisciplinaryPhysicsTeam - thank you for the various updates.

There are two slightly interrelated issues. I am not a maintainer of this package and so I cannot apply changes.

The package appears to be moving towards senescence: many of the upstream sources have stopped updating or moved to frequencies which are no longer useful for the epidemiological work which people want to do with data from covidregionaldata. As a contributor I cannot be sure it's "worth" my time to try to develop and apply changes which might never be accepted or which I may be the only person to use.

On a slightly related point, France has changed their data format (three weeks ago) #469 which means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get those fixed and applied.

Returning to point 1, I need a sense from @seabbs or @kathsherratt or others whether we're going to try to modularize the package better (so that single country failures don't bork everything else) or just accept that it was very useful for a time but no longer appears to have utility or a market.
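To make the modularization idea a bit more concrete, here is a very rough sketch of isolating per-country failures with `tryCatch()`. Nothing here reflects how the package is actually structured: `fetch_all()` and the stub fetchers are hypothetical, and the stubs stand in for real download code.

```r
# Hypothetical wrapper: run every country's fetcher and record failures
# instead of letting one error stop the whole run.
fetch_all <- function(fetchers) {
  lapply(fetchers, function(fetch) {
    tryCatch(
      list(ok = TRUE, data = fetch(), error = NULL),
      error = function(e) list(ok = FALSE, data = NULL, error = conditionMessage(e))
    )
  })
}

# Stub fetchers for illustration; the second one simulates a broken source.
fetchers <- list(
  italy = function() data.frame(region = "Abruzzo", count = 1),
  france = function() stop("upstream format changed")
)

results <- fetch_all(fetchers)
failed <- names(results)[!vapply(results, function(r) r$ok, logical(1))]
```

With something along these lines, checks for one country could be reported as failed while the others still run.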

This is a bit of a bigger question than belongs in this issue but this appears to be where the conversation might take place.

Bisaloo commented 2 years ago

Hi all, and thanks @RichardMN for bringing up this topic.

As mentioned in https://github.com/epiforecasts/covidregionaldata/discussions/459, we are unsure if this package is still used by / useful to anyone. Because of this, most of the contributors have moved on (except @RichardMN, whose heroic efforts to keep this package running need to be highlighted!).

I can help in getting outstanding PRs merged though if someone feels that something needs updating / fixing.

Two comments:

Regarding this specific issue / PR, I haven't moved just yet because it's still not clear to me if this data is unequivocally better than the previous one or if we should keep both data sources with a switch. In #464, @RichardMN mentions:

but (see https://github.com/epiforecasts/covidregionaldata/issues/463) I think it may be useful to be able to switch between the two options.

@InterdisciplinaryPhysicsTeam, @ClaudMor, can you weigh in on this please?
About the bigger picture regarding changes while the package is broken for other reasons:

On a slightly related point, France has changed their data format (three weeks ago) https://github.com/epiforecasts/covidregionaldata/issues/469 which means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get those fixed and applied.

Please do not worry about this @RichardMN. If you want to submit a change, please feel free to do it, whatever the status of the rest of the package. Please don't feel you have a duty to fix other parts of the package to get a change accepted. If tests are failing for an unrelated reason, we can still (most of the time) verify that your PR didn't break anything else and go ahead and merge it.

If necessary, feel free to ping me. I cannot promise I'll always be responsive but I'll try.

InterdisciplinaryPhysicsTeam commented 2 years ago

Hi @RichardMN @Bisaloo @pitmonticone @ClaudMor,

Thank you @Bisaloo for your reply.

Regarding this specific issue / PR, I haven't moved just yet because it's still not clear to me if this data is unequivocally better than the previous one or if we should keep both data sources with a switch.

It very much depends on which variables you're interested in and would like to make use of.

The main differences between the integrated surveillance data from the Italian National Institute of Health that we update on a weekly basis here and the surveillance data from the Italian Department of Civil Protection that they update on a daily basis here are the following:

the former is disaggregated by sex and age while the latter is aggregated;
the former contains daily time series of new confirmed cases, symptomatic cases, ordinary hospital admissions, intensive hospital admissions and deceased cases, while the latter also includes tests performed, total tested, cumulative confirmed cases, cumulative hospitalised cases and isolated cases;
the former contains incidences organised by date of key event while the latter by date of notification.

For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared.

Bisaloo commented 2 years ago

Okay, I'm quite convinced we need to keep both data sources, with the ability for the user to switch from one to the other.
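A minimal sketch of what such a switch might look like from the user's side. The function name, the `data_source` argument, its `"dpc"` / `"inphyt"` values and the stub loaders are all hypothetical and only illustrate the idea; the real loaders would download from the respective repositories.

```r
# Stub loaders standing in for the two real download routines (fictional data).
toy_loaders <- list(
  dpc = function() data.frame(source = "dpc", date_type = "notification"),
  inphyt = function() data.frame(source = "inphyt", date_type = "key event")
)

# Hypothetical entry point: the first value of data_source is the default,
# and match.arg() rejects anything that is not one of the listed choices.
get_italy_data <- function(data_source = c("dpc", "inphyt"), loaders = toy_loaders) {
  data_source <- match.arg(data_source)
  loaders[[data_source]]()
}

default_data <- get_italy_data()          # existing source stays the default
inphyt_data <- get_italy_data("inphyt")   # opt in to the new source
```

Keeping the existing source as the default would mean current users see no change unless they ask for the new data.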

@RichardMN, are you interested in implementing this or would you like me to do it? No pressure either way.

epiforecasts / covidregionaldata

Italian COVID-19 Integrated Surveillance Data #463
