RamiKrispin / coronavirus

The coronavirus dataset
https://ramikrispin.github.io/coronavirus/

Bug in Italian data? #18

Closed dskov closed 4 years ago

dskov commented 4 years ago

There might be a bug in the data for March 12th. When I filter by country, I don't see any new cases that day for Italy or France.

This is the filtered data for Italy:

  date       confirmed death recovered active confirmed_cum death_cum
  <date>         <int> <int>     <int>  <int>         <int>     <int>
1 2020-03-09      1797    97       102   1598          9172       463
2 2020-03-10       977   168         0    809         10149       631
3 2020-03-11      2313   196       321   1796         12462       827
4 2020-03-12         0     0         0      0         12462       827
5 2020-03-13      5198   439       394   4365         17660      1266
6 2020-03-14      3497   175       527   2795         21157      1441

Data before applying the filter:

Italy | 43.0000 | 12.0000 | 2020-03-12 | 0 | recovered

D.

RamiKrispin commented 4 years ago

Hi @dskov, could you please provide the code you used to create the data above so that I can reproduce it on my side?

dskov commented 4 years ago

Hi,

Here is a bit of code:

library(coronavirus)
library(dplyr)  # provides the %>% pipe used below

data("coronavirus")

# Select country list
#country_list <- c("Italy", "France", "Germany", "UK", "Switzerland", "Norway")
#country_list <- c("China")
#country_list <- c("US")
country_list <- c("Italy")
#country_list <- c("France")
#country_list <- c("Chile")
#country_list <- c("Null")

# Negate logical operator
`%notin%` <- Negate(`%in%`)

# Scale function: set scaleFlag to 1 to plot on a log scale
scaleFlag <- 0
scaleFunc <- function(inData, flag){
  if (flag) {
    return(log(inData))
  }
  else {
    return(inData)
  }
}

# Apply Filter
justSelReg <- coronavirus  %>%
  dplyr::mutate(sel_region = Country.Region %in% country_list)

# Readapt Data Set
df_selreg <- justSelReg %>%
  dplyr::group_by(date, type) %>%
  dplyr::summarise(total = sum(cases*sel_region, na.rm = TRUE)) %>%
  tidyr::pivot_wider(names_from = type,
                     values_from = total) %>%
  dplyr::arrange(date) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(active =  confirmed - death - recovered) %>%
  dplyr::mutate(confirmed_cum = cumsum(confirmed),
                death_cum = cumsum(death),
                recovered_cum = cumsum(recovered),
                active_cum = cumsum(active))

# plotting the data
plotly::plot_ly(data = df_selreg) %>%
  plotly::add_trace(x = ~ date,
                    y = ~ scaleFunc(active_cum, scaleFlag),
                    type = "scatter",
                    mode = "lines+markers",
                    name = "Active cases",
                    line = list(color = "#1f77b4")) %>%
  plotly::add_trace(x = ~ date,
                    y = ~ scaleFunc(recovered_cum, scaleFlag),
                    type = "scatter",
                    mode = "lines+markers",
                    name = "Recovered",
                    line = list(color = "green"),
                    marker = list(color = "green")) %>%
  plotly::add_trace(x = ~ date,
                    y = ~ scaleFunc(death_cum, scaleFlag),
                    type = "scatter",
                    mode = 'lines+markers',
                    name = "Deaths",
                    line = list(color = "red"),
                    marker = list(color = "red")) %>%

  plotly::layout(title = country_list,
                 yaxis = list(title = "Number of Cases"),
                 xaxis = list(title = "Date"),
                 legend = list(x = 0.1, y = 0.9),
                 hovermode = "compare")
tail(df_selreg)

GabrieleZucca commented 4 years ago

The "bug" also affects other countries, like France, Spain and Switzerland. The day is the same: 12-03-2020.

For example, using:

coronavirus %>% filter(Country.Region == "Spain", cases == 0, type == "confirmed") 

I obtain 0 cases for 12-03-2020.

RamiKrispin commented 4 years ago

Hi @dskov, @GabrieleZucca,

The source of this issue is the raw data that the package pulls from. It occurs when the cumulative value does not change between two consecutive days (in this case, March 11 and 12). Therefore, the daily value for March 12th is zero.

There is already an open issue for this, and I will track it to see what the fix is.
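
A minimal sketch of the effect described above, using the cumulative confirmed counts for Italy from the table earlier in this thread. It assumes the daily values are derived as the day-over-day change in the cumulative series; the variable names are illustrative, and this is not the package's actual processing code.

```r
# Cumulative confirmed cases for Italy, copied from the table in this thread
dates <- as.Date("2020-03-09") + 0:5
confirmed_cum <- c(9172, 10149, 12462, 12462, 17660, 21157)

# Assumption: daily cases are the change in the cumulative value
# from one day to the next
daily <- diff(confirmed_cum)
names(daily) <- format(dates[-1])
print(daily)

# Days on which the cumulative value did not move, so the daily value is zero
zero_days <- dates[-1][which(daily == 0)]
print(zero_days)
```

With these numbers, the only zero is on 2020-03-12, and the following day (5198) carries the cases for both days.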

srenoes commented 4 years ago

Official data is like that. There are illogical values all over the place in many data sets, but it is just data. The data from Italy is very good, as it is supplied at the local level (provinces and regions): https://github.com/pcm-dpc/COVID-19

In many cases you can make predictions with SIR models.

RamiKrispin commented 4 years ago

@srenoes, this data has also been available in the package since yesterday, under the name covid_italy.