CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

France confirmed cases numbers seem off compared to Santé Publique France (official agency) #3287

Closed ObliviousMonkey closed 3 years ago

ObliviousMonkey commented 3 years ago

I'm seeing odd numbers for confirmed cases in France. Per Santé Publique France, the official public health agency, the cumulative number of confirmed cases was 1,235,132 on 2020/10/28, with a daily peak of 52,010 cases on 10/25.

However, the data here shows almost double that number:

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

daily_france <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") %>% 
  # drop map coordinates and regions
  select(-c(Lat, Long, `Province/State`)) %>% 
  # merge all regions in a single country...
  group_by(`Country/Region`) %>% 
  # ...and sum the values within country
  summarize_all(sum) %>% 

  # Tidy data : pivot dates
  pivot_longer(2:ncol(.), names_to = "Date", values_to = "cases") %>%
  mutate(Date = as.Date(Date, "%m/%d/%y")) %>%

  # calculate daily new cases by differencing the cumulative counts within each country
  group_by(`Country/Region`) %>%
  mutate(cases = cases - lag(cases, default = 0)) %>%
  ungroup() %>%
  # and delete first day
  filter(Date != min(Date)) %>% 

  # select French data
  filter(`Country/Region` == "France")
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   `Province/State` = col_character(),
#>   `Country/Region` = col_character()
#> )
#> See spec(...) for full column specifications.

# see last 7 days
daily_france %>% tail(7)
#> # A tibble: 7 x 3
#>   `Country/Region` Date        cases
#>   <chr>            <date>      <dbl>
#> 1 France           2020-10-22  41622
#> 2 France           2020-10-23  42668
#> 3 France           2020-10-24     25
#> 4 France           2020-10-25     62
#> 5 France           2020-10-26 124905
#> 6 France           2020-10-27  34591
#> 7 France           2020-10-28  35973

# Plot it
ggplot(daily_france, aes(x = Date, y = cases)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
  scale_x_date(breaks = "1 month")

Created on 2020-10-29 by the reprex package (v0.3.0)

It looks like the weekend data is almost zero, and the 10/26 spike of 124,905 seems to result from catching up on the two previous days (10/24 and 10/25) plus the day itself. In the Santé Publique France dataset, however, the weekend counts are lower than on weekdays but nowhere near as low. As a result, its peak is about half the one in this dataset and does not fall on the same day.

Does someone have an explanation for this? Thanks for your help!

CSSEGISandData commented 3 years ago

Hello

Your comment raises two issues. The first is the difference between our total numbers and those reported by Santé Publique France, and the second is the reporting on weekends. I'll address each separately.

1) Our methodology for France has been detailed previously (#2459). The principal difference is that the number reported on the Santé Publique France website (1,235,132) includes cases in overseas territories and dependencies. We report these regions separately in our dashboard as Admin level 1 regions under France, so their totals have to be subtracted from the agency's total to obtain the figure for continental France. A secondary difference is our inclusion of probable cases in nursing and care homes. This currently represents a static addition of 37,129 cases, as the agency no longer reports cases of this nature.

2) The spike in cases on Mondays is due to our source not publicly posting updated case counts on weekends. We are looking into alternative sources that provide the France weekend data in a machine-readable format. While weekend reporting is still paused, we recommend using smoothing algorithms to mitigate the impact of these spikes, but as a rule we only match official reporting and do not try to smooth data internally.
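
As an illustration of the smoothing suggested in point 2, here is a minimal sketch of a trailing 7-day mean in base R. The seven daily values are copied from the reprex output earlier in this thread; the column name `cases_7d` is made up for this example, and this is only one of several possible smoothing choices, not a description of how either dataset is produced.

```r
# Daily new cases for France, copied from the reprex output above
daily <- data.frame(
  Date  = as.Date("2020-10-22") + 0:6,
  cases = c(41622, 42668, 25, 62, 124905, 34591, 35973)
)

# Trailing 7-day mean: each value averages the current day and the 6 days before it.
# stats::filter() is called with its namespace because dplyr masks filter().
daily$cases_7d <- as.numeric(
  stats::filter(daily$cases, rep(1 / 7, 7), sides = 1)
)

daily
```

The first six rows are `NA` because their window is incomplete. The last row is the mean of the seven days shown, so the near-zero weekend values and the Monday catch-up cancel out within the window.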

Please let us know if there is further clarification needed.
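
For point 1, the split between continental France and the separately reported regions could be sketched as follows. The counts and territory names are made up purely for illustration, and the assumption that the continental row is the one with a missing Province/State should be checked against the actual file.

```r
# Made-up cumulative counts for one date, NOT real data.
# Assumption: the continental row has a missing Province/State,
# while each overseas region has its own named row.
toy <- data.frame(
  province   = c(NA, "Territory A", "Territory B"),
  country    = "France",
  cumulative = c(1000, 30, 20)
)

# Label each row, then total by label
toy$scope <- ifelse(is.na(toy$province), "mainland", "overseas")
totals <- tapply(toy$cumulative, toy$scope, sum)
totals

# Summing every row under France gives the figure that includes overseas regions
all_france <- sum(toy$cumulative)
all_france
```

Grouping by `Country/Region` alone, as in the reprex above, yields the equivalent of `all_france`, which is the figure to compare against a national total that includes overseas regions.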

ObliviousMonkey commented 3 years ago

Hi,

Thanks for the detailed answer!

  1. I accounted for this by summing all data under France from your dataset, and in any case I didn't expect the final cumulative number to be exactly identical. However, I would expect the overall variations to be close.

  2. I use a 7-day rolling average for visualization, but I like to be able to tell when the maximum raw value was reached, in order to know whether there is an active surge and to have a number and a date close or identical to the official ones. Sadly, this isn't possible with rolling averages. For example, with a 7-day rolling average, the max value is no longer 52,010 on 10/25 but 40,837 on 10/29. There is this official source, but even though it's updated daily (the last update is 10/29 and the next should be in a couple of hours), the latest reported day in the dataset is actually 10/26, so there is currently a 6-day lag, which I understand is not ideal. I don't know why the numbers displayed here are not available anywhere in a machine-readable format; this is very frustrating...
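
One possible workaround for locating a raw peak without a full rolling average is to spread each Monday catch-up evenly over the Saturday, Sunday and Monday it covers. The sketch below does this in base R on the seven values from the reprex output. The even split is purely an assumption of this example (the real weekend counts are unknown in this dataset), so it cannot reproduce the official 52,010 on 10/25; the column name `adjusted` is also made up.

```r
# Daily new cases for France, copied from the reprex output above
daily <- data.frame(
  Date  = as.Date("2020-10-22") + 0:6,
  cases = c(41622, 42668, 25, 62, 124905, 34591, 35973)
)

# Weekday as a number via the %u format (1 = Monday ... 7 = Sunday),
# which does not depend on the locale
wday <- as.integer(format(daily$Date, "%u"))

# Assumption: Monday's count carries the unreported Saturday and Sunday cases,
# so the three-day total is spread evenly over those three days.
adjusted <- daily$cases
for (i in which(wday == 1)) {
  if (i >= 3) {
    block <- (i - 2):i
    adjusted[block] <- sum(daily$cases[block]) / 3
  }
}
daily$adjusted <- adjusted

# Peak day in the raw series versus the adjusted series
daily[which.max(daily$cases), c("Date", "cases")]
daily[which.max(daily$adjusted), c("Date", "adjusted")]
```

On these seven days the raw peak is the Monday catch-up on 2020-10-26, while the adjusted series peaks on 2020-10-23. The weekly total is unchanged, because the adjustment only moves cases between the three days.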

Anyway, thanks for all your work!