ImperialCollegeLondon / covid19model

Code for modelling estimated deaths and cases for COVID19.
MIT License

Sweden data source is wrong #58

Closed by itayguy 4 years ago

itayguy commented 4 years ago

Describe the bug: Your data source is wrong. Worldometers has the wrong death counts for Sweden; https://experience.arcgis.com/experience/09f821667ce64bf7be6f9f87457ed9aa is accurate.

image

vs

image

d-Slava commented 4 years ago

Thank you, itayguy. A very encouraging dynamic, taking into account the milder interventions. We need to work out the role of herd immunity, but also why the UK, with a later and less strict lockdown, does not show such a dynamic.

zach-hensel commented 4 years ago

The ECDC data used in this model is listed by date of announcement, not by date of death. COVID-19 deaths in Sweden are sometimes reported over a week late, as you can see by comparing your pictures to this one from today (all data points in your pictures are <75/day; some of the same points are now over 80). Sweden also consistently reports only a small fraction of deaths on weekends and holidays. Neither data set is particularly good, but the ECDC data set at least shows systematically low numbers only for weekends and holidays rather than for every recent day.

image

BoBernhardsson commented 4 years ago

This illustrates that overly aggressive predictions made early, on too little and too unreliable data, are dangerous. Could the erroneous predictions for Sweden's future (which are becoming more and more obvious from recent days' developments) have been avoided, or spotted earlier, if the correct data, indexed by actual date of death, had been used? The death data up to March 28 is now stable, so rerunning the predictions using the correct (green) data should be prioritized, and the report of March 30 should be revised accordingly.

DrChr commented 4 years ago

I can confirm what @zach-hensel said: Folkhälsomyndigheten (the Public Health Agency of Sweden) continuously updates the number of deaths for dates in the past, sometimes days later. There's actually nothing strange about this in my mind; we just need to be aware that the numbers for the most recent dates are not complete. To give a very concrete example, the data published on the 10th was published at 14:00, so of course the numbers for the 10th cannot include all the deaths on that day. On top of that, there is lag in reporting.

To illustrate the lag, below is a comparison of the number of reported deaths for specific dates as they were reported on the 10th and the 15th.

Date of death   Deaths as published on the 10th   Deaths as published on the 15th   Difference
4/2/2020        67                                68                                1
4/3/2020        65                                69                                4
4/4/2020        57                                60                                3
4/5/2020        75                                78                                3
4/6/2020        74                                82                                8
4/7/2020        60                                70                                10
4/8/2020        70                                90                                20
4/9/2020        23                                55                                32
4/10/2020       13                                52                                39
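To make the lag easier to see, here is a minimal Python sketch that recomputes the differences from the comparison above and also shows what share of each day's count on the 15th had already been reported on the 10th. The names `snapshots` and `revisions` are my own, and the counts from the 15th may themselves still grow, so the shares are upper bounds on completeness rather than final figures.

```python
# Deaths per date of death, copied from the comparison above.
# Values are (count published on the 10th, count published on the 15th).
snapshots = {
    "4/2/2020": (67, 68),
    "4/3/2020": (65, 69),
    "4/4/2020": (57, 60),
    "4/5/2020": (75, 78),
    "4/6/2020": (74, 82),
    "4/7/2020": (60, 70),
    "4/8/2020": (70, 90),
    "4/9/2020": (23, 55),
    "4/10/2020": (13, 52),
}


def revisions(snapshots):
    """Return {date: (deaths added between snapshots, share already reported in the earlier one)}."""
    result = {}
    for date, (early, late) in snapshots.items():
        result[date] = (late - early, early / late)
    return result


rev = revisions(snapshots)
for date, (added, share) in rev.items():
    print(f"{date}: +{added} deaths added later, {share:.0%} already reported on the 10th")
```

The closer a date is to the publishing date, the smaller the share that had been reported, which is the pattern to keep in mind when reading the most recent days of any snapshot.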

Folkhälsomyndigheten publishes the data around 14:00 CET each day on this web page: https://www.folkhalsomyndigheten.se/smittskydd-beredskap/utbrott/aktuella-utbrott/covid-19/bekraftade-fall-i-sverige/

Note: https://www.worldometers.info/coronavirus/country/sweden/ claims to be using https://experience.arcgis.com/experience/09f821667ce64bf7be6f9f87457ed9aa, which should be based on the same data that Folkhälsomyndigheten publishes via the web page above, but I'm guessing Worldometers is not processing the data correctly.

The Folkhälsomyndigheten web page seems to be available only in Swedish. The relevant part is the second paragraph on the page, under the heading Ladda ner data (Download data), i.e. this paragraph: Data som statistiken ovan bygger på kan laddas ner här (Excel) (The data on which the statistics above are based can be downloaded here (Excel)).

The "here" text links to https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data which then gives you a spreadsheet.

I will have to confirm later today if this URL is the same each day, as the most recent spreadsheet could then be retrieved via e.g.: wget -q -O data.xlsx https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data

The number of deaths is found in the tab Antal avlidna per dag (Number of deceased per day), which contains:

Datum_avliden     Antal_avlidna
3/11/2020         1
...               ...
4/10/2020         52
4/11/2020         50
4/12/2020         54
4/13/2020         45
4/14/2020         31
4/15/2020         6
uppgift saknas    18

However, note the last row, labelled uppgift saknas ("information missing"): for an additional 18 people there is (at this time) no information about the date of death.
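Anyone processing this tab needs to keep the undated row out of the time series. A minimal Python sketch of one way to do that, assuming the rows have already been read from the spreadsheet into (label, count) pairs; the function name `split_rows` is hypothetical, and only the rows shown above are included (the elided rows are left out).

```python
from datetime import datetime

# Rows as shown above, in the form (Datum_avliden, Antal_avlidna).
rows = [
    ("3/11/2020", 1),
    ("4/10/2020", 52),
    ("4/11/2020", 50),
    ("4/12/2020", 54),
    ("4/13/2020", 45),
    ("4/14/2020", 31),
    ("4/15/2020", 6),
    ("uppgift saknas", 18),
]


def split_rows(rows):
    """Separate rows with a parseable date of death from rows without one."""
    dated = {}
    undated = 0
    for label, count in rows:
        try:
            day = datetime.strptime(label, "%m/%d/%Y").date()
        except ValueError:
            # Any label that is not a date, e.g. "uppgift saknas", is counted as undated.
            undated += count
        else:
            dated[day] = count
    return dated, undated


dated, undated = split_rows(rows)
print(f"{len(dated)} dated rows, {undated} deaths without a known date")
```

The undated deaths still belong in the cumulative total; they just cannot be assigned to a day yet.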

Below are spreadsheets downloaded from Folkhälsomyndigheten on a few recent days:
[0] - Folkhalsomyndigheten_Covid19.xlsx
[1] - Folkhalsomyndigheten_Covid19-1.xlsx
[2] - Folkhalsomyndigheten_Covid19-2.xlsx
[3] - Folkhalsomyndigheten_Covid19-3.xlsx

Note: Each of the spreadsheets has a tab whose name indicates the publishing date, e.g. [0] above has a tab called "FOHM 10 Apr 2020" and [3] has a tab called "FOHM 15 Apr 2020".
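Since the file names do not carry the publishing date, the tab name is the easiest way to tell snapshots apart. A small Python sketch, assuming the tab name always follows the "FOHM 10 Apr 2020" pattern seen in these files; `publishing_date` is a name I made up, and `%b` relies on English month abbreviations (Python's default locale).

```python
from datetime import datetime


def publishing_date(tab_name, prefix="FOHM "):
    """Parse a tab name such as 'FOHM 10 Apr 2020' into a date."""
    if not tab_name.startswith(prefix):
        raise ValueError(f"unexpected tab name: {tab_name!r}")
    return datetime.strptime(tab_name[len(prefix):], "%d %b %Y").date()


print(publishing_date("FOHM 10 Apr 2020"))
print(publishing_date("FOHM 15 Apr 2020"))
```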

s-mishra commented 4 years ago

Hi everyone, yes, we do realize there are inconsistencies between sources. However, it is not easy to collect data independently from all sources. We are working on explicitly handling reporting error in our next models. As for revising results, we update our estimates daily by running the model on the data available on that day. This means that if ECDC corrects its numbers, our model uses the corrected numbers too; we do not use an online system that takes in only new data.

BoBernhardsson commented 4 years ago

So let's treat this not as a "data source error" but as a more fundamental modeling issue, and open a new issue about it! There is a significant unmodelled time delay between the actual date of death and the day the death is reported into the database, at least for Sweden. The time delay in the system is misestimated by maybe as much as a week. The issue might be present in other countries too. It has an unknown, but possibly large, effect on the estimate of the present epidemic reproduction number and of the impact of non-pharmaceutical interventions taken on certain days. The deaths reported on a certain day actually happened spread out over about a week before (at least in the case of Sweden). When there is little data, in the early phase of an epidemic, where R is modeled to change quickly over time, the effect can give a misleading indication.
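To illustrate the mechanism, here is a toy Python sketch in which a series of deaths by date of death is spread over later announcement dates by a reporting-delay distribution. The input numbers and the delay distribution are made up purely for illustration (they are not Swedish data and not what the model uses), and `announced_series` is a hypothetical name.

```python
def announced_series(deaths_by_day, delay_pmf):
    """Spread deaths over announcement days.

    deaths_by_day[t] is the number of deaths that occurred on day t.
    delay_pmf[d] is the assumed share of deaths announced d days after they occurred.
    """
    announced = [0.0] * (len(deaths_by_day) + len(delay_pmf) - 1)
    for t, deaths in enumerate(deaths_by_day):
        for d, share in enumerate(delay_pmf):
            announced[t + d] += deaths * share
    return announced


# Hypothetical inputs: deaths doubling each day, delays of 0 to 2 days.
true_deaths = [10, 20, 40, 80]
delay = [0.25, 0.5, 0.25]

announced = announced_series(true_deaths, delay)
print(announced)
```

On every day covered by the true series, the announced count is lower than the number of deaths that actually occurred, and the missing deaths show up on later days. A model that reads announcement dates as death dates therefore sees the epidemic curve shifted and smoothed.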

itayguy commented 4 years ago

Thanks for your comments!