CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Irregularity of German data on cases and deaths #3417

Open edomt opened 3 years ago

edomt commented 3 years ago

Dear CSSE team,

Our team at ourworldindata.org has recently switched to JHU data for its data on confirmed cases and deaths. We're very grateful for your efforts in making this data available, as our previous source (the European CDC) had announced it would transition from a daily to a weekly update schedule.

So far we've received very little feedback about potential issues, except for two specific countries. The larger one is France (for which some days are assigned 0 cases/deaths), which I've mentioned in https://github.com/CSSEGISandData/COVID-19/issues/3413, so in this issue I'll focus on Germany.

The daily data for Germany is showing a lot of variation (beyond the usual weekly patterns we're seeing in all countries), and several users have reported that it's often pretty far from what's reported by the Robert Koch Institute—and we're not sure what is causing this:

[Image: coronavirus-data-explorer(45)]

Of course the 7-day rolling average manages to smooth the series pretty well, and we always point our users towards that metric rather than raw daily values.

[Image: coronavirus-data-explorer(46)]

But still, there are many people who come to our site for daily numbers, and we have received a fair number of emails in the last few days asking us why the German data seemed to suffer from so much irregularity. We'd be grateful if you could give us more information on what's causing this.

Best wishes, Ed

CSSEGISandData commented 3 years ago

Hello Ed,

Thank you for your message. We are aware of OWID switching to our data and we are grateful that you consider our work suitable to share with your audience. We also received your email from several days ago and are still working on a response.

With regard to Germany, our data is sourced from the Berliner Morgenpost rather than the Robert Koch Institute (RKI). We use this source because it is slightly more timely than RKI: it pulls data directly from the health departments of the federal states, which are generally ahead of the data presented by RKI. For example, a direct comparison of the state-level data between RKI and the Berliner Morgenpost shows that our source is ahead for 12 of the 16 federal states.

The irregularity is due to the means by which the Berliner calculates the total for an unassigned category. The sum of the totals in the federal states does not equal the total number of cases in the country as tracked by the Karlsruhe Institute of Technology, so the source creates an unassigned category that is the difference between this national total and the sum of the federal states. The national total generally is more timely early in the week but the reporting of the federal states and RKI "catches up" over the week, resulting in decreases in the unassigned category that have net zero effect on the total cases and deaths at the national level.
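As a rough illustration of the residual described above, here is a minimal Python sketch. All figures are hypothetical and the state grouping is invented for the example; it only shows that the unassigned category is the national total minus the sum of the state totals, and that it shrinks as the states catch up while the national total stays the same.

```python
# Hypothetical cumulative counts; none of these figures come from real data.
national_total = 1_000_000  # national cumulative total, held fixed to isolate the effect

# Sum of state reports lags early in the week and catches up later (assumed values).
state_totals_monday = {"Bavaria": 200_000, "Berlin": 60_000, "Other states": 730_000}
state_totals_friday = {"Bavaria": 203_000, "Berlin": 61_000, "Other states": 736_000}


def unassigned(national, states):
    """Residual between the national total and the sum of the state totals."""
    return national - sum(states.values())


monday = unassigned(national_total, state_totals_monday)
friday = unassigned(national_total, state_totals_friday)

print(monday)  # 10000
print(friday)  # 0

# States plus unassigned always add back up to the national total.
print(sum(state_totals_monday.values()) + monday == national_total)  # True
print(sum(state_totals_friday.values()) + friday == national_total)  # True
```

In this sketch the unassigned category drops from 10,000 to 0 over the week, yet the national total is unchanged on both days.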

The seven-day rolling average looks smooth for this reason. I'll leave this issue open so others can look at this response if they are curious.
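To make the smoothing concrete, here is a small Python sketch of a seven-day rolling average. The daily values are hypothetical and chosen to repeat the same weekly pattern; the function name `rolling_mean` is an assumption for this example, not part of any JHU or OWID tooling.

```python
def rolling_mean(values, window=7):
    """Trailing mean over `window` consecutive values (first window-1 days dropped)."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily new cases: an irregular weekly pattern, repeated for two weeks.
week = [100, 300, 50, 400, 0, 350, 200]
daily = week + week

smoothed = rolling_mean(daily)
print(smoothed)  # [200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0]
```

Because every seven-day window contains each day of the week exactly once, shifts of cases between days within a week cancel out in the average, even though the raw daily values swing between 0 and 400.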

edomt commented 3 years ago

Great, many thanks for this very clear explanation! We'll refer people to it in the future if we get more questions.

analphabit commented 3 years ago

Dear CSSE team,

why don't you either stick to the federal data (without the unassigned cases) as sourced by the "Berliner Morgenpost" and other German news outlets OR use the official RKI data published every night/early morning OR use the KIT/Risklayer data directly that doesn't have an "unassigned category"?

The inclusion of an "unassigned category" seems very strange since these have to be cases that are counted but are assigned to neither a federal state nor a county/city according to "Berliner Morgenpost". But if they don't yet belong somewhere, how can they be counted? In the end the "Berliner Morgenpost" inflates the case numbers without giving concrete sources.

Also your comment that the "Berliner Morgenpost" is "slightly more timely than RKI" is strange. In fact, the data sourced from the federal states directly simply has other deadlines than the RKI data.

For example, Bavaria publishes its new daily case numbers around 2 pm and includes cases reported till 10 am on the same day.

The RKI data reflects all new reported cases during the 24 hours of the previous day.

So generally, in the evening the total count of the federal sources produces higher case numbers (even without the "unassigned category") since they already include cases from the running day - and in the early morning the RKI numbers are higher again since at this point the RKI has the most recent data that is reported to them directly by the local authorities (e.g., they might also have included Bavarian cases reported after 10 am).

Of course, there are delays in reporting from all 412 local authorities to the RKI. These delays can add up to a week, which distorts the data and explains why RKI data generally shows fewer cases than KIT/Risklayer data. But to correct for these delays you have to source all 412 local authorities directly (and accept other possible data irregularities that RKI might check for), which is exactly what KIT/Risklayer/Tagesspiegel are doing.

Regards