corona-warn-app / cwa-documentation

Project overview, general documentation, and white papers. The CWA development ends on May 31, 2023. You still can warn other users until April 30, 2023. More information:
https://coronawarn.app/en/faq/#ramp_down
Apache License 2.0

Difference in trends for 7-day incidence and 7-day average #528

Closed nilsalex closed 3 years ago

nilsalex commented 3 years ago

Avoid duplicates

Technical details

Describe the bug

As of now (16.02.2021, 17:11 CET), CWA shows a 7-day average of 7,274 confirmed infections and a 7-day incidence of 58.7/100,000. For the 7-day average, an arrow pointing towards the lower right indicates a downward trend, while for the 7-day incidence, an arrow pointing to the right indicates a stable trend. Yesterday, the difference was even higher: a downward trend vs an upward trend.

My understanding is that both numbers are related by a factor like

(7-day incidence) = (7-day average) * 7 * 100,000 / (about 83,000,000)

and therefore, the trend should always be the same. Or is there more to it?
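A minimal sketch of the relationship above, assuming the approximate population of 83,000,000 from the formula. The function names and the two sample averages are hypothetical. It only illustrates that, if both tiles were derived from the same 7-day total, the constant factor would cancel out of any relative change.

```python
POPULATION = 83_000_000  # approximate figure assumed in the formula above


def incidence_from_average(seven_day_average: float, population: int = POPULATION) -> float:
    """7-day incidence per 100,000 implied by a 7-day average of daily cases."""
    return seven_day_average * 7 * 100_000 / population


def relative_change(new: float, old: float) -> float:
    """Change of `new` relative to `old`, as a fraction."""
    return (new - old) / old


# Hypothetical 7-day averages for two consecutive days (not real data).
avg_yesterday, avg_today = 7_500.0, 7_274.0

change_avg = relative_change(avg_today, avg_yesterday)
change_inc = relative_change(
    incidence_from_average(avg_today), incidence_from_average(avg_yesterday)
)
print(f"average: {change_avg:.4%}, incidence: {change_inc:.4%}")

# Incidence implied by the displayed average of 7,274.
print(round(incidence_from_average(7_274), 1))  # 61.3
```

The implied incidence of 61.3 differs from the displayed 58.7, which may hint that the two tiles are not computed from exactly the same 7-day total.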

Steps to reproduce the issue

Open the app and swipe through the widgets.

image

image

Expected behaviour

Same trend for both indicators.


Internal Tracking ID: EXPOSUREAPP-5225

Ein-Tim commented 3 years ago

@nilsalex

I think the cause for this is the following:

The number of cases, its difference from the previous day, and the number of deaths refer to cases that are transmitted to the RKI daily. This includes cases that were reported to the local health authority (Gesundheitsamt) on the same day or on earlier days. The cases in the last 7 days and the 7-day incidence are based on the reporting date at the local health authority, that is, the date on which the local health authority became aware of the case and recorded it electronically.

(Source: https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html).

This would explain the difference, wouldn't it?

(pinging @MikeMcC399 since he has a great understanding of such things)

MikeMcC399 commented 3 years ago

@Ein-Tim I'm definitely not an expert on these statistics, but I can Google!

Start first by tapping the ℹ️ icon in the app for the definitions.

Then access the raw data through https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html and a link in that page to https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Daten/Fallzahlen_Daten.html

According to that Excel file in tab "BL_7-Tage-Inzidenz" the 7-Day Incidence on Feb 16, 2021 of confirmed new infections was 58.7 and 7 days before that on Feb 9, 2021 it was 72.8. So that was a downwards trend of 14.1 or -19% based on the Feb 9 data.

Using the tab "BL_7-Tage-Fallzahlen" I couldn't find values which matched the ones in the app, so I used the tab "Fälle-Todesfälle-gesamt" instead. The sum of Differenz Vortag Fälle for Feb 10 to Feb 16, 2021 is 50919, divided by 7 is 7274. The sum for Feb 3 to Feb 9, 2021 is 63839, divided by 7 is 9120. This is a difference of 1846 or -20% compared to the Feb 9 data.


Based on that I don't understand why the app is showing Trend: Steady for the 7-Day Incidence when, according to the figure I quoted, the trend is 19% down and this is more than the 5% threshold to declare it as Trend: Downwards and mark it with a green arrow.
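The comparison above can be sketched as follows, assuming the trend is taken relative to the value 7 days earlier with the 5% threshold mentioned here. The function name is hypothetical and this is not the backend's actual implementation.

```python
def weekly_trend(current: float, week_ago: float, threshold: float = 0.05) -> str:
    """Classify the change relative to the value 7 days earlier."""
    change = (current - week_ago) / week_ago
    if change > threshold:
        return "Upwards"
    if change < -threshold:
        return "Downwards"
    return "Steady"


# 7-day incidence: Feb 16 vs Feb 9, 2021
print(weekly_trend(58.7, 72.8))  # Downwards
# 7-day average of new infections: Feb 10-16 vs Feb 3-9, 2021
print(weekly_trend(7274, 9120))  # Downwards
```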

This needs to be looked at.

Thanks to @nilsalex for bringing this up!

Ein-Tim commented 3 years ago

Thank you @MikeMcC399 for checking (I can Google too, but I have to admit that you are often better at explaining such things than I am 😅)

I assume this also affects Android, doesn't it?

If yes, please move it to the documentation repo.

MikeMcC399 commented 3 years ago

@Ein-Tim Yes, this also affects Android, so it should be in the documentation repo.

I think it should be looked at urgently because the 7-Day Incidence value and its trend are the figures that everybody, including politicians, is looking at to influence the decision about the easing of lockdown.

MikeMcC399 commented 3 years ago

For ease of reference here are the RKI daily reports for Feb 16, 2021 and for 7 days previously on Feb 9, 2021.

2021-02-09-en.pdf 2021-02-16-en.pdf

These show the figures

Date | 7-Day Incidence per 100,000 population
Feb 9, 2021 | 73
Feb 16, 2021 | 59

which is a clear downwards trend (that I am sure we are all happy to be seeing 👏!)

MikeMcC399 commented 3 years ago

The value today, Feb 17, 2021, for 7-Day Incidence is 57.0 and the trend is down, which looks good.

Date | 7-Day Incidence per 100,000 population
Feb 10, 2021 | 68
Feb 17, 2021 | 57

The data for yesterday should still be investigated though.

Ein-Tim commented 3 years ago

@dsarkar Could you take a look at this and transfer it to the correct repo?

Thanks!

MikeMcC399 commented 3 years ago

The value today, Feb 18, 2021, for 7-Day Incidence is 57.1 with "Trend: Steady".

Date | 7-Day Incidence per 100,000 population *
Feb 11, 2021 | 64.2
Feb 18, 2021 | 57.1

The incidence has decreased by 7.1 or 11% of 64.2, so why does it show "Trend: Steady" not "Trend: Downwards"?

* Values from Fallzahlen_Kum_Tab.xlsx

MikeMcC399 commented 3 years ago

It looks like the trend indicator is just comparing to the value from the previous day, whereas the help text says "The trend compares the value from the previous day with the value from two days ago or, for the 7-day trends, the average value from the last 7 days with the average value from the 7 days prior to that." So the displayed comparison does not correspond to the method described in the help text. (Or I have misunderstood!)

Date | 7-Day Incidence per 100,000 population
04.02.2021 | 80,7
05.02.2021 | 79,9
06.02.2021 | 77,3
07.02.2021 | 75,6
08.02.2021 | 76,0
09.02.2021 | 72,8
10.02.2021 | 68,0
11.02.2021 | 64,2
12.02.2021 | 62,2
13.02.2021 | 60,1
14.02.2021 | 57,4
15.02.2021 | 58,9
16.02.2021 | 58,7
17.02.2021 | 57,0
18.02.2021 | 57,1
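To test the two readings of the help text against these values, here is a rough sketch. It assumes a 1% threshold for the previous-day comparison and a 5% threshold for the comparison with the value 7 days earlier, as the help text states. The function names are hypothetical and the backend's real logic is not known here.

```python
# 7-day incidence values from the table above (date, incidence per 100,000).
values = [
    ("04.02.2021", 80.7), ("05.02.2021", 79.9), ("06.02.2021", 77.3),
    ("07.02.2021", 75.6), ("08.02.2021", 76.0), ("09.02.2021", 72.8),
    ("10.02.2021", 68.0), ("11.02.2021", 64.2), ("12.02.2021", 62.2),
    ("13.02.2021", 60.1), ("14.02.2021", 57.4), ("15.02.2021", 58.9),
    ("16.02.2021", 58.7), ("17.02.2021", 57.0), ("18.02.2021", 57.1),
]


def trend(current: float, reference: float, threshold: float) -> str:
    """Classify the relative change of `current` against `reference`."""
    change = (current - reference) / reference
    if change > threshold:
        return "Upwards"
    if change < -threshold:
        return "Downwards"
    return "Steady"


def compare(index: int):
    """Return (date, previous-day trend, 7-days-earlier trend) for one row."""
    date, today = values[index]
    daily = trend(today, values[index - 1][1], 0.01)   # assumed 1% threshold
    weekly = trend(today, values[index - 7][1], 0.05)  # assumed 5% threshold
    return date, daily, weekly


results = [compare(i) for i in range(11, len(values))]  # 15.02. to 18.02.
for date, daily, weekly in results:
    print(f"{date}: vs previous day = {daily:9}, vs 7 days earlier = {weekly}")
```

Under these assumptions, the previous-day comparison gives Steady for 16.02. and 18.02. and Downwards for 17.02., which matches what the app displayed on those days, while the comparison with 7 days earlier gives Downwards for all four dates.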

The full help text from statistics_explanation_trend_text is:

EN

"Trend"

"The arrow direction indicates whether the trend is increasing, decreasing, or remaining steady – that is, demonstrates a deviation of less than 1% compared to the previous day or 5% compared to the previous week. The color indicates this trend as positive (green), negative (red), or neutral (gray). The trend compares the value from the previous day with the value from two days ago or, for the 7-day trends, the average value from the last 7 days with the average value from the 7 days prior to that."


DE

"Die Pfeilrichtung zeigt an, ob der Trend nach oben oder nach unten geht oder relativ stabil ist, d.h. eine Abweichung von weniger als 1% im Vortagesvergleich bzw. 5% im Vorwochenvergleich aufweist. Die Farbe bewertet diesen Trend als positiv (grün), negativ (rot) oder neutral (grau). Der Trend vergleicht den Wert vom Vortag mit dem Wert von vor zwei Tagen bzw. für die 7-Tage-Trends den Mittelwert der letzten 7 Tage mit dem der vorausgegangenen 7 Tage."

dsarkar commented 3 years ago

@MikeMcC399 regarding your last comment:

MikeMcC399 commented 3 years ago

@dsarkar

I understand these values are already 7-day averages. I think I can follow you: you are saying one should compare 17.2./57.0 with 10.2./68.0, which is clearly trending down.

Correct, yes, that is what I am saying. That is how I understand the explanation in the help text. Is that the way you understand the help text as well?

dsarkar commented 3 years ago

@MikeMcC399 Yes, I think I can follow through. For today and today-7 days I also get -11%; for yesterday and yesterday-7 I get -16%.

Even taking averages of the averaged values (which I think would be wrong), averaging 11-17 Feb (59.8) and comparing with the average for 4-10 Feb (75.8) gives a change of -21.1%.

MikeMcC399 commented 3 years ago

@dsarkar

For today and today-7 days I also get -11%; for yesterday and yesterday-7 I get -16%.

Agreed! 👍

Even taking averages of the averaged values (which I think would be wrong), averaging 11-17 Feb (59.8) and comparing with the average for 4-10 Feb (75.8) gives a change of -21.1%.

From my hazy memory of statistics, averages of averages is not a good thing. I think you should discard those numbers and stick with the first line.

Could you pass the issue on to the originators of the statistics?

I assume that the statistics are calculated by RKI and transferred to the CWA infrastructure. I couldn't find any new documentation in https://github.com/corona-warn-app/cwa-documentation covering the statistics calculations and distribution. It looks to me like there is a binary file pulled from /version/v1/stats on the DOWNLOAD_CDN_URL which suggests that the app just has the job of displaying the data, not calculating it. So if there is an issue with what is displayed then something further upstream needs to be looked at.

dsarkar commented 3 years ago

@MikeMcC399 indeed, I was told that the app only displays statistical data, it does not calculate it. I created an internal ticket 5225, and additionally, I will bring this up today in a meeting.

GisoSchroederSAP commented 3 years ago

All, due to a number of questions regarding our statistics I re-calculated all values for "Neuinfektionen" (new infections), the respective average values, the Incidence values and double-checked the trends - back until January 25. Based on the results let me emphasize the following points:

  1. The CWA just presents the data, calculation happens on the backend side.
  2. All numbers presented in the CWA can be reproduced in MS Excel, those numbers are all correct.
  3. All arrows and the respective coloring in the App can be explained, they are correct.
  4. Still, the referenced wording above seems to lead to confusion about meaning of the value, aggregation of the value, and translation of the dynamics into the arrow indicator.
  5. The naming of each statistics tile in the CWA is clear, but still will be interpreted differently by the folks.
  6. Yes, there are days when the arrows don't follow each other; the one is rising, the other one goes down or stays. Again, this can be proven and explained statistically.

Therefore, we decided to start a new task of communication - it's not yet clear if it becomes a blog, an FAQ entry or any other kind of media. We'll try to "translate" the intention of the statistical metrics shown in the CWA and what are the key drivers for the "trend arrow" indicator.

Believe me, this will not be an easy and fast task, as it challenges us to gain trust by "translating" the statistics into consumable portions of knowledge - how to read the tiles. So, I kindly ask you to stay patient. Furthermore, I want to encourage you to give feedback, once we provide first results in this matter.

GisoSchroederSAP commented 3 years ago

One more word to @MikeMcC399 and @nilsalex: I cannot comment on the full issue here. But I want to let you know (and hope you can adjust your viewpoint and accept): The 7-day-Incidence is not a 7-day-trend. Instead, the 7-day-Incidence is a normalized value accurate to the current day only, but based on the sum of new infections during the last 7 days. Therefore, this value must not be compared to the Incidence value of "day-7" but simply to the Incidence value of yesterday (which is, in fact, based on the new infections of the 7 days before that).

MikeMcC399 commented 3 years ago

@GisoSchroederSAP

Thank you for the response and information!

It seems that the help text is difficult to interpret correctly concerning what falls under the category of a "7-day trend". Could you help us out so that we understand this better?

For each of the four values which have a trend arrow:

  1. Confirmed New Infections: 7-Day Average
  2. Warnings by App Users: 7-Day Average
  3. 7-Day Incidence
  4. 7-Day R Value

... could you let us know if the arrow (Upwards, Downwards or Steady) is calculated based on comparing to the corresponding number displayed the previous day or the number displayed 7 days previously?

For "7-Day Incidence" you told us in the previous post that the trend depends on the number displayed from the previous day.

GisoSchroederSAP commented 3 years ago

We are going to write that down, I promise. The naming of "7-Day Incidence" may mislead the reader; it is to be read as "Today's Incidence (based on the sum of nationwide infections of the last 7 days normalized to 100.000 of all German citizens)". But certainly, this is much longer than the initial name, and maybe not even easier to understand, sorry.

nilsalex commented 3 years ago

@GisoSchroederSAP Thanks for looking into this!

The naming of "7-Day Incidence" may mislead the reader; it is to be read as "Today's Incidence (based on the sum of nationwide infections of the last 7 days normalized to 100.000 of all German citizens)"

I don't think there is any confusion about the definition of the 7-day incidence. And because this metric is defined as above, I really don't get how it can follow a different trend than the 7-day average, which is also -- please correct me if I'm wrong -- based on the sum of nationwide infections of the last 7 days. So I guess my question really is:

Is it not the case that both numbers are the same up to a constant relative factor (of about 7*100,000/83,000,000)? If so, a user cannot expect to see different trends for both numbers, right?

MikeMcC399 commented 3 years ago

@nilsalex The number used for the population of Germany by RKI is close to the 83 Million which you assumed. In https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Daten/Fallzahlen_Daten.html => https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Daten/Fallzahlen_Kum_Tab.xlsx Tab "Tageswerte berechnet" Cell A36 it uses the number 83166711 (which is the number displayed on https://www.destatis.de/DE/Themen/Gesellschaft-Umwelt/Bevoelkerung/Bevoelkerungsstand/Tabellen/zensus-geschlecht-staatsangehoerigkeit-2019.html for the date 31.12.2019).

I would also like to understand the difference in the two trends. I agree that it is not intuitively obvious that they should be different, so I'll be waiting with interest for the details of the calculations. I take on the statement from @GisoSchroederSAP that the calculations are correct, so I expect the reasons for differences will be caused by the calculation methods used.

GisoSchroederSAP commented 3 years ago

Quick note to @nilsalex: You are referring to a constant relative factor, which I don't understand. Key for the calculation is the weighted occurrence of new infections per region (= federal state) based on its fraction 100.000/citizens. You may refer to the factor 100000/83166711 = 0,001202404, which you can easily multiply by the total number of new infections across all 16 federal states during the last 7 days, 47.436 (Feb 17). This results in the "7-day-Incidence" (for Feb 17) of ~57,04, which perfectly matches the number shown in the App, right? However: this is already "normalized" to ~5,2 million citizens (per federal state), which would result in a "normalized" factor of? Correct: 0,001202. You may now want to build the average number of infections per federal state (47436/16 = 2964,75), multiply it by the constant factor, and multiply this value by 16 to bring it back to the nationwide incidence, which results in: again 57,01.

What I am just saying: yes, you can flatten everything by average calculation, but in that case you also have to flatten the distribution of citizens across the country and the number of infections in each region. Those numbers are related to 100.000 citizens for a good reason, and this approach also leads to a value of 57 (nationwide) that is not the average of the incidence values for the 16 federal states (which was 62 on Feb 17). That's why I call it "weighted".
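The arithmetic in this comment can be reproduced with a short sketch, using only the population figure and the Feb 17 case total quoted above. The variable names are hypothetical.

```python
POPULATION = 83_166_711  # population figure quoted above from the RKI spreadsheet
CASES_7D = 47_436        # nationwide new infections over the last 7 days (Feb 17)
N_STATES = 16

factor = 100_000 / POPULATION
incidence = CASES_7D * factor

# Route via the average number of infections per federal state,
# multiplied by the same factor and scaled back up by 16.
incidence_via_state_average = (CASES_7D / N_STATES) * factor * N_STATES

print(f"factor    = {factor:.9f}")
print(f"incidence = {incidence:.2f}")
print(f"via state average = {incidence_via_state_average:.2f}")
```

With the unrounded factor, both routes give the same value of about 57.04, since dividing by 16 and multiplying by 16 cancel.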

nilsalex commented 3 years ago

@GisoSchroederSAP I may refer back to my initial report, where I was stunned by the different trend for the two metrics, which are:

7-day average: nationwide total of confirmed infections over the last 7 days divided by 7
7-day incidence: nationwide total of confirmed infections over the last 7 days times 100,000/83,166,711

Obviously, this yields a constant ratio (7-day incidence) / (7-day average) = 7*100,000/83,166,711 or, in other words, both metrics are the same up to a constant relative factor of 7*100,000/83,166,711. (The exact total of the German population is not relevant for this argument, so I settled for 83,000,000.)

From the numbers you gave me, you seem to agree with this premise. Is this correct? So I would expect to see the same trend for both metrics. Even more so, I fail to understand how the trend could ever be different.

nilsalex commented 3 years ago

I take on the statement from @GisoSchroederSAP that the calculations are correct, so I expect the reasons for differences will be caused by the calculation methods used.

Oh, that may very well be true and I am also curious about that, but then I would argue that the method should be different. (In the sense that it should be the same for both metrics ;-) )

GisoSchroederSAP commented 3 years ago

Sorry, now, please understand: The 7-day Incidence IS NOT a linear average, it is a single value WEIGHTED by population. Your calculation

(7-day incidence) / (7-day average) = 7*100,000/83,166,711

is NOT valid, as the average is NOT WEIGHTED at all.

nilsalex commented 3 years ago

You are absolutely correct in saying that a bottom-up calculation from the incidences in federal states is in fact a weighted average. I don't dispute this fact.

But then again, we can calculate the same numbers using nationwide totals, as you confirmed earlier.

We can give a proof, just so we are on the same page.

Definitions:
s_k: confirmed infections for federal states over the last seven days
S: nationwide confirmed infections over the last seven days
n_k: population of federal states
N: nationwide population
i_k: 7-day incidence for federal states (n.b. dimensionless, we don't need to say "per 100,000")
I: 7-day incidence nationwide

S = Σ s_k
N = Σ n_k

i_k = s_k / n_k

I = (Σ i_k * n_k) / (Σ n_k) = (Σ i_k * n_k) / N

#######################

Theorem: I is the quotient of S and N
Proof: I = (Σ i_k * n_k) / N = (Σ s_k * n_k / n_k) / N = (Σ s_k) / N = S / N
∎

You see, it is perfectly permissible to state the problem using the nationwide totals. We could state it otherwise, it doesn't really matter. If it's preferred by you, we can talk about weighted averages. The problem at hand still stands: The numbers are essentially the same (assuming a constant population, but I think that's given?) and therefore it does not make sense to show a different trend.
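As a numerical check of the theorem, here is a small sketch with exact fractions. The three regions and their figures are hypothetical, chosen only to have very different populations.

```python
from fractions import Fraction

# Hypothetical regions: (confirmed infections over 7 days s_k, population n_k).
regions = [(120, 50_000), (300, 400_000), (2_000, 1_500_000)]

S = sum(s for s, _ in regions)  # nationwide infections
N = sum(n for _, n in regions)  # nationwide population

# Regional incidences i_k = s_k / n_k (dimensionless, exact).
incidences = [Fraction(s, n) for s, n in regions]

# Population-weighted average of the regional incidences.
weighted = sum(i * n for i, (_, n) in zip(incidences, regions)) / N

# Unweighted mean of the regional incidences, for contrast only.
unweighted = sum(incidences) / len(incidences)

print("weighted  :", float(weighted * 100_000))
print("S / N     :", float(Fraction(S, N) * 100_000))
print("unweighted:", float(unweighted * 100_000))
```

The weighted average and S / N agree exactly, while the unweighted mean of the regional incidences is a different number.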

GisoSchroederSAP commented 3 years ago

Guys, I will write this calculation down sometime, I promise. The above calculation by @nilsalex fails because of the formula for the 7-day incidence

i_k = s_k / n_k

and because of the incorrect statement:

(n.b. dimensionless, we don't need to say "per 100,000")

In fact, in order to get the values "normalized" to the same "portion" of the population, you would try:

p_k: population of the state k
N = Σ p_k
n_k = p_k / 100.000: normalized portion, as stated in "Infections PER 100.000 PERSONS"

Following your approach i_k = s_k / n_k, this leads to the official formula

i_k = 100.000 * s_k / p_k or i_k = s_k * 100.000/p_k for each state, and you can easily expand this to
I = 100.000 * (Σ s_k) / (Σ p_k) = 100.000 * S/N = S * 100.000/N

Again, please note the 100.000/p_k "normalization". I think this is the missing link. As far as I see, your whole calculation is a simple linear arithmetic average calculation. I suggest you check this yourself by calculating the factor 100.000/p_k for each state to see the different weights in the product with s_k.

In the end: Yes, you can easily create the average population p_average for each state by adding all populations p_k into N and dividing by 16. Yes, you can easily create the average number of infections for any state, s_average, by adding all infections into S and dividing by 16. If you now do the same with the factor 100.000/p_k, you may create the "average factor 100.000/p_k" just by adding the fractions and dividing by 16 - say: f_average

Now compare f_average (0.04113) with the expected value of 100.000/N (0.0012024)

Please excuse me if I do not immediately comment on each alternative calculation, as it becomes time-consuming to validate other approaches. We will try to create "consumable" communication about the CWA's statistics. And I am in close contact with the RKI experts and with the SAP Analytics Department for further validation. This GitHub issue now already has the full explanation of the math and why you cannot link the trend of average new infections with the absolute incidence based on absolute infections per region weighted/normalized by the respective population. Thx.

nilsalex commented 3 years ago

If you now do the same with the factor 100.000/p_k, you may create the "average factor 100.000/p_k" just by adding the fractions and dividing by 16 - say: f_average

This is the misunderstanding. Why would I do that? I never referred to the unweighted average of state-wide numbers. In fact, I never touched numbers for individual states until you brought them up :-)

The average I am concerned with is (nationwide confirmed infections for the last 7 days) / 7, because that is the metric shown in CWA.

To cite myself:

7-day average: nationwide total of confirmed infections over the last 7 days divided by 7
7-day incidence: nationwide total of confirmed infections over the last 7 days times 100,000/83,166,711

Obviously, this yields a constant ratio (7-day incidence) / (7-day average) = 7*100,000/83,166,711 or, in other words, both metrics are the same up to a constant relative factor of 7*100,000/83,166,711. (The exact total of the German population is not relevant for this argument, so I settled for 83,000,000.)

From the numbers you gave me, you seem to agree with this premise. Is this correct? So I would expect to see the same trend for both metrics. Even more so, I fail to understand how the trend could ever be different.

I would kindly ask you not to dismiss this report prematurely. The problem has still not been addressed.

(Also, the incidence really is dimensionless. We don't need to introduce an artificial reference population number. I could of course do that, but all arguments are unaffected.)

GisoSchroederSAP commented 3 years ago

Sorry, if you don't accept that the incidence is a "normalized" number with a local factor depending on population, we will probably never reach agreement. We agree to disagree.

May I ask why the RKI would provide different local incidence numbers, and how to consolidate those local incidences into a single nationwide number? Do you expect, with your calculation, that the values of Bremen (680.000 citizens) have the same weight in the nationwide incidence calculation as those of Bavaria with 13.12 million citizens (factor ~20)?

If so, well, then we are talking about different models, and your incidence is just a simple average calculation. Yes, in that case it should always follow the average trend of the new infections - but sorry, you will not get the same (incidence) numbers that are

The RKI model is different from yours, and therefore, the average model does not count for the incidence, and therefore, the trend of new infections is not related to the development of the incidence on a daily level.

To make that crystal clear: According to the model/approach, the data are correct, and the description clearly refers to the "normalization factor" per 100.000 citizens. @nilsalex , you may not agree to the model - but you cannot call the numbers or the trend indicator wrong - those numbers are valid.

nilsalex commented 3 years ago

This is a gross misrepresentation of my statements. I never said anything of the above. Any careful reader following along will understand this.

I still don't understand how you got the idea that I want to average incidences of federal states without any weight? This would be wrong and I do not propose this. It does not follow at all from my presentation of the math. You calculate the weighted average in a way I 100% agree with---and because of this, it is just another representation of the nationwide 7-day average. As I have shown using basic math in the hope that some nomenclature would clear things up.

Now, there may be a reason why trends are being calculated differently, but this is not at all obvious and probably a bad choice if it results in this discrepancy. I am very curious about this, but until this is resolved, the issue stands and will not go away by sparring over unrelated issues that aren't even disputed.

Can we please agree to lower the temperature? Again, I did not say any of the things you accused me of.

nilsalex commented 3 years ago

Maybe the problem is indeed the "bottom-up" approach, which can yield slightly different numbers than the straightforward---but mathematically equivalent---approach of just taking the nationwide totals. Two factors may play a role:

Curiously, the first tab defines the nationwide incidence as =B22/A36*100000. As I have said and shown repeatedly, this approach is just as valid as the bottom-up approach, but less error-prone. Adopting this, trends shouldn't show this weird glitch.

GisoSchroederSAP commented 3 years ago

Then please show us where in your calculation you bring in the "infections per 100.000". Maybe I missed this part. All I am saying: the trend of the rolling (linear) average of the absolute number of infections (without any relation to regions) across the nation is not directly related to the non-linear but weighted/normalized value of the incidence, which is a number related to 100.000 citizens as documented (and there are good reasons for the authorities to normalize to 100.000). As far as I understand, there is no weird glitch, sorry.

Okay, let's go with numbers and compare (data from Feb 17):

212 new infections during the last 7 days for Tirschenreuth - Incidence: 294 (given by RKI, evaluated with the above formula)
212 new infections during the last 7 days for Osterzgebirge - Incidence: 86

What would be the "combined incidence" (as we cannot talk about nationwide)? Average = 190? Or the weighted incidence per 100.000 = 133? Background: Tirschenreuth has 72.406 citizens, Osterzgebirge has 245.586.

BTW: All my calculations and samples above are based on exactly the same file Fallzahlen_Kum_Tab.xlsx you are referring to. With those, I can exactly reproduce the numbers shown in the App. And finally, I disagree: your model is not mathematically equivalent, as it does not include the localization/weighting factor "100.000/p_k". I tried to explain this already by adjusting your model with this factor, which leads exactly to the calculation used by the RKI. And this factor does not go linear for the nationwide incidence trend (as it is related to local population), while the linear average trend of the absolute number of new infections has no relation to the population.

Can you please at least agree:

Average number of new infections during the last 7 days: not related to population
Incidence number based on the total number of new infections during the last 7 days: related to population

Have a good night.

nilsalex commented 3 years ago

Everything you say about the bottom-up calculation is correct. There is no disagreement on the matter, only on a mathematical presentation by me which is 100% correct and also not in disagreement to your "correction" (can't be a correction if there wasn't anything wrong to begin with ;-) ) In fact, you confirmed my theorem (I feel almost silly for saying it like this, but it is best to clear up mathematical problems with mathematical language) with exact numbers in https://github.com/corona-warn-app/cwa-documentation/issues/528#issuecomment-782024045 .

To re-iterate:

If you wish to use other units, we can multiply everything with 100,000---not material to the argument. I am sorry for having brought that up, I should have just stuck to the notion which uses this arbitrary unit. But again, not material.

I mean, you name-dropped the RKI above, which in the first tab of the Excel file uses exactly what I'm saying all along is an equivalent formula: =B22/A36*100000.

GisoSchroederSAP commented 3 years ago

I think (and will do the evaluation later), the issue comes from the two points:

image

The approach I = (Σ i_k * n_k) / N just does not work, because the value i_k cannot be put in the sum and averaged later on. Therefore, the following "cancel out operation" does not work: (Σ s_k * n_k / n_k) / N = (Σ s_k) / N

I played around with a few simple numbers to visualize:

is equal to the one that is the correct one (shown in cells G8 and G9 - both calculated without intermediate usage of any i_k value).

Translation of my understanding: The metric incidence is an absolute number (not a trend!), based on the absolute number of new infections of the last 7 days, and is normalized/weighted to the population of the considered region. The development/trend of this metric cannot be directly derived from former values of the incidence. As the metric gets normalized to the regional population, the trend of the incidence does not necessarily follow the trend of the linear average of the nationwide number of new infections during the last 7 days (as this has no relation to the population at all). Additionally: even if the current number of new infections rises, the rolling average of the last 7 days can decrease. (This does not relate directly to our topic here, but I wanted to make this clear. It just means: it sometimes sounds silly, but it is still true.)
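The last point, that the rolling 7-day average can fall while the newest daily number rises, can be illustrated with a small sketch. The daily counts are hypothetical and the function name is an assumption.

```python
# Hypothetical daily case counts, oldest first (not real data).
daily_cases = [9000, 8000, 8000, 8000, 8000, 8000, 8000, 8500]


def rolling_average(series, end, window=7):
    """Mean of the `window` values ending at index `end` (inclusive)."""
    chunk = series[end - window + 1 : end + 1]
    return sum(chunk) / len(chunk)


avg_yesterday = rolling_average(daily_cases, 6)  # days 1 to 7
avg_today = rolling_average(daily_cases, 7)      # days 2 to 8

print(f"newest daily value rose: {daily_cases[-2]} -> {daily_cases[-1]}")
print(f"7-day average fell:      {avg_yesterday:.1f} -> {avg_today:.1f}")
```

The average falls because the day dropping out of the window (9000) is larger than the day entering it (8500), even though 8500 is higher than the previous day's 8000.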

I am not sure, if I can convince you with the sample calculation above, but I want to repeat again: The numbers are correct, the trends are correct, and there is no correlation between

nilsalex commented 3 years ago

I am very curious as to what your definition of national incidence from regional incidences is. That is to say, the function

I(i_k, n_k)
where
i_k: regional incidences
n_k: regional population

Because you seem to disagree with the basic mathematical notion of the weighted average

I(i_k, n_k) = (Σ i_k * n_k) / (Σ n_k)

which is of course the only sensible definition. How else would you aggregate intensive properties?

So, please, what is the "correct" formula? What did you type into Excel?

Because: I agree. The numbers are correct. Up to rounding errors or other glitches like inconsistencies with state-wide population totals. This is easily cured by just using the totals to begin with. It is that easy.

The issue is rather minuscule because it should only occur for corner cases. However, this case did come up repeatedly last week as R_t approached 1. CWA as a source of information is a great idea, so we should do everything we can to present the information consistently.

GisoSchroederSAP commented 3 years ago

So, please, what is the "correct" formula? What did you type into Excel?

Already done multiple times: take either I = S/N or I = s_avg/n_avg, but do not insert i_k as it does not go linear with s_k (you did check the Tirschenreuth/Osterzgebirge example here, didn't you?)

image

The information provided by the CWA is correct and consistent (and we don't at all talk about rounding errors here, please). I just kindly ask you not to compare the trend of one number (without any relation to population) with another number (that is related to the population by definition).

Thank you.

MikeMcC399 commented 3 years ago

@nilsalex / @GisoSchroederSAP

Have you considered the different dates used for the two different values? I think this means that the data sets used may be slightly different, depending in one case on when the data was received by RKI and in the other case on when the data was received by the Gesundheitsamt.


https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html says

"The number of cases, its difference from the previous day, and the number of deaths refer to cases that are transmitted to the RKI daily. This includes cases that were reported to the local health authority (Gesundheitsamt) on the same day or on earlier days. The cases in the last 7 days and the 7-day incidence are based on the reporting date at the local health authority, that is, the date on which the local health authority became aware of the case and recorded it electronically."

nilsalex commented 3 years ago

I = S/N

Is exactly the point.

(you did check the Tischenreuth/Osterzgebirge example here, didn't you?)

Let me walk you through.

n_1 = 72,406
n_2 = 245,586
s_1 = 212
s_2 = 212
i_1 = 294
i_2 = 86

I1 = (Σ i_k * n_k) / (Σ n_k) = (294 * 72,406 + 86 * 245,586) / (72,406 + 245,586)
   = 133.36
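
A minimal Python sketch of this calculation, using only the figures listed above. The function and variable names are illustrative, and the regional incidences are the rounded values quoted in the thread, so the bottom-up and top-down results agree only up to that rounding:

```python
def weighted_incidence(incidences, populations):
    """Population-weighted average of regional incidences (bottom-up)."""
    total = sum(i * n for i, n in zip(incidences, populations))
    return total / sum(populations)


def incidence_from_totals(cases, populations, per=100_000):
    """Incidence computed directly from summed cases and population (top-down)."""
    return sum(cases) / sum(populations) * per


n_k = [72_406, 245_586]  # regional populations
s_k = [212, 212]         # regional 7-day case counts
i_k = [294, 86]          # regional incidences as quoted (rounded)

bottom_up = weighted_incidence(i_k, n_k)
top_down = incidence_from_totals(s_k, n_k)
print(round(bottom_up, 2), round(top_down, 2))
```

The small remaining gap between the two values comes from the rounded inputs i_k, not from the weighting itself.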

What are you even trying to argue?

but do not insert i_k as it does not go linear with s_k

Is a nonsensical statement.

The numbers may be correct in some sense, but they are inconsistent. There are explanations, and @MikeMcC399 proposed another one. But please stop trying to gaslight me with phony mathematical arguments.

We should strive for consistency and use the consistent solution. Which you agree with:

I = S/N

Please forgive me, but I am not willing to accept your accusations anymore. Stop it, please.

GisoSchroederSAP commented 3 years ago

Final comment here: I did a sample of three days for a few selected federal states and the nationwide summary. This sample should support my statements:

image

In order to keep my promise, I will stop explaining again and again what can be reviewed and validated by everyone. All formulas have already been given, but my goal is to make these numbers "consumable" for the users. It does not seem fruitful to discuss the bottom-up/top-down or any other approach on a statistics level anymore. If I cannot convince you, @nilsalex, then we are at a dead end here, sorry, as I seem to be unable to dispel your concerns. You may address your statement of mathematical inconsistency of the data directly to the RKI and to the T-Systems data analysts. I'm happy to help you with finding the right contacts, if you wish.

Hopefully, I provided some insight and was able to earn the trust of the other users following this issue. Thank you.

nilsalex commented 3 years ago

The trends of i_k and s_k are different (which is not by rounding errors) and they are not "bound" to each other, they develop independently.

No, this is impossible. Your tabulation must contain some errors. You first confirm the constant ratio, but then calculate a different trend. Going bottom-up, this accumulates error. The solution: top-down.

I mean, bottom-up works. But it is a detour where you can make mistakes. We could fix them or go the easy way.

I'/I = (sum(n_k i'_k) / sum(n_k)) * (sum(n_k) / sum(n_k i_k)) = sum(s'_k) / sum(s_k) = S'/S
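
The identity can also be checked numerically. The sketch below uses made-up regional figures (hypothetical, not RKI data) and exact fractions, and it assumes unrounded regional incidences i_k = s_k / n_k, which is the condition under which the identity holds exactly:

```python
from fractions import Fraction

# Hypothetical regional data, for illustration only (not RKI figures).
n = [72_406, 245_586, 1_000_000]  # populations n_k
s_old = [212, 212, 500]           # cases s_k in the earlier window
s_new = [190, 230, 520]           # cases s'_k in the later window


def national_incidence(cases, populations):
    """Weighted average of exact regional incidences i_k = s_k / n_k."""
    incidences = [Fraction(c, p) for c, p in zip(cases, populations)]
    weighted = sum(i * p for i, p in zip(incidences, populations))
    return weighted / sum(populations)


ratio_incidence = national_incidence(s_new, n) / national_incidence(s_old, n)
ratio_cases = Fraction(sum(s_new), sum(s_old))
print(ratio_incidence == ratio_cases)
```

If the regional incidences are rounded before aggregation, the two ratios can differ slightly.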

If you cannot accept this, I am not to blame.

MikeMcC399 commented 3 years ago

I haven't dug in to the details quite so deeply, but I'm convinced that the dates are the cause of the issue.

Today, Saturday, Feb 20, 2021 as shown by the app:

The 7-Day Average is the sum of confirmed new infections today and the previous six days, which is 50 436, divided by 7 days = 7 205.

The 7-Day Incidence is the sum of infections based on the date the infection was reported to the Gesundheitsamt. The number labelled "Fälle in den letzten 7 Tagen" is reported to be 48 042. (Note this is a different number to 50 436 above.) This number normalized against the nominal population of the country (100 000 / 83 166 711) gives a 7-Day Incidence of 57.8

If the values of "7-Day Average" and "7-Day Incidence" are based on different data sets due to the underlying calendar dates, then the trends of the two values may also differ.
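
The arithmetic above can be reproduced with a short Python sketch. The variable names are illustrative; the figures are the ones quoted in this comment. The last line shows the incidence that the first sum would imply, purely to make the gap between the two data sets visible:

```python
POPULATION = 83_166_711  # nominal population quoted above

cases_for_average = 50_436    # today plus the previous six days
cases_for_incidence = 48_042  # "Fälle in den letzten 7 Tagen"

seven_day_average = cases_for_average / 7
seven_day_incidence = cases_for_incidence / POPULATION * 100_000

# Incidence that would result if the first sum were used instead.
implied_incidence = cases_for_average / POPULATION * 100_000

print(round(seven_day_average), round(seven_day_incidence, 1))
print(round(implied_incidence, 1))
```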

GisoSchroederSAP commented 3 years ago

This is true, Mike: different data sets are another reason why the numbers differ (and this should be communicated clearly). However, I worked only on the one file _Fallzahlen_KumTab.xlsx to validate the incidence numbers, plus the population numbers from https://de.statista.com/. Still, I want to emphasize: the 7-Day Average of new infections is an absolute number (counted nationwide) without any relationship to the distribution of the regional population. It's just a rolling average number across the nation. The Incidence is "bound" to the weighted number of regional new infections (based on population), it is not a rolling average number across the nation. Those two numbers are not at all "equivalent", imho.

MikeMcC399 commented 3 years ago

Hi Giso @GisoSchroederSAP

I ran a correlation test using the Excel CORREL() function comparing sets of 14 days' data for 7-Day Average and 7-Day Incidence going back to Jan 1, 2021 and the correlation varies between 95% and 99% (it is never 100%), so it can be very tempting to assume that the data sets are equivalent, because they are so close. As we've seen though, they are not the same!
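
A rough Python equivalent of such a test, as a sketch only. Excel's CORREL() returns the Pearson correlation coefficient, which is computed by hand here. The two series are hypothetical placeholders (not RKI data), and the helper names are made up for illustration:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL() returns."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rolling_correlation(xs, ys, window=14):
    """Correlation of each consecutive `window`-day slice of two series."""
    return [
        pearson(xs[k:k + window], ys[k:k + window])
        for k in range(len(xs) - window + 1)
    ]


# Hypothetical series, for illustration only (not RKI data).
average = [7000 + 40 * d + (-1) ** d * 30 for d in range(20)]
incidence = [58.0 + 0.33 * d + (-1) ** (d // 2) * 0.3 for d in range(20)]

correlations = rolling_correlation(average, incidence)
for r in correlations:
    print(f"{r:.3f}")
```

A high correlation only says that the two series move together most of the time; it does not rule out opposite trend arrows on an individual day.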

Regarding your point about the 7-Day Incidence being weighted: I'm not seeing this in the Excel Fallzahlen_Kum_Tab. If you take the value "Gesamt" in Line 20 of "BL_7-Tage-Fallzahlen" and divide it by the population factor 831.66711 you get exactly the "Gesamt" value in Line 20 of "BL_7-Tage-Inzidenz". For example, from yesterday, the value in cell KE20 of "BL_7-Tage-Fallzahlen" is 47266, divided by 831.66711 makes 56.8, which is exactly the value in cell KF20 of "BL_7-Tage-Inzidenz". That is true for every day except Jan 27, 2021 going all the way back to May 6, 2020. I assume that one day is just a glitch.
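
This kind of row-by-row check can be sketched in Python. The helper name and the row labels are made up for illustration; the first row uses the values quoted above, and the second row is a deliberately wrong, hypothetical entry to show how a glitch day would be flagged:

```python
POPULATION_FACTOR = 831.66711  # population / 100,000, as quoted above


def find_mismatches(rows, factor=POPULATION_FACTOR):
    """Return labels of rows whose published incidence differs from
    cases / factor rounded to one decimal place."""
    return [
        label
        for label, (cases, published) in rows.items()
        if round(cases / factor, 1) != published
    ]


rows = {
    # label: (7-day cases, published 7-day incidence)
    "quoted day": (47_266, 56.8),           # values from the comment above
    "hypothetical glitch": (47_266, 57.5),  # made-up mismatch for illustration
}
print(find_mismatches(rows))
```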

Whether it's a weighted result or not doesn't really matter though if the underlying datasets are different.

My conclusion anyway is that although the 7-Day Average and the 7-Day Incidence are closely related, their trends may not be the same on any given day, so I agree with you that the display is correct based on the data source Fallzahlen_Kum_Tab as comparison.

It may be worth reviewing the help text though because it can easily be misunderstood that the trend of the 7-Day Incidence is based on a comparison with 7 days previously and not with the value two days ago.

nilsalex commented 3 years ago

@MikeMcC399 Yes, I agree. Such an effect can be an explanation for the discrepancy. However, it should not be the reason. Because, the expectation is clear:

I = S/N
I'/I = S'/S

This does not change for values calculated from regional values. Any objections to this basic fact by @GisoSchroederSAP are wrong on the merits.

We cannot dispute proven mathematical facts.

Now, if there are different data sources for both values, we should settle on one of them, absent a good reason not to. But which reason would that be?

Edit: Sorry, I did not see your latest comment. So it is the explanation. Thanks for digging into this! I would suggest consolidating the metrics. The current state breaks the expectations of any reasonable user.

GisoSchroederSAP commented 3 years ago

Thanks for making this double-check, @MikeMcC399. And yes, the "wording" of the help text was the very first thing I raised internally with the product owner. This is already under review.

nilsalex commented 3 years ago

I cannot state this enough:

The Incidence is "bound" to the weighted number of regional new infections (based on population), it is not a rolling average number across the nation.

Is just false. Assuming both metrics refer to the same set, of course; that is, both or neither are corrected w.r.t. symptom onset.

(To be perfectly clear: Yes, it is the weighted average of local incidences. But incidentally (pun intended), this translates into the nationwide incidence which is the ratio of nationwide totals. By multiplication with national population, you have the nationwide infections over the last 7 days.)

The hostility towards me because you disagree with this basic fact has no place here. I am truly disappointed that people are treated this way in this community.

Now, you say "won't fix" because you have a good reason for using different numbers (one corrected, one not corrected, whatever). That is kind of acceptable, although not optimal. But your entire argument and personal attacks did not revolve around this.

GisoSchroederSAP commented 3 years ago

@nilsalex ,

This does not change for values calculated from regional values. Any objections to this basic fact by @GisoSchroederSAP are wrong on the merits.

Then just explain the difference between all these numbers I'/I and S'/S for any given day. They are calculated directly from one single source (in fact, the source numbers are all in the one table above, not from different sources), and yes, these numbers are quite close together.

image

If you'll excuse me, I'm going to stop the discussion here. We have different views on this; I can live with that and will return to my task.

MikeMcC399 commented 3 years ago

@GisoSchroederSAP

When the facts have been checked with the product owner, we should also consider updating the FAQ https://www.coronawarn.app/en/faq/#further_details including the point about how the data movements of 7-Day Average and 7-Day Incidence are only loosely coupled with an explanation of why this is so.

Probably this has not been obvious before because the RKI daily situation reports do not show a trend for these two indicators. The press tends to use the 7-Day Incidence alone. This may be the first time that the two values have been displayed together closely and with trends. The display is likely to cause confusion to other people even though it is technically correct.

GisoSchroederSAP commented 3 years ago

Thanks, @MikeMcC399, I can already state that the FAQ is also under review. We will definitely enhance this communication over time.

Ein-Tim commented 3 years ago

I'm curiously reading this, and I really don't understand anything about these numbers, etc, so I won't make any statement here.

But I want to ask:

What should we do now? IIUC, @nilsalex does not consider this solved, but @GisoSchroederSAP does. Maybe the best way is what has been proposed above by @GisoSchroederSAP:

You may address your statement of mathematical inconsistency of the data directly to the RKI and to the T-Systems data analysts. I'm happy to help you with finding the right contacts, if you wish

Would that be a good solution for all parties involved here?

GisoSchroederSAP commented 3 years ago

I never accused you of hostility or insults. I never used the expressions mentioned above. I only explained what I think is right and what I think is wrong with your argumentation. Please excuse me if this felt threatening to you; this was definitely not my intention.

Again: I offer support, getting you contacts at the source of the data and calculations. You may discuss and resolve this there.

Good evening.