PRIO: Twitter Reports: Risk status changes without new checks carried out

Ein-Tim commented 2 years ago

Avoid duplicates

[X] Bug is not mentioned in the FAQ
[X] Bug is specific for Android only, for general issues / questions that apply to iOS and Android please raise them in the documentation repository
[X] Bug is not already reported in another issue

Technical details

Device name: ZTE A2020G Pro
Android version: Android 10
App version: 2.18.1

Describe the bug

I have a Twitter user reporting the following: She received an increased risk encounter for the 04.03.22 two days ago. Since she received the encounter, the app switches between increased risk and low risk.

See these screenshots from today, both were taken with the last risk check carried out at 04:22h:

06:28h	08:10h

Steps to reproduce the issue

I'm not sure whether this happens with all risk encounters at the moment or what exactly could be done to reproduce this. I did not identify any special things while writing with the user on Twitter.

Expected behaviour

The app should never switch between different risk statuses without carrying out a new check.

Additional context

Here is the EN log of the user: all-exposure-checks.txt
The user does not have any check ins in her CWA.
Report on Twitter: https://twitter.com/mimihasstnazis/status/1502180895104684032?s=20&t=4HWEXQmOxeZk4DfSutuE5Q

Pings

@thomasaugsten Please take a look & let me know if you want a CWA error log. @dsarkar, @larswmh or @svengabr Please mirror this to JIRA ASAP. FYI @vaubaehn @mtwalli

Internal Tracking ID: EXPOSUREAPP-12211

larswmh commented 2 years ago

Thanks for forwarding this Twitter report @Ein-Tim. We have created an internal ticket for this and will raise this topic internally. Internal Tracking ID: EXPOSUREAPP-12211

Corona-Warn-App Open Source Team

Ein-Tim commented 2 years ago

CONFIRMED Same behavior by a second Twitter user: https://twitter.com/tweet_matz/status/1501647192959946754?s=20&t=zXcsNIYR4AqSJ48uQ5P8Fw

In the morning	2 hours later

The user has no check ins in his CWA.

Here is the EN log: all-exposure-checks.txt

Device name: OnePlus 7Tpro (HD1913)
Android version: Android 11
App version: 2.18.1

@thomasaugsten I'm a bit concerned that this could affect many users, this should be investigated and if it is a Google bug reported to Google.

@larswmh Would you please be so kind and also mirror this comment to the internal ticket?

thomasaugsten commented 2 years ago

We aware of the situation that the Android ENF sometimes return 0 key matches. We reported this already but it is not reproducible why the android ENF behaves like this

Ein-Tim commented 2 years ago

@thomasaugsten But how exactly does this explain the change in risk status without a new check?

vaubaehn commented 2 years ago

That's indeed strange.

I'm throwing in some facts that may fit here - or may not... just as a starting point.

CWA is calculating the risk status very often in the day, not only once after Diagnosis Keys have been downloaded from server and DKs have been provided to ENF. When risk calculation is triggered, CWA receives ExposureWindows() and uses them to calculate the current risk. So, theoretically a risk status change may occur, eventhough no new DKs have been downloaded/provided to ENF, but there is an ENF internal update of the ExposureWindow data base.

So, the only plausible explanation that I see currently - when assuming no new error has been introduced by CWA update or ENS update - , that there may be a timeout in ENF. A very long timeout, that makes CWA think, after DK provision to ENS no new matches have been found, and after some time ENF recovers, updates ExposureWindows db, and then, when CWA pulls an ExposureWindow update, gets a new/additional high risk encounter submitted.

Edit: But if this theory was true, that problem should already have been present before - at least during January/February.

Ein-Tim commented 2 years ago

@vaubaehn Okay I understand so it is indeed possible and "normal" that the risk status might change without a new check carried out?

vaubaehn commented 2 years ago

@Ein-Tim It's not "normal" (a timeout should not happen) but theoretically possible.

But I'm puzzled with the "hin- und her" of the status change... Could you ask them, if they were out during the night on the days with high risk encounters (risk encounter in bar, pub...) - then there could also be a problem, that may be related to inconsistent dates (i.e., when a new day starts with regard to midnight). Also, for the first user, it's obvious from the all-exposures-log, that her downloads/provision of DKs are not that frequent as they could be... Could you ask her, whether she's often on mobile or turns her phone off often (if not, that could indicate more strongly towards stuck ENF).

vaubaehn commented 2 years ago

@thomasaugsten

We aware of the situation that the Android ENF sometimes return 0 key matches. We reported this already but it is not reproducible why the android ENF behaves like this

From my observation, there was a change in ENF introduced in the end of last year: Only when a provision of Diagnosis Keys leads to a new match from a fresh, not already before provided TEK, a number of matches > 0 is returned. Could you cross check, or if I'm wrong, present some example cases for the number of matches?

With this hypothesis in mind, looking to the exposure log of the first users, looks like she's often in the crowd in a hot-spot-area and collecting risk contacts by this. The last provision alone showed 9 risk contacts. According to what I wrote above, I assume they're all distinct people, or longer contacts across several exposure windows.

Ein-Tim commented 2 years ago

@vaubaehn

It's not "normal" (a timeout should not happen) but theoretically possible.

Yeah, right.

Could you ask them, if they were out during the night on the days with high risk encounters (risk encounter in bar, pub...) - then there could also be a problem, that may be related to inconsistent dates (i.e., when a new day starts with regard to midnight).

I can ask them, however, I don't understand how this would lead to the behavior encountered by them. Maybe you could elaborate a bit?

Could you ask her, whether she's often on mobile or turns her phone off often (if not, that could indicate more strongly towards stuck ENF).

I'll ask her right now.

Ein-Tim commented 2 years ago

@vaubaehn Her answer:

hab auf dem Weg zur Arbeit und zurück immer mobile Daten an, aus ist es selten. Hatte auch eine zeitlang Bluetooth zwischendurch aus […], aber das ist jetzt auch schon glaub ich zwei, drei Wochen her

vaubaehn commented 2 years ago

I can ask them, however, I don't understand how this would lead to the behavior encountered by them. Maybe you could elaborate a bit?

This is just a guess into the very blue: I'm thinking of risk contacts that span over midnight, e.g., one exposure window was received shortly before midnight, one exposure window after midnight. If there is "somewhere" (and I'm sorry I can't tell it more concrete atm) an inconsistency between dates in ExposureWindows() returned from ENF and dates used for risk calculation according to the day the risk contact counts for - when in the first calculation of the day all ExposureWindows of the risk contact are aggregated to day "1" and in the second calculation (for some reason) are spanned/calculated to day "1" and to day "2", that could lead to an unxepected change of the risk status. The inconsistency could be related to evaluation of dates like UTC and UTC+1 differently for ExposureWindows and risk calculation (we had these inconsistencies in the past). So my thought was, if both users respond like "yes, on that day I was out over midnight) it would be worth to look deeper if there can actually be an inconsistency, and if they say "no", then we can skip thinking about it further.

vaubaehn commented 2 years ago

@vaubaehn Her answer:

hab auf dem Weg zur Arbeit und zurück immer mobile Daten an, aus ist es selten. Hatte auch eine zeitlang Bluetooth zwischendurch aus […], aber das ist jetzt auch schon glaub ich zwei, drei Wochen her

But at work she's on wifi? "timestamp":"11. März 2022, 13:39", "timestamp":"11. März 2022, 04:22", "timestamp":"11. März 2022, 00:13", "timestamp":"10. März 2022, 18:56", "timestamp":"10. März 2022, 06:40", "timestamp":"9. März 2022, 23:28", "timestamp":"9. März 2022, 18:35", "timestamp":"9. März 2022, 06:25", "timestamp":"8. März 2022, 18:51", "timestamp":"8. März 2022, 02:41", "timestamp":"7. März 2022, 22:34", "timestamp":"7. März 2022, 00:08", ...and so on...

You see what I mean? There are large gaps without exposure checks. If she's actually most of the time on wifi, only shortly on mobile data, and she doesn't turn off her phone, then this pattern is not normal.

Ein-Tim commented 2 years ago

But at work she's on wifi?

I asked her.

You see what I mean?

Yeah, totally!

Regarding https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065324933, okay, I understand what you mean & will ask!

Ein-Tim commented 2 years ago

Answer from user 2:

Am besagten Tag, 7.3., war ich nicht über Mitternacht unterwegs. Eigentlich hatte ich an dem Tag nur einen Kontakt, die Person ist aber nicht betroffen. Also weder positiv getestet noch rote CWA. Das macht es nochmal seltsamer.

vaubaehn commented 2 years ago

Am besagten Tag, 7.3., war ich nicht über Mitternacht unterwegs. Eigentlich hatte ich an dem Tag nur einen Kontakt, die Person ist aber nicht betroffen.

hmmm... It would have been the night from Sunday to Monday. That the user did not have a known contact to an infected person, makes it look like it's the "neighbourhood-warning" when receiving BLE through walls... So, actually nothing can be excluded, and in the same way there is no clear indicator for the root cause.

I think, in this case the all-exposures-log could be helpful. Do you think the user could provide it?

Theory:

users are living in places with (very) high infection rates
risk contacts may be most likely neighbours, people while shopping, people in public transport/cars in traffic jam, people in the queue at the testing site - everything that can be catched during the short interval ENF receives RPIs. But in sum, there are many.
Assumed, ENF is working reliably, there may be often status changes, usually retrieved directly after downloading Diagnosis Keys and provision to ENF. What we know is, that a few exposure windows with low risk at a certain time of the day (e.g., 2 people in public transport at 07:50am with low infectiousness) are summed up with exposure windows with low risk a bit later on that day (e.g., another 3 people at 08:05am with low infectiousness), and depending on the score may result in an increased risk overall. What I don't know is, whether the exposure windows that CWA pulls from ENF and evaluates in risk calculation are actually aged out on the fly on an hourly basis: That would mean, if you evaluate the risk for day x-14 with 5 exposure windows at 07:55am, a high risk may be calculated. At 08:00am 2 of these exposure windows age out, and the next calculation only takes 3 exposure windows into account, and calculates a low risk. The status change would be independent of any downloading/provision of Diagnosis Keys.
Assumed ENF is timing out often due to the load/amount of Diagnosis Keys, and if the above would apply, then everything would look even more confusing, when an increased risk (due to fresh TEKs that have been provided some time before) are evaluated and calculated for a new risk status with some delay.

@thomasaugsten : do you know, whether for risk calculation ExposureWindows retrieved from ENF may age out on an hourly basis?

Ein-Tim commented 2 years ago

@vaubaehn The log is included in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065222847, or do you mean something else?

vaubaehn commented 2 years ago

@vaubaehn The log is included in #4925 (comment), or do you mean something else?

@Ein-Tim Ah, sorry, I missed that. Yes, I meant this one. Ok, compared to the other user much less risk contacts (here: 17 matches in 15 days), but still a resonable amount overall (compared to me: 1 match in 6 months). So I think there is a higher likelihood of frequent status changes in the two observed cases.

thomasaugsten commented 2 years ago

Without risk check the risk status isn't changing Age out on Android is very exact on hour base Background check on Android depends on the settings of the OEM what they do on battery mode, low battery mode or other states What we see before the ENF state is switching between 0 matches and high risk matches and the user receive multiple time the high risk to no risk information. But so far no pattern visible

Ein-Tim commented 2 years ago

@thomasaugsten

Without risk check the risk status isn't changing

As you can see in the OP & in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065222847 this actually happens. Or am I misunderstanding you here?

vaubaehn commented 2 years ago

@thomasaugsten

Age out on Android is very exact on hour base

Thanks, good to know, this may explain the "sudden changes" to a lower risk within the day when when exposure windows of a certain hour age out.

What we see before the ENF state is switching between 0 matches and high risk matches and the user receive multiple time the high risk to no risk information. But so far no pattern visible

Could you elaborate technically, what exactly means "0 matches": are you referring to the number of ExposureWindows (objects) that are retrieved by getExposureWindows() (so in this case, getExposureWindows() returns NULL (no object))? Or do you refer to the number of matches that is displayed in Google's UI in the exposure history?

Without risk check the risk status isn't changing

My understanding is like this: Scheduled DownloadDiagnosisKeysTask is triggered > DK download > ENF provision of DKs > ENF sends intent (NO_MATCH or MATCH) > CWA receives intent and calls getExposureWindows() > risk calculation > change/no change of risk status

BUT: it looks like, CWA is also often processing getExposureWindows() (and most likely also does risk calculation after) independently of DownloadDiagnosisKeysTask: When we assume that CCTG didn't change CWA's behavior in risk checking etc., then microG reveals, that getExposureWindows()is called very often: Screenshot_2022-03-11-20-47-01

That would be statistically around 8 times per hour.

thomasaugsten commented 2 years ago

I refering to getExposureWindows but I cannot say the exact return value I think CCTG changed the calculation interval because they have not the 6 times per day limitation

vaubaehn commented 2 years ago

I refering to getExposureWindows but I cannot say the exact return value

Ok, thanks!

I think CCTG changed the calculation interval because they have not the 6 times per day limitation

Hm, I don't think they changed anything - you can see in the screenshot that provideDiagnosisKeys() is also only called every 4 to 6 hours. getExposureWindows()does not have any ENF quota.

vaubaehn commented 2 years ago

I refering to getExposureWindows but I cannot say the exact return value

Ok, thanks!

So when ENF returns NULL on getExposureWindows() when actually at least one object is expected, then the "easiest" explanation for me would again be ENF timeout...

thomasaugsten commented 2 years ago

Ok I only remember this change https://codeberg.org/corona-contact-tracing-germany/cwa-android/commit/e3e109145e4c1f61a9f6e700ddeb48ccfc2db348

I will double check about the outputs of getExposureWindows

vaubaehn commented 2 years ago

Ok I only remember this change https://codeberg.org/corona-contact-tracing-germany/cwa-android/commit/e3e109145e4c1f61a9f6e700ddeb48ccfc2db348

Very good catch! But strange anyways: My CCTG exactly behaves like CWA (every 4-6 hours download and provideDiagnosisKeys()) and microG reports the same. Just getExposureWindows() is called more often by factor ~40.

Maybe @fynngodau or @Bubu can comment, if they find time.

I will double check about the outputs of getExposureWindows

Perfect, thanks!

vaubaehn commented 2 years ago

@thomasaugsten Just found that your referenced CCTG PR "refresh at will" is merged to v2.14.4-dev. The last release is 2.14.3. https://codeberg.org/corona-contact-tracing-germany/cwa-android/pulls/229#issuecomment-375967 So, it's not armed yet in public releases.

Ein-Tim commented 2 years ago

2.14.4-dev has removed the refresh limitation and always carries out a risk check when opening the app.

Ein-Tim commented 2 years ago

But at work she's on wifi?

The user confirmed this. Also during the night her phone stays connected to the internet via WLAN.

Ein-Tim commented 2 years ago

@thomasaugsten

Good morning! I guess it snowed under yesterday, however, I want to come back to this. You stated in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065435060 that the risk status is not changing without risk checks.

However, as you can see in the OP of this issue and in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065222847 this is actually what is happening.

The risk status changes without a new risk check carried out.

Is this something we already saw in the past?

thomasaugsten commented 2 years ago

No for this case we need error report

Ein-Tim commented 2 years ago

@thomasaugsten

I will ask both of them to activate error logging in CWA. Please stand by.

Ein-Tim commented 2 years ago

@thomasaugsten

Second user already provided error log with the ID: B9EE2CF89D4067B533F3

This runs since yesterday, however, he's unsure whether it captured the change in the risk status without risk check.

He continues to log.

Ein-Tim commented 2 years ago

@thomasaugsten

The second user experienced the bug again & this time the error log defined captured it.

He saw at 09:10h that the app was green again and suspects that the risk check at 08:46h is responsible for this.

Screenshot from 09:10h:

Then, at 11:10h, he opened the app again, last check is still 08:46h and the app is red:

The error log which captured this has the ID 7529999FCB786D6FE3AE.

Will you take a look at this?

Ein-Tim commented 2 years ago

@thomasaugsten Did you have time to look at the log I shared the ID from in the comment above?

thomasaugsten commented 2 years ago

No findings yet

Ein-Tim commented 2 years ago

You mean there is nothing helpful in the log?

thomasaugsten commented 2 years ago

I mean we still checking for helpful things in the log

mlenkeit commented 2 years ago

fyi, we have found a few things in the logs that look suspicious but root cause analysis is still going on... it's not very obvious from the logs, so it takes some time...

Ein-Tim commented 2 years ago

@mlenkeit Thanks, let me know if you need more information from the users.

mlenkeit commented 2 years ago

The weighted time in the risk calculation is 0.6 + 4.8 + 3.5999999999999996, resulting in 8.9999999999999996.

The 3.5999999999999996 is actually 1.0 x 1.2 x 3.0, which should result in 3.6, in which case the weighted time would be 9.0.

The threshold for a high risk is exactly 9, so the risk should be high

The assumption is that due to how floating point numbers (i.e. doubles) are handled, the sequence, etc., that the weighted time sometimes results in 8.9 and sometimes in 9.0.

The proposal to fix this is to round the intermediate values to one decimal place.

ndegendogo commented 2 years ago

The assumption is that due to how floating point numbers (i.e. doubles) are handled, the sequence, etc., that the weighted time sometimes results in 8.9 and sometimes in 9.0.

wow...

@mlenkeit

Does this explain the observation that the status / color changed without a new risk calculation? Or was this statement not true, and a risk check (or at least a partial recalculation) was performed, but without an obvious / visible indication at the user interface?

Ein-Tim commented 2 years ago

PR https://github.com/corona-warn-app/cwa-app-android/pull/4943 is the fix for this. It targets the release/2.19.x branch.

mlenkeit commented 2 years ago

Does this explain the observation that the status / color changed without a new risk calculation? Or was this statement not true, and a risk check (or at least a partial recalculation) was performed, but without an obvious / visible indication at the user interface?

@ndegendogo AFAIK, the date on the UI (which is more or less the only visible indication) only changes when new key files were downloaded (i.e. the input for the risk calculation has changed). The risk calculation itself is actually executed more frequently.

So yes, it would explain the behavior.

ndegendogo commented 2 years ago

Thanks @mlenkeit !

jwildeboer commented 2 years ago

The iOS version exposes a similar behaviour, but as the risk calculation is visible in the journal, I could clearly pinpoint it to a specific date that changes between low risk and increased risk. Not sure why this feature isn't in the Android version.

Ein-Tim commented 2 years ago

Not sure why this feature isn't in the Android version.

It actually is.

Ein-Tim commented 2 years ago

I received no feedback from the user whether this issue was fixed or not, so I'm closing this issue due to missing feedback from the issue reporter.

corona-warn-app / cwa-app-android