Closed Ein-Tim closed 2 years ago
Thanks for forwarding this Twitter report @Ein-Tim. We have created an internal ticket for this and will raise this topic internally. Internal Tracking ID: EXPOSUREAPP-12211
Corona-Warn-App Open Source Team
CONFIRMED Same behavior by a second Twitter user: https://twitter.com/tweet_matz/status/1501647192959946754?s=20&t=zXcsNIYR4AqSJ48uQ5P8Fw
In the morning | 2 hours later |
---|---|
The user has no check ins in his CWA.
Here is the EN log: all-exposure-checks.txt
@thomasaugsten I'm a bit concerned that this could affect many users, this should be investigated and if it is a Google bug reported to Google.
@larswmh Would you please be so kind and also mirror this comment to the internal ticket?
We aware of the situation that the Android ENF sometimes return 0 key matches. We reported this already but it is not reproducible why the android ENF behaves like this
@thomasaugsten But how exactly does this explain the change in risk status without a new check?
That's indeed strange.
I'm throwing in some facts that may fit here - or may not... just as a starting point.
CWA is calculating the risk status very often in the day, not only once after Diagnosis Keys have been downloaded from server and DKs have been provided to ENF. When risk calculation is triggered, CWA receives ExposureWindows()
and uses them to calculate the current risk. So, theoretically a risk status change may occur, eventhough no new DKs have been downloaded/provided to ENF, but there is an ENF internal update of the ExposureWindow data base.
So, the only plausible explanation that I see currently - when assuming no new error has been introduced by CWA update or ENS update - , that there may be a timeout in ENF. A very long timeout, that makes CWA think, after DK provision to ENS no new matches have been found, and after some time ENF recovers, updates ExposureWindows db, and then, when CWA pulls an ExposureWindow update, gets a new/additional high risk encounter submitted.
Edit: But if this theory was true, that problem should already have been present before - at least during January/February.
@vaubaehn Okay I understand so it is indeed possible and "normal" that the risk status might change without a new check carried out?
@Ein-Tim It's not "normal" (a timeout should not happen) but theoretically possible.
But I'm puzzled with the "hin- und her" of the status change... Could you ask them, if they were out during the night on the days with high risk encounters (risk encounter in bar, pub...) - then there could also be a problem, that may be related to inconsistent dates (i.e., when a new day starts with regard to midnight). Also, for the first user, it's obvious from the all-exposures-log, that her downloads/provision of DKs are not that frequent as they could be... Could you ask her, whether she's often on mobile or turns her phone off often (if not, that could indicate more strongly towards stuck ENF).
@thomasaugsten
We aware of the situation that the Android ENF sometimes return 0 key matches. We reported this already but it is not reproducible why the android ENF behaves like this
From my observation, there was a change in ENF introduced in the end of last year: Only when a provision of Diagnosis Keys leads to a new match from a fresh, not already before provided TEK, a number of matches > 0 is returned. Could you cross check, or if I'm wrong, present some example cases for the number of matches?
With this hypothesis in mind, looking to the exposure log of the first users, looks like she's often in the crowd in a hot-spot-area and collecting risk contacts by this. The last provision alone showed 9 risk contacts. According to what I wrote above, I assume they're all distinct people, or longer contacts across several exposure windows.
@vaubaehn
It's not "normal" (a timeout should not happen) but theoretically possible.
Yeah, right.
Could you ask them, if they were out during the night on the days with high risk encounters (risk encounter in bar, pub...) - then there could also be a problem, that may be related to inconsistent dates (i.e., when a new day starts with regard to midnight).
I can ask them, however, I don't understand how this would lead to the behavior encountered by them. Maybe you could elaborate a bit?
Could you ask her, whether she's often on mobile or turns her phone off often (if not, that could indicate more strongly towards stuck ENF).
I'll ask her right now.
@vaubaehn Her answer:
hab auf dem Weg zur Arbeit und zurück immer mobile Daten an, aus ist es selten. Hatte auch eine zeitlang Bluetooth zwischendurch aus […], aber das ist jetzt auch schon glaub ich zwei, drei Wochen her
I can ask them, however, I don't understand how this would lead to the behavior encountered by them. Maybe you could elaborate a bit?
This is just a guess into the very blue: I'm thinking of risk contacts that span over midnight, e.g., one exposure window was received shortly before midnight, one exposure window after midnight. If there is "somewhere" (and I'm sorry I can't tell it more concrete atm) an inconsistency between dates in ExposureWindows() returned from ENF and dates used for risk calculation according to the day the risk contact counts for - when in the first calculation of the day all ExposureWindows of the risk contact are aggregated to day "1" and in the second calculation (for some reason) are spanned/calculated to day "1" and to day "2", that could lead to an unxepected change of the risk status. The inconsistency could be related to evaluation of dates like UTC and UTC+1 differently for ExposureWindows and risk calculation (we had these inconsistencies in the past). So my thought was, if both users respond like "yes, on that day I was out over midnight) it would be worth to look deeper if there can actually be an inconsistency, and if they say "no", then we can skip thinking about it further.
@vaubaehn Her answer:
hab auf dem Weg zur Arbeit und zurück immer mobile Daten an, aus ist es selten. Hatte auch eine zeitlang Bluetooth zwischendurch aus […], aber das ist jetzt auch schon glaub ich zwei, drei Wochen her
But at work she's on wifi? "timestamp":"11. März 2022, 13:39", "timestamp":"11. März 2022, 04:22", "timestamp":"11. März 2022, 00:13", "timestamp":"10. März 2022, 18:56", "timestamp":"10. März 2022, 06:40", "timestamp":"9. März 2022, 23:28", "timestamp":"9. März 2022, 18:35", "timestamp":"9. März 2022, 06:25", "timestamp":"8. März 2022, 18:51", "timestamp":"8. März 2022, 02:41", "timestamp":"7. März 2022, 22:34", "timestamp":"7. März 2022, 00:08", ...and so on...
You see what I mean? There are large gaps without exposure checks. If she's actually most of the time on wifi, only shortly on mobile data, and she doesn't turn off her phone, then this pattern is not normal.
But at work she's on wifi?
I asked her.
You see what I mean?
Yeah, totally!
Regarding https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065324933, okay, I understand what you mean & will ask!
Answer from user 2:
Am besagten Tag, 7.3., war ich nicht über Mitternacht unterwegs. Eigentlich hatte ich an dem Tag nur einen Kontakt, die Person ist aber nicht betroffen. Also weder positiv getestet noch rote CWA. Das macht es nochmal seltsamer.
Am besagten Tag, 7.3., war ich nicht über Mitternacht unterwegs. Eigentlich hatte ich an dem Tag nur einen Kontakt, die Person ist aber nicht betroffen.
hmmm... It would have been the night from Sunday to Monday. That the user did not have a known contact to an infected person, makes it look like it's the "neighbourhood-warning" when receiving BLE through walls... So, actually nothing can be excluded, and in the same way there is no clear indicator for the root cause.
I think, in this case the all-exposures-log could be helpful. Do you think the user could provide it?
Theory:
@thomasaugsten : do you know, whether for risk calculation ExposureWindows retrieved from ENF may age out on an hourly basis?
@vaubaehn The log is included in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065222847, or do you mean something else?
@vaubaehn The log is included in #4925 (comment), or do you mean something else?
@Ein-Tim Ah, sorry, I missed that. Yes, I meant this one. Ok, compared to the other user much less risk contacts (here: 17 matches in 15 days), but still a resonable amount overall (compared to me: 1 match in 6 months). So I think there is a higher likelihood of frequent status changes in the two observed cases.
Without risk check the risk status isn't changing Age out on Android is very exact on hour base Background check on Android depends on the settings of the OEM what they do on battery mode, low battery mode or other states What we see before the ENF state is switching between 0 matches and high risk matches and the user receive multiple time the high risk to no risk information. But so far no pattern visible
@thomasaugsten
Without risk check the risk status isn't changing
As you can see in the OP & in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065222847 this actually happens. Or am I misunderstanding you here?
@thomasaugsten
Age out on Android is very exact on hour base
Thanks, good to know, this may explain the "sudden changes" to a lower risk within the day when when exposure windows of a certain hour age out.
What we see before the ENF state is switching between 0 matches and high risk matches and the user receive multiple time the high risk to no risk information. But so far no pattern visible
Could you elaborate technically, what exactly means "0 matches": are you referring to the number of ExposureWindows (objects) that are retrieved by getExposureWindows()
(so in this case, getExposureWindows()
returns NULL
(no object))? Or do you refer to the number of matches that is displayed in Google's UI in the exposure history?
Without risk check the risk status isn't changing
My understanding is like this:
Scheduled DownloadDiagnosisKeysTask
is triggered > DK download > ENF provision of DKs > ENF sends intent (NO_MATCH or MATCH) > CWA receives intent and calls getExposureWindows()
> risk calculation > change/no change of risk status
BUT: it looks like, CWA is also often processing
getExposureWindows()
(and most likely also does risk calculation after) independently of DownloadDiagnosisKeysTask
:
When we assume that CCTG didn't change CWA's behavior in risk checking etc., then microG reveals, that getExposureWindows()
is called very often:
That would be statistically around 8 times per hour.
I refering to getExposureWindows but I cannot say the exact return value I think CCTG changed the calculation interval because they have not the 6 times per day limitation
I refering to getExposureWindows but I cannot say the exact return value
Ok, thanks!
I think CCTG changed the calculation interval because they have not the 6 times per day limitation
Hm, I don't think they changed anything - you can see in the screenshot that provideDiagnosisKeys() is also only called every 4 to 6 hours.
getExposureWindows()
does not have any ENF quota.
I refering to getExposureWindows but I cannot say the exact return value
Ok, thanks!
So when ENF returns NULL
on getExposureWindows() when actually at least one object is expected, then the "easiest" explanation for me would again be ENF timeout...
Ok I only remember this change https://codeberg.org/corona-contact-tracing-germany/cwa-android/commit/e3e109145e4c1f61a9f6e700ddeb48ccfc2db348
I will double check about the outputs of getExposureWindows
Ok I only remember this change https://codeberg.org/corona-contact-tracing-germany/cwa-android/commit/e3e109145e4c1f61a9f6e700ddeb48ccfc2db348
Very good catch! But strange anyways: My CCTG exactly behaves like CWA (every 4-6 hours download and provideDiagnosisKeys()
) and microG reports the same. Just getExposureWindows()
is called more often by factor ~40.
Maybe @fynngodau or @Bubu can comment, if they find time.
I will double check about the outputs of getExposureWindows
Perfect, thanks!
@thomasaugsten Just found that your referenced CCTG PR "refresh at will" is merged to v2.14.4-dev. The last release is 2.14.3. https://codeberg.org/corona-contact-tracing-germany/cwa-android/pulls/229#issuecomment-375967 So, it's not armed yet in public releases.
2.14.4-dev has removed the refresh limitation and always carries out a risk check when opening the app.
But at work she's on wifi?
The user confirmed this. Also during the night her phone stays connected to the internet via WLAN.
@thomasaugsten
Good morning! I guess it snowed under yesterday, however, I want to come back to this. You stated in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065435060 that the risk status is not changing without risk checks.
However, as you can see in the OP of this issue and in https://github.com/corona-warn-app/cwa-app-android/issues/4925#issuecomment-1065222847 this is actually what is happening.
The risk status changes without a new risk check carried out.
Is this something we already saw in the past?
No for this case we need error report
@thomasaugsten
I will ask both of them to activate error logging in CWA. Please stand by.
@thomasaugsten
Second user already provided error log with the ID: B9EE2CF89D4067B533F3
This runs since yesterday, however, he's unsure whether it captured the change in the risk status without risk check.
He continues to log.
@thomasaugsten
The second user experienced the bug again & this time the error log defined captured it.
He saw at 09:10h that the app was green again and suspects that the risk check at 08:46h is responsible for this.
Screenshot from 09:10h:
Then, at 11:10h, he opened the app again, last check is still 08:46h and the app is red:
The error log which captured this has the ID 7529999FCB786D6FE3AE.
Will you take a look at this?
@thomasaugsten Did you have time to look at the log I shared the ID from in the comment above?
No findings yet
You mean there is nothing helpful in the log?
I mean we still checking for helpful things in the log
fyi, we have found a few things in the logs that look suspicious but root cause analysis is still going on... it's not very obvious from the logs, so it takes some time...
@mlenkeit Thanks, let me know if you need more information from the users.
The weighted time in the risk calculation is 0.6 + 4.8 + 3.5999999999999996
, resulting in 8.9999999999999996
.
The 3.5999999999999996
is actually 1.0 x 1.2 x 3.0
, which should result in 3.6
, in which case the weighted time would be 9.0
.
The threshold for a high risk is exactly 9
, so the risk should be high
The assumption is that due to how floating point numbers (i.e. doubles) are handled, the sequence, etc., that the weighted time sometimes results in 8.9
and sometimes in 9.0
.
The proposal to fix this is to round the intermediate values to one decimal place.
The assumption is that due to how floating point numbers (i.e. doubles) are handled, the sequence, etc., that the weighted time sometimes results in 8.9 and sometimes in 9.0.
wow...
@mlenkeit
Does this explain the observation that the status / color changed without a new risk calculation? Or was this statement not true, and a risk check (or at least a partial recalculation) was performed, but without an obvious / visible indication at the user interface?
PR https://github.com/corona-warn-app/cwa-app-android/pull/4943 is the fix for this. It targets the release/2.19.x
branch.
Does this explain the observation that the status / color changed without a new risk calculation? Or was this statement not true, and a risk check (or at least a partial recalculation) was performed, but without an obvious / visible indication at the user interface?
@ndegendogo AFAIK, the date on the UI (which is more or less the only visible indication) only changes when new key files were downloaded (i.e. the input for the risk calculation has changed). The risk calculation itself is actually executed more frequently.
So yes, it would explain the behavior.
Thanks @mlenkeit !
The iOS version exposes a similar behaviour, but as the risk calculation is visible in the journal, I could clearly pinpoint it to a specific date that changes between low risk and increased risk. Not sure why this feature isn't in the Android version.
Not sure why this feature isn't in the Android version.
It actually is.
I received no feedback from the user whether this issue was fixed or not, so I'm closing this issue due to missing feedback from the issue reporter.
Avoid duplicates
Technical details
Describe the bug
I have a Twitter user reporting the following: She received an increased risk encounter for the 04.03.22 two days ago. Since she received the encounter, the app switches between increased risk and low risk.
See these screenshots from today, both were taken with the last risk check carried out at 04:22h:
Steps to reproduce the issue
I'm not sure whether this happens with all risk encounters at the moment or what exactly could be done to reproduce this. I did not identify any special things while writing with the user on Twitter.
Expected behaviour
The app should never switch between different risk statuses without carrying out a new check.
Additional context
Pings
@thomasaugsten Please take a look & let me know if you want a CWA error log. @dsarkar, @larswmh or @svengabr Please mirror this to JIRA ASAP. FYI @vaubaehn @mtwalli
Internal Tracking ID: EXPOSUREAPP-12211