"Risiko-Ermittlung gestoppt" though it had not been turned off

kereng5 commented 3 years ago

Describe the bug

gestoppt_5168 I did not stop the exposure check, and it was active at 2:46 am. So who stopped it and why?

One more small issue: When I click "Aktivieren" to reactivate the exposure check, switch the slider to the right and return to the previous screen by the "back triangle", it still shoes "Risiko-Ermittlung gestoppt", and the slider is left again. It works if I switch the slider to the right, close the CWA and start it new.

Steps to reproduce the issue

I just start the phone in the morning and read "Risiko-Ermittlung gestoppt" in the CWA. It happened today and on the day before yesterday, but not yesterday.

Technical details

Mobile device: Motorola G3
Android version: 6.0.1
CWA version 2.1.2
ENF 18211213000
no "Priorisierte Hintergrundaktivität"

Internal Tracking ID: EXPOSUREAPP-7095

Lu9Bro commented 3 years ago

On my device, the problem still occurs 1...3 times per day. This is a bit less frequent than the weeks before.

Lu9Bro commented 3 years ago

For several days now I’m trying to watch the screen while this problem will occur. But whenever I watch the CWA screen during download and processing of exposure data, I don’t get any problems. And when I don’t watch the screen for some hours, the problem has occurred once again in the meantime.

This brings me to the assumption: This problem does occur only if the device is “laid aside” (screen is off), and it does not occur while the device is “active” (screen is on, having made some clicks in CWA within the last minute). If so, the problem might be related to “background activity” and priorities.

Can other users ( @kereng5 , @cydur ) confirm or falsify my observation?

kereng5 commented 3 years ago

@Lu9Bro Yes, I had "Risiko-Ermittlung gestoppt" only when I started the phone in the morning (exept one time in the afternoon). When I start CWA, I often see "Daten werden heruntergeladen", but I never saw "Daten werden heruntergeladen" and then "Risiko-Ermittlung gestoppt".

cydur commented 3 years ago

@Lu9Bro Maybe. Since I'm trying to create an error report the problem didn't occur. This might be because I open the app much more frequently to check.

kereng5 commented 3 years ago

Today at 6 pm, I opened the CWA and it said "Daten werden heruntergeladen". After 15 minutes it showed "Aktualisiert Heute 11:56" again. The exposure log doesn't show the failed attempt, too. Maybe this has nothing to do with this issue.

Lu9Bro commented 3 years ago

Since 1 -2 weeks, this error did not occur again. I don't know if

it does not occur because the covid-19 incidence value has decreased much, or
because the ENF version has automatically updated from 18211515000 to 18212002104 in the meantime.

But now the "Daten werden heruntergeladen" and "Prüfung läuft" work faster then before. Much faster than expected (proportionally to the decreasing incidence values). I'm still using CWA version 2.2.1 , just to watch this issue a bit longer.

vaubaehn commented 3 years ago

Unfortunately this issue has not been fixed yet. There were some reports since release 2.5.1, here is the latest from Google Play Store:

Ich war bisher ganz zufrieden. Aber warum ist jetzt auf einmal mein Impf-Zertifikat weg? Auch stellt sich die Risiko Ermittlung immer wieder auf inaktiv und ich kann es nicht wieder aktivieren.

So it's even a combination with #3549

@d4rken Is the state of the switch of the Risiko-Ermittlung also part of another (HotData) flow?

d4rken commented 3 years ago

@d4rken Is the state of the switch of the Risiko-Ermittlung also part of another (HotData) flow?

No: https://github.com/corona-warn-app/cwa-app-android/blob/63df24be9d2bc1bfb324325d79b1c8b0f8606f4d/Corona-Warn-App/src/main/java/de/rki/coronawarnapp/nearby/modules/tracing/DefaultTracingStatus.kt

I don't think there is any direct relation between the two issues, except the device possibly being under a lot of load, causing the ENS communication to timeout/hang, and causing the certificates to hang or more just run into an edgecase more easily.

vaubaehn commented 3 years ago

@d4rken But it looks like a flow, and the state is emitted? https://github.com/corona-warn-app/cwa-app-android/blob/63df24be9d2bc1bfb324325d79b1c8b0f8606f4d/Corona-Warn-App/src/main/java/de/rki/coronawarnapp/nearby/modules/tracing/DefaultTracingStatus.kt#L98-L131

Or is the state requested on other places via suspend fun ?

I inspected my logs, and just from the log messages it looks like, the log messages of isTracingEnabled from the code above are not really fitting to what happened: At 06:26 UTC this morning, the watchdog called CWA. DKs were downloaded and checked. The submission to ENF took place in the same minute, but it took 5 minutes, until the task finished:

2021-07-22T06:41:20.036Z V/TaskControllerExtension: submitBlocking(request=DefaultTaskRequest(type=class de.rki.coronawarnapp.diagnosiskeys.download.DownloadDiagnosisKeysTask (Kotlin reflection is not available), arguments=de.rki.coronawarnapp.diagnosiskeys.download.DownloadDiagnosisKeysTask$Arguments@807aaec, id=66bc7652-ffb9-4731-bbba-0cfae5166cc2, originTag=ExposureWindowRiskWorkScheduler-WatchdogService, errorHandling=null)) continuing with result InternalTaskState(id=ec5d3955-935d-4ecb-97ef-7e3181ad2145, request=DefaultTaskRequest(type=class de.rki.coronawarnapp.diagnosiskeys.download.DownloadDiagnosisKeysTask (Kotlin reflection is not available), arguments=de.rki.coronawarnapp.diagnosiskeys.download.DownloadDiagnosisKeysTask$Arguments@807aaec, id=66bc7652-ffb9-4731-bbba-0cfae5166cc2, originTag=ExposureWindowRiskWorkScheduler-WatchdogService, errorHandling=null), createdAt=2021-07-22T06:36:21.608Z, startedAt=2021-07-22T06:36:21.611Z, finishedAt=2021-07-22T06:41:20.027Z, error=null, result=de.rki.coronawarnapp.diagnosiskeys.download.DownloadDiagnosisKeysTask$Result@c0b2faa, config=Config(executionTimeout=PT480S, collisionBehavior=SKIP_IF_SIBLING_RUNNING), job=LazyDeferredCoroutine{Completed}@bd8b419, task=de.rki.coronawarnapp.diagnosiskeys.download.DownloadDiagnosisKeysTask@67a7cde)

EDIT: According to ENS UI, the ENS started ~2021-07-22 06:40:00 UTC to check DKs.

In the same minute, when the task was started, isTracingEnabled was checked, and it failed:

2021-07-22T06:36:22.221Z D/DefaultDiagnosisKeyProv: Performing key submission.
2021-07-22T06:36:22.219Z I/DefaultENFVersion$isAtL: Comparing current ENF client version 18212418000 with 17000000
2021-07-22T06:36:26.548Z W/DefaultTracingStatus: Quick isEnabled check had timeout, retrying with more patience.
2021-07-22T06:36:26.546Z W/DefaultTracingStatus: Failed to determine tracing status.
kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 1000 ms
    at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:2)
    at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:1)
    at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:21)
    at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:12)
    at java.lang.Thread.run(Thread.java:818)

kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 1000 ms
    at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:2)
    at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:1)
    at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:21)
    at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:12)
    at java.lang.Thread.run(Thread.java:818)

I assume, that TimeoutCancellation was not intended in that form? Also there were no more log messages that could have been expected in that context, like https://github.com/corona-warn-app/cwa-app-android/blob/63df24be9d2bc1bfb324325d79b1c8b0f8606f4d/Corona-Warn-App/src/main/java/de/rki/coronawarnapp/nearby/modules/tracing/DefaultTracingStatus.kt#L113-L115

EDIT: What is also not clear for me, is, why the task just continued while https://github.com/corona-warn-app/cwa-app-android/blob/63df24be9d2bc1bfb324325d79b1c8b0f8606f4d/Corona-Warn-App/src/main/java/de/rki/coronawarnapp/nearby/modules/tracing/DefaultTracingStatus.kt#L146-L151 was aborted(?) with a timeout exception - so there was still a valid state for isTracingEnabled anyway (it was on, obviously)? From the logs, there was no more polling of the state within the 5 minutes the download task took to complete. EDIT2: Ok, I found your comment here: https://github.com/corona-warn-app/cwa-app-android/pull/3502#issue-673574450

The issue may look worse in some cases than it actually is. If the user never disabled tracing, tracing is still enabled, even if the isEnabled check has a timeout, but because the polling was canceled on timeout, the UI would have shown the last state in some cases (i.e. isEnabled=false due to timeout).

Also, what I did not expect, was that new quota was already calculated before DownloadDiagnosisKeysTask finished. The quota was already re-calculated at the beginning of the task:

2021-07-22T06:36:22.180Z D/SubmissionQuota: No new quota available (now=2021-07-22T06:36:22.179Z, availableAt=2021-07-23T00:00:00.000Z)
2021-07-22T06:36:22.180Z D/DefaultDiagnosisKeyProv: Consuming quota: old=5, new=4

If the ProvideDiagnosisKeys had been failed, we probably had lost one submission unnecessarily?

Side note: it does not happen often for me, that the DownloadDiagnosisKeysTask takes that long. Not sure, if it was for the first time after many weeks.

d4rken commented 3 years ago

@d4rken But it looks like a flow, and the state is emitted?

I may have misunderstood you then. I meant that it is not a HotDataFlow which is a custom Flow in the CWA. Yes it's also a Flow, just not a HotDataFlow.

I inspected my logs, and just from the log messages it looks like, the log messages of isTracingEnabled from the code above are not really fitting to what happened: At 06:26 UTC this morning, the watchdog called CWA. DKs were downloaded and checked.

The watchdog concept is still from before my time at the CWA, and I'm suspicious of it's effectiveness. Due to background reliability being such a hottopic, and there being "theoretically" no complaints (though it may be an unknown unknown), we've not taken taken it to the chopping block... The idea is that under some circumstances the GPlay services bind to the CWA, and give us guaranteed execution time. Google didn't explain in detail what is happening there, but it's using something like the service bind mechanism to elevate the CWA background priority to the priority of the play services (and actually start it too).

I think it binds the CWA process for a few seconds, meaning CoronaWarnApp.onCreate() executes and after a few seconds unbinds again, and then it's like opening the CWA and pressing the home button again (usual background priority behavior).

These few seconds are enough to schedule our workers again, but the WatchDogService actually starts a task right there: https://github.com/corona-warn-app/cwa-app-android/blob/63df24be9d2bc1bfb324325d79b1c8b0f8606f4d/Corona-Warn-App/src/main/java/de/rki/coronawarnapp/util/WatchdogService.kt#L50

After play-services unbind us, and if there is nothing else keeping us high-priority (UI or running workers), the system may just kill the CWA process again. The acquired WakeLocks, don't do anything to prevent that imho.

This all comes from a time long before we refactored the whole background execution. I don't think it currently helps, but I'm not sure if it hurts.

I assume, that TimeoutCancellation was not intended in that form? Also there were no more log messages that could have been expected in that context, like

The logged exception is a bit misleading, the order of log lines is not in order if you look closely:

Quick check happens, no answer within POLLING_DELAY_MS
The outer try catch shows the cancelation due to POLLING_DELAY_MS
Then because of that we perform a second check with a longer timeout (Quick isEnabled check had timeout)
If that also times out you would see Timeout on ENF side, assuming isEnabled false

This behavior is not unexpected on slow devices. It's just a cheap way to get a second check if the first one hangs.

Also, what I did not expect, was that new quota was already calculated before DownloadDiagnosisKeysTask finished. The quota was already re-calculated at the beginning of the task:
2021-07-22T06:36:22.180Z D/SubmissionQuota: No new quota available (now=2021-07-22T06:36:22.179Z, availableAt=2021-07-23T00:00:00.000Z)
2021-07-22T06:36:22.180Z D/DefaultDiagnosisKeyProv: Consuming quota: old=5, new=4
If the ProvideDiagnosisKeys had been failed, we probably had lost one submission unnecessarily?

We don't actually know whether a failed "provideDiagnosisKeys" (if it fails on ENS side) counts against the quota or not, we just assume. In any case the quota calculation was actually never armed, it still just counts. Exceeding the quota is currently just prevent by scheduling.

Please see also: #3502 (comment)

Side note: it does not happen often for me, that the DownloadDiagnosisKeysTask takes that long. Not sure, if it was for the first time after many weeks.

I think it's a good idea to do this, but I would suggest to do it the other way round: first check 2min timeout, 2nd check 1s. Why? It's very plausible that the system as a whole or just the ENF is too busy "with something" that it can't consume the API call or respond within timeout time. For the cases where this is going to happen, it would mean, that the now proposed first check (1s) fails due to system/ENS being busy. The second call will follow immediately after, and there is a high likelihood, that system/ENS is still busy and will also not answer the second call in time. If you do it the other way round, (2min timeout, then 1s) the likelihood is high, that system/ENS is already recovered when the 2nd call comes in, and can respond. What do you think?

I don't think the play services throw away the api calls, so if it recovers we should still get an answer within the timeout window of the first call. There may be a chance that the 2nd 1second call succeeds, but if we fail a quick call, and then do a long one (like it is now), the chance to get an answer, if the system is busy, would be equally high. I don't think two requests is enough to DDOS the system, and I don't think play services punish callers for this. The downside of changing the order, is that if it is any other issue than "system is busy", then the UI has to wait 2 minutes for the second try and would just be blank until then?

I was also wondering, what can make ENS that busy not to respond in time. For me the most plausible explanation would be heavy work load due to verifiying the signature of diagnosis keys, shortly after provision from CWA to ENS.

Not sure how threading/processes works in the play services, could also be totally unrelated, e.g. an app updating, other apps communicating with play services etc...

Last question: I guess the 2min timeout is set by CWA?

No by the ENS (or a general gplay services timeout). The CWA would timeout at 7-10min if the call itself was part of background execution (workers capped at 10min, TaskController tasks at 7min).

vaubaehn commented 2 years ago

Hi @mtwalli , @thomasaugsten and all, the ""Risiko-Ermittlung gestoppt" though it had not been turned off" is back.

In the past weeks I found a low number of reports on Google Play Store, and last night I was experiencing it myself: At 23:59 CET (22:59 UTC) I checked the ENF logs, found last risk check at 19:56 CET. I jumped to CWA using the app selector ("square"), and found CWA presenting risk calculation switch/card in "stopped" state:

Just took screenshot and jumped back to ENF log:

I checked the number of diagnosis key that have been matched successfully before:

I jumped back to CWA using app selector, and found CWA waiting for a response from ENF now:

A short time later, CWA got a response from ENF:

Matching completed successfully:

Error log was (like always) running in background, and I had a look: It was clear, that the both timeouts (1-2 check: #3502) timed out, so tracingEnabled was emitted as false. For some reason, another timeout check was started soon after. This time the quick timeout check failed, but ENF did not time out, so the matching and retrieving results succeeded. If you want to have a look yourself, you will find the error log with ID 823002ED93931D75AFC8 - interesting part starts at 2021-11-29T22:56:49.977Z in line 86716.

While I'm glad that Exposure Checks recovered without interventions, two questions came up:

Do we need to be afraid, that ENF will timeout more often with increasing numbers of diagnosis keys, and CWA can't recover so that no risk calculation succeeds anymore, especially on slow devices?
Is is it possible to fine tune the whole process that users won't see the the "Risiko-Ermittlung gestoppt" anymore? Inceasing quick check timeout won't help I guess, as ENF is the bottle neck here. Is it actually necessary to wait for the ENF to finish, or could it be enough to start risk calculation/displaying results, when the intent from ENF is received?

By the way, I found also a (temporary) problem with statistics: security exception because the download file could not be validated with its public key at 2021-11-29T22:59:26.954Z. How can this happen?

thomasaugsten commented 2 years ago

The ENF is sometime too slow and this can happens especially when the app is in the background and device is under load. The ENF is not timing out properly. Often it starts without any result.

In theory the security exception cannot happen. Maybe if the downloaded file is corrupted and he can still unzip it.

vaubaehn commented 2 years ago

Hi @mtwalli , There is an increasing amount of reports in Google Play Store currently that complain about the same issue here. Is there anything you can do, e.g. circumvent the 2-minute-GMS timeout?

Ein-Tim commented 2 years ago

I'm in contact with a Twitter user who encounters this bug, anything she could do to solve it?

dsarkar commented 2 years ago

@Ein-Tim The user can't activate the exposure logging? He flipped the toggle several times (in OS settings) and it always sets back to "deactivated"?

Ein-Tim commented 2 years ago

@dsarkar

After sole time with active exposure logging, the app shows "Risiko-Ermittlung gestoppt" Toggling exposure logging on/off then, doesn't do anything. Clicking on "Risiko-Ermittlung aktivieren", on the card sometimes works and it works again, but sometimes the app still shows "Risiko-Ermittlung gestoppt".

Ein-Tim commented 2 years ago

And another Twitter user reporting this: https://twitter.com/helgol/status/1483534553235050515?s=20

Device name: Galaxy XCover 3 (SM-G389F) Android version: Android 6.0.1 CWA version: 2.16.2

The main screen shows this: FJjns9FVQAUjS2o

And after trying to switch exposure logging on again, the user sees this pop up:

FJjntJzVgAIFV72

FYI @vaubaehn

Ein-Tim commented 2 years ago

The user on Twitter captured the error in an error log with the ID A7011DA8AD880A4BEBDF

vaubaehn commented 2 years ago

@thomasaugsten @mtwalli I could reproduce stopped Risiko-Ermittlung and recovery: obviously CWA did not receive any feedback from ENF API during some action in background and thus considered the ENF to be turned off. I was not able to turn on Risiko-Ermittlung via the button on top of the home nor via the CWA's settings menu. The best evidence for the missing link to the API was, that there was no ENF version number displayed on the bottom of the CWA's App-Information page (as the ENF version is only displayed when CWA thinks API is connected). This already happened in the past, and I was able to recover from this state by force closing the app, and restarting, then seeing a white blank page for a minute or so, and then getting back to work. This time I could recover by turning off bluetooth (there was no visible feedback on the home screen about turned off bluetooth!) and turning it on again. This obviously caused an intent received by CWA that re-created the (work)flow, and the switch of Risiko-Ermittlung automatically switched on again. I recorded an error log of all these steps, from switching CWA to foreground, finding Risiko-Ermittlung stopped, trying to turn it on again without success, then turning off bluetooth and back on again. The white risk card that informed that Risiko-Ermittlung is stopped turned to the state that download from server failed (which was not the case). After the Risk-Ermittlung turned back to on, the risk card switched to the green state. Fehlerbericht: 2022.02.04 - 17:37:13 ID 3E445FFE70BAA4906366

Possible fix: enhance/fix UI/Flow that trying to switch on button for Risiko-Ermittlung actually causes CWA to re-connect to ENF API, or implement a regular task via task scheduler.

thomasaugsten commented 2 years ago

We saw this already in error reports that the app receive a timeout from the ENF when it try to figure if the ENF is active or not. We are not sure what a good solution is to recover the ENF from this state. Your solution was to manually turn off and on bluetooth.

Normally the app should reconnect to the ENF on every risk calculation.

svengabr commented 2 years ago

I have just checked the linked Jira issue.

Jira Ticket is flagged as: Resolution: Wont Fix Status: Obsolete

Developer comment:

Closing due to logs unfortunately not providing the necessary information to be able to reproduce this.

corona-warn-app / cwa-app-android