Specs of the Risk Scoring function

Matioupi commented 4 years ago

Hello, I may have not searched deep enough, but I have not seen specifications for the risk scoring function. Is this part already in discussion / tested ? Any links or inputs to document my curiosity ? Regards

lbarman commented 4 years ago

Hi @Matioupi; nope not yet! Roughly speaking it's gonna be proportional with the number of encounters with sick people and their durations and "proximity" (via recorded RSSI power).

Sounds like a good idea for a pinned thread.

Matioupi commented 4 years ago

It will the somehow related to #204 and #188 then. No offense for the following image, but I have the feeling that there is a whole team designing and building a rocket here without knowing exactly what the payload will look like and what are the specs to carry it properly (e.g. if technically achievable RSSI to distance evaluations will be compatible with any useful epidemiology studies) I hope that the rocket currently on the workshop and designed to carry a useful earth observation satellite will not be reusable to some deadly bomb.

lbarman commented 4 years ago

No offense for the following image, but I have the feeling that there is a whole team designing and building a rocket here without knowing exactly what the payload will look like and what are the specs to carry it properly

None taken; if you have specific inputs on both "how this payload will look like and what specs are needed" they are very welcome :) we are aware (& working on) the problem.

I hope that the rocket currently on the workshop and designed to carry a useful earth observation satellite will not be reusable to some deadly bomb.

I assure you that this is one of our primary concerns; everything is public and if you spot such possibility, please let us know :)

edit: about your linked issues: Let's please try to keep a separation between Bluetooth contact-tracing and what we do on this issue: talking about how these contacts translate into a risk score. Thanks!

edit2: typo

Matioupi commented 4 years ago

Ok let's try to give inputs, but i'm neither an epidemiologist, nor a crypto/privacy specialist. Those are more like additional questions.

Should risk score function take as inputs :

only DP3T protocol data ?
additional user provided data like a "user profile" (uses public transportation / work in small office / how do you carry phone) that the user could select upon app installation.
additional local phone provided information e.g. phone attitude angle / acceleration / position / speed
additional user provided information that he/she could provide only at the scoring time (e.g. : do you want to provide additional information about the "configuration" at the time where you possibly met this contaminated person)
other sensors (e.g. ambient noise level ? ambiant light ?)
phone brand / model ?

Should risk evaluation function "sum up" the score of multiple contaminated proximity (or more precisely : how to handle multiple contaminated pseudonymes)

What type of output should the risk scoring function provide ? :

0/1 output
continuous 0% to 100% score giving the user a chance to choose what he/she will do next ?

Hope this was a useful contribution to start a more formal spec.

peterboncz commented 4 years ago

cross-posted from #188, probably more at place here (thanks @Matioupi)

rather than measuring distance, the app should measure danger, and compute a risk score for BLE logged contacts. Estimated distance is of course part of the score; interaction time is another.
Since different devices transmit BLE at different strengths, and make quality & damage probably cause variations; a list of phone models and their RSSI/Tx strengths (such as Singapore's BlueTrace has) might provide a starting point, but will not be good enough. The app should better have a calibration mode where you keep two mobile phones close (your housemate's), enforced e.g. by having them read a QR code from the other's screen, while measuring the RSSI/Tx strength of each other and communicating this back (could also use QR for that even ;-). Your devices calibrated RSSI/Tx strength should then be part of its broadcasted BLE packets so others can more reliably compute their distance from you.
this risk score would be epidemiologically more precise if beyond time and distance, it would leverage more features, e.g. ActivityTracking. Both iOS and Android detect the states: cycling, automotive, walking, stationary. The state (similar to calibrated BLE Tx strength) could also be part of the broadcasted BLE packet, where same-state encounters (of long duration and small distance) would be the most risky. Note: it would be great for the risk score if indoors/outdoors could also be sensed, but the best methodologies for that (#GSM stations, #ultrasound echos) is not available in iOS AFAIK.
The level of risk you have been exposed at could also be communicated to the user in the app, e.g.: you had X encounters with infected people; and if this is a red-flag-level risk say: we think it is advisable to get a test (possibly providing some code as proof for the health authorities). It is probably best not to expose the risk score itself to the user, as it will be confusing. But exposing the amount of encounters could be useful as feedback on one's behavioral pattern. It would also be helpful to show the date and rough time of the detected encounters to better understand which behaviors were dangerous. Of course, exposing this information leaks some privacy, so the time should be a rough indication only and leave room for doubt.
in matching centrally posted alerts (SKs) with your contact history and computing the risk scores these pose, a threshold needs to be applied to decide whether the found matches are red-flag-level in order to advise/invite for a test. As the app is deployed, this threshold value needs to be tuned by the health authorities. This threshold can also be used to safeguard against overwhelming testing capacity due to too many false positives (invite the top-N people at risk for a test, where N is testing capacity) by raising or dropping the threshold on a regular basis.
involve the user. If something unsafe happens, like somebody coughing nearby a lot in the bus, one could press a button in the app noting that, giving explicit context for the risk score. On entering a busy place, one might signal that. Similarly, one may explicitly declare certain places safe: e.g. allow to whitelist some WiFi connectivity where logged contacts are to be ignored (i.e. home) -- this will in fact also protect against erroneously catching the neighbours BLE signal through a wall or floor.

Matioupi commented 4 years ago

This document https://www.santepubliquefrance.fr/content/download/230088/file/20200221_COVID19_contact_non_hospitalier.pdf is an official french manual survey / contact tracing document.

The foreword is :

Un contact étroit à risque modéré à élevé est une personne qui a partagé le même lieu de vie (par exemple : famille, même chambre) que le cas confirmé ou a eu un contact direct avec lui, en face à face, à moins d’1 mètre du cas et/ou pendant plus de 15 minutes, au moment d’une toux, d’un éternuement ou lors d’une discussion ; flirt ; amis intimes ; voisins de classe ou de bureau ; voisins du cas dans un moyen de transport de manière prolongée ; personne prodiguant des soins à un cas confirmé ou personnel de laboratoire manipulant des prélèvements biologiques d’un cas confirmé, en l’absence de moyens de protection adéquats.

Not saying this is true because so many statements have limited lifetime these day, a good risk scoring function will probably be able to threshold both a distance and a continuous/cumulated contact time.

peterboncz commented 4 years ago

My French is a bit rusty, but I get that. Still I do think that on top of proximity and time, indoors vs outdoors makes a difference in the risk. A specific case of that, public transportation, would be caught by ActivityTracking classifying you as being in automotive state. Spending time in a bar/club/restaurant (stationary) regrettably cannot be sensed so easily, but users could self-annotate "indoors"; maybe prompted by the app, after logging of stationary contacts while not being in home WiFi.

By the way, whitelisting WiFi could also help people who may work with protective clothing (healthcare, but maybe more in the future, e.g. barbers) to avoid the app unneccesarily logging them. But simply checking and out, to block sequences of wearing protective clothing should also be an app feature, for this reason.

Matioupi commented 4 years ago

This https://www.ecdc.europa.eu/sites/default/files/documents/Contact-tracing-Public-health-management-persons-including-healthcare-workers-having-had-contact-with-COVID-19-cases-in-the-European-Union%E2%80%93second-update_0.pdf is some EU official English ressource of the same kind as the above provided french one.

winfried commented 4 years ago

When creating specs for the risk scoring function, also evaluate the error bandwidth that would still be acceptable.

helme commented 4 years ago

Hi all,

in #188 (alternative link) I already referenced our recently published work on risk estimation based on BLE (arxiv-link).

In this work we propose a simple yet powerful model for reliable risk estimation based on BLE. Please ask me in any case, I will answer asap. I could also provide code and maybe the data.

Best

winfried commented 4 years ago

@helme nice article and clearly defined steps. Can you for each step also include an estimate of the error introduced, for example by the assumption that risk of contamination is a function of proximity and exposure time or what precision the machine learning model can reach? Do you know, from an epidemiological point of view what errors are acceptable for the contact tracing to still be functional?

helme commented 4 years ago

@helme nice article and clearly defined steps. Can you for each step also include an estimate of the error introduced, for example by the assumption that risk of contamination is a function of proximity and exposure time or what precision the machine learning model can reach?

@winfried the errors introduced is only caused by noise on BLE measurements (assuming no that there is no label noise). If there would be no noise, the predictions would lineup to a straight line for the linear distance-model (in the plots in Figure 1 E). Non-linear distance-models are more tricky, but should behave similar.

Do you know, from an epidemiological point of view what errors are acceptable for the contact tracing to still be functional?

We already know, that a linear model is misleading (since a tiny amount a risk is aggregated even when you are very far apart), the sigmoid or the box model should be preferred. this was also observed and published in this science paper

does this answer your questions?

winfried commented 4 years ago

@helme I am very glad you mention the epidemiological paper by the Oxford group. If you read it well, you see that the error margin they deduce is very small. 30% of false negatives is already (literally) deadly according to their calculations. Compensating that with a bigger box, results in more false positives, up to the point where the app becomes unusable because of too many false positives. ("all of the population at risk"). So I guess you know the error margins are very narrow here.

When I asked about the errors you introduced, I meant errors introduced in all steps of your workflow, not just the BLE noise. Lets take your first step for example:

Define an epidemiological model to convert proximity time series to in-fection risk scores.

This already suggests there is a strong correlation between proximity time series and in-fection risk zones. But for example being in a closed room for a longer period of time with somebody who is contagious, even when the distance is way bigger then 2 meters, is likely to infect you. Being just centimetres apart but with a barrier or protection, makes it highly unlikely you get contaminated. It is also so that when you are in front of somebody who sneezes, it is fare more likely to get contaminated then when you are at somebodies back. So the sole step of approximating contamination risk with distance introduces errors. Did you ever take such errors into account?

Or take this step:

Train a machine learning (ML)model to estimate the infection risk.

We all know that ML models have a certain amount of precision, depending on the model used, the training set and the size of it, etc. I see no estimate of the precision that can be expected. And because you've read the Oxford paper, you know how important it is to stay within a narrow error bandwidth.

I really appreciate your take on this, but you would really help when you make an estimation of the errors introduced in every step and the cumulative error of the whole methodology.

helme commented 4 years ago

This already suggests there is a strong correlation between proximity time series and in-fection risk zones. But for example being in a closed room for a longer period of time with somebody who is contagious, even when the distance is way bigger then 2 meters, is likely to infect you. Being just centimetres apart but with a barrier or protection, makes it highly unlikely you get contaminated. It is also so that when you are in front of somebody who sneezes, it is fare more likely to get contaminated then when you are at somebodies back. So the sole step of approximating contamination risk with distance introduces errors. Did you ever take such errors into account?

I'm still convinced that the assumption that "proximity correlates with infection-risk" is a good starting point. Since we don't have any real world observations about true infections yet, we have to come up with a reasonable prior belief first before we can adjust it. Unfortunately we didn't studied your concerns yet, but this should be investigated indeed very soon (but I think that any obstacle in between two peers will dampen the signal and therefore will dampen the risk of infection, please correct me if I'm too naive here EDIT: according to #188 I am). Here you can clearly see the correlation between proximity and received signal strength:

data

But anyway, we use just use this assumption for labeling time-series of RSSI values as 1 (risky) or 0 (no risky), and depending on the "epidemiological function" (linear, box or sigmoid) and "reference sequence" (i.e. x minutes being closer than y meters) gives as one possible labeling. These functions are of course somewhat arbitrary, but all are monotonically decreasing with distance. In the end, this should be modeled by epidemiologists.

We all know that ML models have a certain amount of precision, depending on the model used, the training set and the size of it, etc. I see no estimate of the precision that can be expected. And because you've read the Oxford paper, you know how important it is to stay within a narrow error bandwidth.

In our work we assume that the reader knows how to read the ROC curves in figure 1 E (which allows statements about precision and recall. But here is an example for two "epidemiological function" (linear and sigmoid) row-wise. Different thresholds (of the respectively trained model) column-wise allow for trade-off between true and false positives rates:

results

So for example: For sigmoid (lower row) and a threshold of 0.2 we have a true positive of 82% at a false positives rate of 11%. In other words, we have only 18% false negatives (i.e. below the deadly 30%). A lower threshold is for panic modus (i.e. false positives doesn't matter that much), while a high threshold (e.g. 0.5) gives us a low false positives rate of 1% (but at the expense of much lower true positives rate of 40% i.e. 60% false negatives, which can maybe considered at the end of the pandemic).

I really appreciate your take on this, but you would really help when you make an estimation of the errors introduced in every step and the cumulative error of the whole methodology.

@winfried I hope this answers some of your questions, but please keep asking and reviewing about this topic, because I'm a novice in those topics (BLE and epidemiology), I just worked with this data as computer scientist with knowledge about data science and machine learning (buzzword-alarm!!), so take my notes with caution ;) But from my point of view there is very little need for machine learning here (very simple linear regression models trained with labeled data coming from standardized experiments, i.e. RSSI values with associated proximities or risk scores as label), we should keep things simple and try to come up with at least something simple&robust very soon. Currently this is treated a uni-variate time-series (only RSSI values), but of course this can (and should) be treated as multi-variate time-series (i.e. adding gyroscope and audio as proposed in #188).

winfried commented 4 years ago

@helme I really like your work on the false positives and false negatives of your methodology. It is first good analysis I have seen of the false positives and false negatives of calculating a distance. And I must admit: your estimate is less bad then I expected. Still wondering how it would perform in real world settings.

But all this makes the biggest error even more important: the discrepancy between "contamination distance" and "linear distance". And as you noticed: that is a huge problem because "contamination distance" depends on many factors. Almost all of them are non-constant and a part of them are unknown. Scoring the risk solely on distance introduces too many errors.

zukunft commented 4 years ago

@winfried Scoring the risk solely on distance introduces too many errors.

So you mean, as peterboncz suggested the activity (cycling, automotive, walking, stationary) and the position (indoors/outdoors) should be included in the risk scoring? It seems that there are not many real world data yet available, so it could be useful to update the risk scoring parameter later. To get real world data for adjusting the parameter, some users must expose more data and give up some privacy. Has someone discussed the idea of a kind of "Risk Scoring Calibration App"?

winfried commented 4 years ago

@zukunft and there are many more factors there: patterns in air movement, if somebody coughs, barriers etc. Yesterday a physician from an outbreak management team told me that even being tested negative doesn't guarantee you are not contagious. Behaviour is a better indicator of contamination risk then distance. But still it is a very rough indicator. None of these indicators will ever reach the precision needed for a contact tracing and quarantine scenario.

There is a methodology for "Risk Scoring Calibration", it is called contact tracing for scientific research. It won't stop the pandemic but it will provide us vital information. You don't need the DP-3T protocol for that, but sifting through your location history with a trained contact tracer may come in handy there.

AidanToase commented 4 years ago

Love this thread - in my opinion getting the risk factor "correct" could make a big difference to the number of people the app will need to isolate to hold back the disease - so the efficacy of the risk factor algorithm is critical. A badly calibrated risk factor could easily ask a large percentage of the populate to isolate and be self defeating.

But don't you need to centralise the contact data in someway to design a sensible algorithm which pushes this back to a centralised approach?

Is there a halfway house between a centralised approach and decentralised approach where you only centralise a sample of the contact data for the purpose of designing the risk factor algorithm... (or a larger sample to machine learning and really optimise algorithm?)

qthegreat3 commented 4 years ago

As I listen to the discussion, my question is, "What do we want the user to do, based on the score?"

Maybe I missed this on another thread, but based on risk, are we hoping that the user:

Self-quarantines
Goes to get tested
Both
something else

Are we using the score internally, to make a recommendation? For example, the app calculates the users risk and then makes a recommendation such as "Get Tested". And the user never sees the score.

OR

Do we show the score along with the recommendation?

OR

Do we just show a score?

Based on that, I think how the score is calculated makes a big difference.

For example, if we are just going to recommend that get tested....how big of a deal does it matter if we ere on the side of caution of "Get Tested" because you happened to be in "Bluetooth" proximity to a person who tested positive? We can have more error and refine as we get better.

If we recommend, "self quarantine" I could see recommendations having to be more accurate. That is more painful for the user.

But I think the question is, which is better? To get something out there where people can start being traced and make some errors with helpful recommendations or to have people wait longer for better recommendations?

I'm assuming that the Risk Score is going to be used for recommendation to the user. So based on that, how accurate does it need to be?

keugens commented 4 years ago

I would suggest a notification like this: Yesterday around 14.15 you had a contact with a 30% probability of being infected. Please contact your home health service for further advise.

So it is up to the user, his way of living including contacts to persons at risk, and the health service if quarantine or testing or something else is the most appropriate action. And not to forget cases, where a proximity was detected falsely due to any circumstances.

DP-3T / documents

Specs of the Risk Scoring function #235