DP-3T / documents

Decentralized Privacy-Preserving Proximity Tracing -- Documents

Bluetooth CT, Walls, False-positives, Distances, RSSI, etc. #188

Open nadimkobeissi opened 4 years ago

nadimkobeissi commented 4 years ago

I'm very surprised that I couldn't find an existing issue for this and it's most likely that it's my fault (I checked through all open issues twice), so please do close this issue if it's a duplicate or if it's already resolved.

My main concern with DP-3T currently is not a privacy/security one but actually regarding the harsh physical limitations of the Bluetooth LE communication channel. Is it really sufficient for determining actual contact events?

Consider for example: I drive around and stop at the traffic light next to someone in their own car who's infected. Thanks to how EphID exchange works over Bluetooth LE, I've now "been in contact" and our devices have exchanged EphIDs, despite neither of us having left our cars.

Many such examples can be imagined.

My concern is that this would lead to such a large number of false contact events that it would undermine the usefulness of DP-3T almost completely, even though it is not a problem from a privacy or security standpoint.

What existing measures are being considered to mitigate this class of problems?

lbarman commented 4 years ago

Hi @kaepora; thanks for your comment. In short, we don't know yet how well Bluetooth contact-tracing itself will work; this is more a choice because other alternatives are (in our opinion) less privacy-preserving. This is a hard problem for sure; if you have specific inputs, please let us know and we'll forward them to the appropriate team! Thanks

nadimkobeissi commented 4 years ago

I don't have anything in the way of a solution right now, but I discourage the assignment of the will-close-soon-without-further-input tag to this issue. It's a bad idea not to make this discussion a central topic, given that it can have a central impact on the viability of DP-3T as a whole, and especially given that...

we don't know yet how well Bluetooth contact-tracing itself will work

...there is a lack of empirical data on the viability of the primary wireless communication standard being considered.

I think it's important to keep this discussion going until the following steps have been taken:

  1. Real-world testing of the reliability of DP-3T using Bluetooth LE against false positives,
  2. Listing of alternative wireless communication standards to be tested,
  3. Comparison in terms of false positives, power consumption, availability, privacy-preserving properties of competing wireless communications standards.

lbarman commented 4 years ago

Sure, and so my general answer is the same as in https://github.com/DP-3T/documents/issues/41 and https://github.com/DP-3T/documents/issues/49: we know this is a hard problem and specific inputs are much appreciated! Ideally this should be in the FAQ; in the meantime, let me hijack your issue to keep one open about the Bluetooth difficulties :)

EDIT: We are of:

leenaars commented 4 years ago

Use of near-ultrasound in addition to Bluetooth could be helpful to reduce the amount of false positives. See: https://nlnet.nl/project/Simmel or the Austrian Red Cross app.

lbarman commented 4 years ago

@leenaars thanks! moving ultrasound discussions to https://github.com/DP-3T/documents/issues/185 if you don't mind :)

AlejandroUPCT commented 4 years ago

Hi all, as an engineer researching tracking and positioning technologies with IoT: we need to formalize the technical problem of proximity (in parallel with the privacy and security issues) and define the use cases and technical requirements with the epidemiologist experts. For example: is proximity 5 meters, 10 or 30 meters, and for how long? Is it the same outdoors, semi-outdoors or indoors? Etc., etc. I bet on Bluetooth Low Energy, and we can put a lot of research effort into signal processing and machine learning, but we need measurable goals to focus on.

My view is that we should ask the epidemiologist experts to define the main use cases, the important proximity metrics and the parameters, taking the infection models into account.

Besides, these well-defined use cases and proximity metrics will allow us to test and evaluate the performance of every solution.

Hope that helps, and best regards from Spain, Alejandro. PS: Sorry if this is already defined; I'm doing my best to stay up to date with this forum and the documentation.

strubbi77 commented 4 years ago

For example: is proximity 5 meters, 10 or 30 meters, and for how long? Is it the same outdoors, semi-outdoors or indoors? Etc., etc. I bet on Bluetooth Low Energy, and we can put a lot of research effort into signal processing and machine learning, but we need measurable goals to focus on.

My view is that we should ask the epidemiologist experts to define the main use cases, the important proximity metrics and the parameters, taking the infection models into account.

The only information that is missing is the transmit strength of the antenna of phone A. Because the EphID is changed every minute, we know how long the contact was. The parameters can be changed later, after some testing of the app is done.

lbarman commented 4 years ago

BlueTrace made some measurements about strengths of transmissions for various phones: https://github.com/opentrace-community/opentrace-calibration/blob/master/Device%20Data.csv

kholtman commented 4 years ago

@lbarman :

In short, we don't know yet how well Bluetooth contact-tracing itself will work,

I understand you are speaking for the DP-3T project here. Thanks for being frank about your status on this matter.

this is more a choice because other alternatives are (in our opinion) less privacy-preserving. This is a hard problem for sure; if you have specific inputs, please let us know and we'll forward it to the appropriate team !

As requested, here is some detailed expert input on the hard problem.

In various comments, others have already pointed out specific mechanisms and hurdles that make using Bluetooth for distance measurements between phones hard. I will not add to that list here.

My goal in this writeup is to document my best current expert guess about likely system-level performance for the April 12 white paper design. This means estimating false positive and false negative rates in the envisaged contact warning messages to users. I will also document some implications of my best current guess, both for systems design and for the problem of broader communication to the EU public.

TL;DR: An app relying exclusively on Bluetooth measurements will likely not be useful. A pivot to a multi-sensor approach is the most promising solution direction.

More details below.

Credentials, methods, limitations

I will first say something about my credentials as an expert. In the last 15 years, I have worked several times in projects that developed and/or evaluated location tracking systems. These projects were done in various industrial R&D lab settings, in labs with deep and world-class organizational experience in bringing new systems with wireless technology to market. Some of these projects involved smart phones measuring Bluetooth beacon signal strength, and I have some personal experience with these measurements. I have no specific hands-on experience for system setups where the beacon sending device is also a smart phone, so part of my opinion relies on extrapolating results from adjacent cases.

In the paragraph above I am on purpose not naming the specific company labs that I worked for. This is to stress that I am only speaking for myself here, speaking as an engineer and scientist with a PhD degree and broad subject-relevant experience. Curious readers will be able to find the missing names by searching the web.

In order to make the TL;DR statement above and the more detailed statements below, I inevitably had to factor in my understanding of what is currently known about the existence and statistics of the different mechanisms by which the covid-19 infection is transmitted by presymptomatic persons. My main source of reference has been the televised weekly expert briefing and Q&A for the Dutch government, but in the last few weeks I have also occasionally skimmed some of the preprint studies about infection routes that have been mentioned on the web. These studies give me a best current guess picture in which distance is not the only factor in the equation. That is actually good, because accurately measuring the exact distance of all contact events, especially short-duration contact events, with Bluetooth alone is pretty much impossible.

Best current guess on the viability of the current design

My expert best current guess is that distance measurements relying on Bluetooth exclusively, when further translated into cross-infection probabilities, will end up producing a warning message mix that will be useless or worse, because it will be entirely dominated by false positives and false negatives.

So I am giving a somewhat high-confidence negative prediction here, specifically for the April 12 white paper system. More generally, this prediction applies to any app based approach that takes the non-use of any data source, other than measurements of Bluetooth beacons transmitted between smart phones, as an axiomatic basis.

Obviously, I cannot prove a negative prediction, and note that I have not made any attempt to quantify the notion of 'not useless' in terms of numeric false positive and false negative rates. Informally, for me, useful means that the app materially adds to the beneficial effect of the more traditional WHO-recommended contact tracing methods which I expect will be used in the EU in the near future to manage R0 while lockdown measures are gradually lifted.

I could end this writeup here, noting that more research is likely needed if EU governments are to make a convincing case for the adoption of the novel type of contact tracing app we are talking about here.

But the obvious next question is: can accuracy be improved by extending the white paper app design, dropping the Bluetooth-exclusive approach?

What are the accuracy improvement options?

To get the false positives and negatives in messages to users down to a rate where the app will be useful, I expect that sensor data fusion techniques will have to be used. In these techniques, Bluetooth signal strength data is combined with other sensor data, and the combination is used to compute the risk metric that triggers the messages.

Ultrasound (or audible sound) based range measurements have already been mentioned as an additional sensor mechanism that can estimate distance. My expert opinion is that sound is a promising candidate. However, the use of sound will also make de-anonymization attacks by eavesdropping easier. There is a tradeoff that will need to be investigated. I have medium confidence that adding sound based distance measurements alone will not be sufficient.

It is important to note that sensor fusion can apply to more than just the sub-problem of improving distance accuracy. I expect that knowing the context of each Bluetooth contact event will end up being key to the next step in risk prediction: mapping contact event distance+duration estimates to cross-infection risk contribution estimates. Some examples of radically different contact event contexts are: 1) the user was walking down a street, 2) driving a car alone, 3) a passenger in a bus, 4) having dinner in a restaurant, 5) shopping in a supermarket. A trivial-case example: contact events detected while a self-reporting infected person was driving a private car alone, i.e. events that detected proximity to drivers in other cars, are in the optimal case ignored completely when calculating cross-infection risk.

I now turn to the question of how this contact event context data could be obtained. A first promising method is to use GPS-based location measurements, or more generally the location facilities of the smart phone OS. The location of a contact event can be used to get the terrain type of the location from a digital map, with the terrain type (road, residential area, shopping area) often yielding a reasonable context estimate. (I will talk about the privacy aspects of this processing in the next paragraph.) Automated map matching can be further augmented and improved by relying on the memory of the self-reported infected user, as triggered by reviewing a map that plots out a logged location history in detail. Were you in a private car or did you take a taxi? Slightly more high-tech, the accelerometer in the phone can be coupled to an algorithm that estimates the current type of phone movement: there are different fingerprints that can be detected for walking, driving, sitting stationary, etc.

I will now outline how the above context-based sensor fusion could happen in a privacy-friendly way by extending design 2 of the DP-3T white paper. The app is extended to measure and locally log the position of the user, as determined by the OS location services. However, these position logs are never uploaded to any central or distributed database. The logs are only accessed and processed once, during the data redaction step of design 2, where the infected person voluntarily works with the health authorities to prepare and upload time-ranges of EphID values (or cryptographic equivalent) from their phone. These uploaded EphID values will be annotated with context tags denoting context types, or alternatively only with a single numeric risk-adjustment multiplication factor. Phones from other users that obtain the redacted and annotated data from a backend server will use the annotations when computing personal risk scores.

Note that the redaction step above may also lead to the decision to simply omit the uploading of EphID data for large time periods altogether, e.g. at all times when the infected user was at home. This redaction decision can often be made because both the privacy (de-anonymization) risks and false positive risks of including this data would far outweigh any risk that omission would create a false negative leading to under-awareness of risk. I include this example specifically to show that privacy and accuracy are not always natural sworn enemies in every possible system design step.
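To make the annotation idea a bit more concrete, here is a minimal sketch in Swift (the language of the iOS SDK) of how a receiving phone could fold such an annotation into its risk computation. The type names, context categories and multiplier values are my own illustrative assumptions, not part of any DP-3T specification.

    // Illustrative sketch only: names, context categories and multiplier
    // values are assumptions, not part of any DP-3T specification.
    enum ContactContext {
        case unknown, privateCarAlone, outdoorStreet, publicTransport, indoorVenue
    }

    // Risk-adjustment multipliers the redaction step could attach to each
    // uploaded EphID time range (a single numeric factor, as suggested above).
    let contextRiskMultiplier: [ContactContext: Double] = [
        .privateCarAlone: 0.0,   // contacts through car bodies: ignored
        .outdoorStreet:   0.3,
        .publicTransport: 1.0,
        .indoorVenue:     1.2,
        .unknown:         1.0,
    ]

    struct AnnotatedExposure {
        let durationMinutes: Double
        let estimatedDistanceMeters: Double
        let context: ContactContext
    }

    // A receiving phone combines its own local distance/duration estimate with
    // the uploaded context annotation when computing its risk score.
    func riskContribution(for exposure: AnnotatedExposure) -> Double {
        let baseRisk = exposure.durationMinutes / max(exposure.estimatedDistanceMeters, 0.5)
        return baseRisk * (contextRiskMultiplier[exposure.context] ?? 1.0)
    }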

No guarantee of successful improvement

Though I am pointing out the routes to improvement which I consider to be most promising, I am far from sure that even the combined use of all likely improvements will be sufficient to achieve specific hoped-for levels of app effectiveness. It is possible that these levels will forever remain out of reach. My uncertainty here is caused for a large part by the current general uncertainty about the statistical rates for the various different infection transmission mechanisms, and how these rates couple to contact distance, duration, and location type.

Worries about never-GPS rallying cries

I have been looking at the media and privacy advocacy landscape in the last 10 days, admittedly from a mostly Dutch and English language perspective. What I see worries me. I see the emergence of a rallying cry that rejects the use of GPS categorically. In the phrasing of interview talking points and manifesto slogans, the term GPS is somewhat weirdly being used as a convenient shorthand for the unwanted scenario where a government collects a huge amount of location data without meaningful oversight or accountability.

The effect of carelessly labeling GPS as the enemy is that it might end up burning down some bridges too soon. I worry about an outcome where this meme crystallizes to create a climate in which it becomes impossible for any EU government to promote the voluntary mass adoption of any opt-in app, no matter how transparent or accountable, that leverages GPS in any way. That would be a very bad outcome.

noci2012 commented 4 years ago

The problem isn't GPS, it is the connotation that GPS most often means logging of GPS coordinates (lat/lon) and spreading those data around. The NYT has de-anonymised an anonymous dataset they obtained and contacted individuals they could identify from the data. https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html https://www.nytimes.com/interactive/2019/12/19/opinion/location-tracking-cell-phone.html https://www.nytimes.com/interactive/2019/12/20/opinion/location-data-national-security.html https://www.nytimes.com/interactive/2019/12/21/opinion/location-data-democracy-protests.html https://www.nytimes.com/2020/02/07/opinion/dhs-cell-phone-tracking.html

There isn't such a thing as anonymous location data unless it is bare lat/lon data (no related ID, however random it may be, and no timestamps).

henriterhofte commented 4 years ago

It took me a while, but now I see it: "Proximity" is in the name of DP^3T! It served us well to stress that detecting potentially infectious encounters could be done without storing location traces. However, now it starts to get in the way. Just like 'distance' in 'social distancing'.

Remember, the goal is NOT to trace proximity between people in the most accurate way we can.

Rather, the goal is to detect potentially infectious encounters between people, in such a way that if a few days later one of these is confirmed positive, the other can be notified.

In real life, no detector is 100% accurate. To fight COVID-19, an infectious encounter detector doesn't need to be. Its true positive rate in combination with its true negative rate just needs to be 'better' than the alternatives we have now. In fact even this is too simplistic, since the ultimate contribution a 'contact tracing app' makes to lowering the R0 value also depends on things such as users having compatible smartphones, users installing and not removing the app, users taking their smartphones with them and not turning them off too often, users taking responsible action after receiving an at-risk notification, and users reporting a confirmed positive test result in the app. Moreover, the effectiveness of a 'contact tracing app' also depends on having sufficient capacity to do testing, and this in turn also determines the number of false positives a detector may have in practice. Scientific simulations that also model the contact tracing app and the user behaviour around it (based on experience in other countries?) could provide more answers.

As argued also above, proximity (i.e. physical distance between 2 persons) is NOT the only factor in detecting infectious encounters.

Is there a word starting with P that captures the idea of a potentially infectious encounter better? Then we could change the name without changing the acronym DP^3T.

henriterhofte commented 4 years ago

As suggested above, ultrasound may help as an additional sensor to detect potentially infectious encounters better. Applying ultrasound can help in two ways:

However, applying ultrasound also has its drawbacks:

henriterhofte commented 4 years ago

And we should not forget one sensor that we can still throw into the mix: the human user!

If a mix of sensor input gets it right most of the time, it might be acceptable to leave some judgement to the human user. Rating a questionable infectious encounter situation just after it happened may be much easier than several days later, when you've just had the news of a positive test and are asked to remember your contact history, possibly being distracted by other worries at that moment.

gardners commented 4 years ago

We are going to start doing some near-ultrasound experiments with the MEGAphone prototypes we have here, as they give direct hardware access to the microphone and speaker, and their microphones, by a stroke of luck, are sensitive up to ~30 kHz or higher, with actual peak sensitivity at 26 kHz, if we can find a speaker that is good to that frequency. If we have any success, we'll be sure to report it.

kholtman commented 4 years ago

In my writeup above, writing as a location tracking technology expert, I concluded that an app relying exclusively on Bluetooth measurements will likely not be useful. So what would happen if you ask the same general question to an epidemiologist reasoning from their expertise?

TL;DR This sort of happened today [April 18]. Same best current guess negative answer.

To unpack this, the Dutch government did a webcast today, the first day of a two-day public-participation contact tracking 'appathon', involving 7 proposed apps and 7 proposing teams. This appathon is causing some controversy, but I won't talk about that here. The news relevant to this issue number is that detailed statements about first app impressions were made by invited representatives of the GGD. The GGD is the branch of the Dutch health authorities directly responsible for contact tracing. So these GGD first impression comments are probably as close as you can get right now in the Netherlands if you want to pin down an epidemiologist and ask for a public best current guess.

Here are subject-relevant highlights from the webcast, as summarized from notes I made.

Edited to add on April 20: I just noticed that this repository already contains a paper with earlier epidemiologist best current guess input at https://github.com/DP-3T/documents/blob/master/COVID19%20Response%20-%20What%20Data%20Is%20Necessary%20For%20Digital%20Proximity%20Tracing.pdf . This document is more optimistic about fully automated methods than the Dutch views reported above. Apologies for not referencing this earlier.

pzboyz commented 4 years ago

There is an aspect here being overlooked, namely the time aspect. Has there been much discussion or agreement on what period of time two people should be in close proximity of each other before deciding that the EphID should be logged? Using the RSSI from one advert is going to be very unreliable as a distance measurement. To maintain battery life, I assume the advertising and scanning rates will be chosen so there is detection at a rate of about once per second or so, so it could take 20 to 30 s to get a good estimate of range or distance.

And go have somebody buy 10 of the same phone model and plot the polar response of the antenna; they will not be consistent. So stop trying to create or calibrate fudge factors for popular makes and models of certain phones.

kholtman commented 4 years ago

@pzboyz

Has there been much discussion or agreement on what period of time two people should be in close proximity of each other before deciding that the EphID should be logged?

A useful level of discussion and agreement was on display among Dutch epidemiologists in the appathon mentioned above. They went to great lengths to state repeatedly that measured short-duration contact events, like two people walking past each other at <1.5m (and there might be a windowpane in between, or not, and Bluetooth will not know), have no relevance whatsoever. These events should simply never lead to the still-healthy person later receiving a warning that they need to self-isolate for N days. Without discarding short events, a huge number of people would be needlessly asked to self-isolate. So the epidemiologists could not care less about the accuracy of the distance measurement of the above short event. This is a positive for the technical problem of distance measurement, for the reasons you mention above.

For the record, to arrive at my negative best current expert guess above, I factored in an expectation that distance accuracy for any contact event less than 1 minute long is irrelevant, and that for events of less than 5 minutes, it is also not a huge concern.

And go have somebody buy 10 of the same phone model and plot the polar response of the antenna, they will not be consistent.

Somewhat related to this: inherent directional non-uniformity in smart phone antenna response has been bugging me, and on top of that comes the phone-in-pocket subcase. I searched the open web and have not found any useful angular plot data yet. I skimmed the tests at https://github.com/opentrace-community and https://github.com/pepp-pt/pepp-pt-documentation/tree/master/12-proximity-measurement (the last one appeared just 2 days ago), but need to read them in full later. In both of these, the phone-in-pocket experimental design is under-reported to the extent that I am not sure if it tells me anything at all about a contact event where one or two human bodies are blocking the direct line of sight between the two phone antennas.

By the way; I have experience in wireless test design, also test design that can compensate for not having access to fancy test chambers or calibrated test antennas. If anybody out there is planning tests and wants me to review or help with their test design, feel free to contact me by e-mail.

lbarman commented 4 years ago

Cross-referencing @noci2012's issue https://github.com/DP-3T/documents/issues/73 about using long-range transmitters to perform larger-scale sniffing/contact injection.

fneeser commented 4 years ago

It's pretty clear that estimating spatial distance / proximity based (only) on measurements of Bluetooth received power (aka RSSI, Receiver Signal Strength Indicator) may suffer from inaccuracies due to various problems, including:

P1: TX power of a given smartphone type may vary from packet to packet
P2: Inaccurate or missing information about TX power
P3: Missing information on effect of TX antenna
P4: Missing information on effect of RX antenna
P5: Directivity effects of the TX and RX antennas are neglected
P6: Smartphone carried in pocket vs. handbag, with or without blocking the line-of-sight path
P7: Idealized "free space" assumption for radio propagation as a rough approximation of the multi-path fading channel
P8: Short duration of individual Bluetooth beacons (advertisements)
P9: Relatively low rate of Bluetooth beacons (e.g. 1 beacon per second)

(Here I won't address alternative ways to estimate distance (e.g. via ultrasonic signals), which are already covered in another issue.)

As described in the paper

[1] Hans-Juergen Meckelburg, "Coronavirus COVID-19 - Calibration Method and Proximity Accuracy", Preprint, April 2020

referenced in the (closed) issue #204, the inaccuracies introduced by P2, P3 and P4 depend on the types of the communicating smartphones and result in significant systematic errors, unless they are eliminated by a calibration and correction method provided in the paper.

The points in [1] are very well taken: For example, even if TX power is constant over time, available on the receive side (through a BLE protocol field) and accurate, problems P3 and P4 (neglecting effects of antenna gains) may easily result in an RSSI variation of ±6 to ±12 dB for a fixed distance between the two smartphones, depending on their types.

For the free-space radio propagation model, a ±6 dB error in the measured RSSI corresponds to a distance mis-estimation by a factor of 0.5 or 2, respectively. A ±12 dB error in the measured RSSI corresponds to a distance mis-estimation by a factor of 0.25 or 4, resulting in a distance estimate of 50 cm or 8 meters instead of the correct 2 meters.
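To spell out the arithmetic: under the free-space model, an RSSI error of e dB scales the distance estimate by a factor of 10^(e/20). A tiny Swift sketch, for illustration only:

    import Foundation

    // Free-space model: an RSSI error of `errorDb` dB scales the distance
    // estimate by 10^(errorDb / 20), as in the examples above.
    func distanceErrorFactor(errorDb: Double) -> Double {
        pow(10.0, errorDb / 20.0)
    }

    print(distanceErrorFactor(errorDb:   6.0))  // ~2.0  -> 2 m read as ~4 m
    print(distanceErrorFactor(errorDb:  -6.0))  // ~0.5  -> 2 m read as ~1 m
    print(distanceErrorFactor(errorDb:  12.0))  // ~4.0  -> 2 m read as ~8 m
    print(distanceErrorFactor(errorDb: -12.0))  // ~0.25 -> 2 m read as ~0.5 m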

Therefore, I strongly suggest adopting the calibration and correction method of [1], so the systematic errors can eventually be eliminated by testing various smartphone types against a reference smartphone type [1].

Concretely, I looked at the current code in

dp3t-sdk-ios/Sources/DP3TSDK/Bluetooth/BluetoothDiscoveryService.swift

to see if and how P1 - P4, P8 and P9 are being addressed - noting that we have to live with P5, P7 and likely also with P6.

Also relevant is whether these problems are being addressed by the upcoming (iOS - Android) interoperability spec

[2] Apple and Google, Contact Tracing Bluetooth Specification, preliminary, April 2020, v1.1. https://covid19-static.cdn-apple.com/applications/covid19/current/static/contact-tracing/pdf/ContactTracing-BluetoothSpecificationv1.1.pdf

which DP3T is planning to use as soon as available.

Here's my current understanding (please correct me if I misunderstood anything in the code):

P1:

P2:

P3, P4:

P8, P9:

Specifically for P1 - P4, the calibration and correction method of [1] can be summarized as follows:

(i) The TX side provides a smartphone-type specific txPowerAdj, in addition to txPower.

(ii) The RX side provides a smartphone-type specific rxPowerAdj, in addition to rxPower == rssi.

(iii) A reference power level refPowerLevel is determined through calibration with two reference smartphones.

(iv) Based on the above, a distance estimate in meters is computed as follows:

    rxPower = rssi
    adjustedRxPower = rxPower + rxPowerAdj + txPowerAdj  // all variables in dB

    distance = (lambda / (4*pi)) * pow(10, (refPowerLevel - adjustedRxPower) / 20)
             ~= 0.01 * pow(10, (refPowerLevel - adjustedRxPower) / 20)

NOTE:

Addressing (i) - (iv) in the code looks straightforward. However, txPowerAdj is currently neither provided by iOS nor by Android in their advertisement headers - seems like a good idea to request its inclusion in [2].
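For illustration, here is a minimal Swift sketch of steps (i) to (iv), assuming the calibration constants for the two phone types are somehow available; the function and parameter names are mine, not from the SDK.

    import Foundation

    // Minimal sketch of steps (i)-(iv); function and parameter names are illustrative.
    // Assumes the calibration constants for the two phone types are available.
    func estimatedDistanceMeters(rssi: Double,
                                 txPowerAdj: Double,
                                 rxPowerAdj: Double,
                                 refPowerLevel: Double) -> Double {
        let adjustedRxPower = rssi + rxPowerAdj + txPowerAdj   // all values in dB
        let lambda = 0.125                                     // approx. wavelength at 2.4 GHz, in meters
        return (lambda / (4.0 * Double.pi)) * pow(10.0, (refPowerLevel - adjustedRxPower) / 20.0)
    }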

winfried commented 4 years ago

Contamination proximity is not circular and not static. Neither is measured proximity. So there will always be an error between these two. You can try to overshoot in your app, leading to many false positives; try to match as closely as possible, leading to both false positives and false negatives; or undershoot, mainly leading to false negatives.

Another approach would be using multiple channels to try to determine the dynamics of the contamination proximity, for example detecting walls / plexiglass with ultrasound, detecting sneezes by listening and/or movement, or using acceleration and tilt sensors to determine if people are facing each other and close together for a longer time. Such a compensation mechanism would be a research project on its own and probably has its own privacy impact.

But did anybody here try to estimate what amount of false positives or false negatives would be tolerable and, with that, what strategy for approximating the contamination distance would be best?

The paper by the Oxford research group (which has its issues, but let's ignore those for now) states that success in quarantining contacts should be at least 70-80%. The paper also says the delay between first symptoms and putting contacts in quarantine can't be longer than 24 hours, so there is no time for testing the people who have been in contact; they have to go into quarantine right away. If 10% of the population doesn't go into quarantine when the app says so, then the amount of tolerable false negatives can't be higher than 10-20%. That means we have to go with an overshooting strategy or a highly advanced adaptive strategy. But let's go for an overshooting strategy.

To compensate for indoor situations with a slow draft and people sneezing, a measured proximity of probably something like 3-4 meters is needed. On top of that comes the imprecision of the Bluetooth measurement. If it has an error margin (rain, obstruction by bodies or objects in bags, interference by other sources, etc.) of 4 meters (which is very optimistic; I usually read about Bluetooth proximity error margins of up to 10 meters), then that margin has to come on top, because we can barely afford false negatives. So everybody at a measured Bluetooth distance of something like 8 meters (or 14 if we take the error a bit bigger) has to go into quarantine right away.

For one trip to the supermarket to get my groceries, that can easily be something like 100 persons. Extrapolate that to several activities over 5-14 days, depending on how long you assume you have been contagious, and you are speaking about something like 1000-5000 people that have to go into immediate quarantine. In the current situation in the Netherlands, under lockdown and heavily under-tested (real numbers of infections can be up to a factor of 300 higher), there are about 50 positive tests a day per million inhabitants. That would mean that per million inhabitants, 50,000 up to 250,000 people a day need to go into quarantine. Multiply that by the length of quarantine, and the false positives will lead to an unacceptable number of people needlessly in quarantine.
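Restating the chain of numbers above as a back-of-envelope calculation (a Swift sketch for concreteness; every figure is one of the rough assumptions from the preceding paragraphs, not a measurement):

    // All figures are the rough assumptions stated above, not measurements.
    let positiveTestsPerMillionPerDay = 50.0
    let flaggedContactsPerCaseLow     = 1_000.0   // overshooting strategy, 5-14 days of activity
    let flaggedContactsPerCaseHigh    = 5_000.0

    let quarantinedPerMillionPerDayLow  = positiveTestsPerMillionPerDay * flaggedContactsPerCaseLow   // 50,000
    let quarantinedPerMillionPerDayHigh = positiveTestsPerMillionPerDay * flaggedContactsPerCaseHigh  // 250,000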

Conclusion: even when I make favourable assumptions, the narrow bandwidth between acceptable false negatives and acceptable false positives doesn't allow for automatic contact tracing to be successful.

lbarman commented 4 years ago

Hi @fneeser; thanks a lot for the detailed inputs. I do not have the answers to these questions personally, but this kind of analysis is certainly very helpful. I hope our Bluetooth team can soon pass by and answer the questions; it is anyway a good basis for discussion!

Hi @winfried; we know this is a critical point. Updating my summary on top. Thanks!

zukunft commented 4 years ago

@kaepora: To solve the traffic light issue, I personally would allow adding the GPS position to the Bluetooth contact on my local phone. Because it is open source, I can be sure that the GPS position never leaves my phone. And if I get the "someone has been tested positive" signal, my phone can show me the location of the contact. Because some people will perhaps not like this, I would make it optional in the app with the default setting https://github.com/DP-3T/dp3t-app-android/issues/32

Matioupi commented 4 years ago

@zukunft : besides the hard time we would have explaining to the public that enabling GNSS will not result in tracking them, there are still lots of unsolved cases with this technique (underground transport, etc.). Even the traffic light example still holds. The GNSS position alone (given actual urban accuracy) will have a hard time telling whether you are a pedestrian on the pavement waiting next to other people for the car flow to stop, or a driver stopped at a red light. I understand other parameters such as speed (current / past / future) or other information could be added, but sorting this out will be hard, and I believe such fine-grained analyses are well beyond the scope/schedule that seems to be driving the work right now and will not be included at the time of the first official app release.

zukunft commented 4 years ago

@Matioupi Yes, I agree with you that enabling GNSS will be hard to explain to many. I think this feature should not be in the first version of the standard app. And I agree that this will not solve all Bluetooth false-positive cases, but it can solve some. E.g. for the pedestrian problem that you mention, the app probably cannot decide if you were a pedestrian or in the "safe" car. But if the app shows the smartphone owner the place of the positive contact, the owner might remember whether he used the car or was walking. The first reaction of a user who gets a "positive" message could be fear, and it could be a great relief to know where it happened. My suggestion is that a separate "DP-3T app with private location remembering" is created in parallel on a fork.

peterboncz commented 4 years ago

Matioupi commented 4 years ago

@peterboncz : what you call the danger score is called the Risk Scoring function in the DP-3T documents. Issue #235 is about its specification, which is not yet defined (or at least not in the repo yet).

gbounch commented 4 years ago

Hello everyone, I have an operational proposal that starts from the fact that the conditions for detecting distances are too variable. BLE isn't accurate, and even if it were, the environmental conditions would be too variable. To make the measurements more precise, the AltBeacon library tried to create a database of device models with their signals, but, given the complexity of the measurement, the attempt at crowdsourcing the generation of the database failed. (https://github.com/AltBeacon/android-beacon-library/issues/967)

In our context, however, the real use case only needs to understand how devices behave at the droplet distance (2 meters), not at all possible distances. If the application you are building gets used, the user could be involved: the application signals an alarm, the user reports that the reported distance is incorrect, the user is asked to move to within 2 meters and then press a large button. In this way we could generate a database of device-model pairs with their behavior at 2 meters. Using the formulas above (from @fneeser):

    distance = (lambda / (4*pi)) * pow(10, (refPowerLevel - adjustedRxPower) / 20)
             ~= 0.01 * pow(10, (refPowerLevel - adjustedRxPower) / 20)

or the simpler Math.pow(10d, ((double) txPowerValue - rssiValue) / (10 * ambient))

We can calculate some average constants and an average behavior that refine the distance calculation. I am quite sure that with a large amount of data the precision can improve a lot.
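As a sketch of what I mean (Swift is used only for illustration, and all names are hypothetical): users confirm they are at roughly 2 meters, the app records the RSSI observed for that pair of device models, and averaging many such labeled samples gives a per-pair 2-meter threshold.

    // Hypothetical sketch: users confirm they are at ~2 m ("droplet distance"),
    // the app records the RSSI it saw for that device-model pair, and averaging
    // many labeled samples yields a per-pair 2 m RSSI threshold.
    struct DevicePair: Hashable {
        let txModel: String
        let rxModel: String
    }

    var labeledRssiAt2m: [DevicePair: [Double]] = [:]

    func recordCalibrationSample(pair: DevicePair, rssi: Double) {
        labeledRssiAt2m[pair, default: []].append(rssi)
    }

    func rssiThresholdAt2m(for pair: DevicePair) -> Double? {
        guard let samples = labeledRssiAt2m[pair], !samples.isEmpty else { return nil }
        return samples.reduce(0, +) / Double(samples.count)  // simple mean; a robust statistic would be better
    }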

We, with my company, are developing an application for use in companies. In a company the environment is much more controlled and we can think of involving users. Furthermore, the proximity relationships are always in very similar environments, probably allowing for fewer variables.

If you agree, we could base our implementation on this assumption, begin to gather data from a protected environment (with environmental constants with little variance) and share the results. Furthermore, if you also think the model can work, we could involve schools to do flash mobs in which students approach to 2 meters from each other with their smartphones and then press a button to calibrate... We are from Brescia, the area of Italy most affected by the coronavirus. I think there is a lot of sensitivity in this regard.

fneeser commented 4 years ago

@gbounch I'm not surprised that tuning power level adjustments based on user feedback did not work: For example, if one attempts to tune rxPowerAdj for a smartphone of type X while receiving beacons from a non-reference, non-calibrated smartphone of type Y, the tuned rxPowerAdj will also include an inappropriate correction to account for the missing or wrong parameter txPowerAdj of smartphone type Y.

Tuning for a smartphone type X should be done only in a controlled lab environment, and only while testing X against a reference smartphone type REF.

As described in the paper [1] Hans-Juergen Meckelburg "Coronavirus COVID-19 - Calibration Method and Proximity Accuracy", Preprint, April 2020 -- see issue #204

and in my summary https://github.com/DP-3T/documents/issues/188#issuecomment-618666631 above, three steps are required:

  1. Determine the parameter refPowerLevel (= CP_{RT}^{REF} in [1]) of a reference smartphone type, which by definition is a constant, i.e., independent of the distance between the two reference smartphones.
  2. Determine txPowerAdj of smartphone of type X by measuring REF <- X, specifically, txPowerAdj[X] = rxPower(to: REF, from: REF, distance: d) - rxPower(to: REF, from: X, distance: d) Note that txPowerAdj[X] is independent of distance d.
  3. Determine rxPowerAdj of smartphone of type X by measuring X <- REF, specifically, rxPowerAdj[X] = rxPower(to: REF, from: REF, distance: d) - rxPower(to: X, from: REF, distance: d) Note that rxPowerAdj[X] is independent of distance d.

While these parameters are independent of distance d, they can be determined accurately by repeating (and averaging) the calculations over multiple distances d.
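A minimal Swift sketch of steps 1-3, assuming paired measurements (REF <- REF, REF <- X, X <- REF) taken at several distances and non-empty input arrays; the names mirror the description above and are otherwise mine:

    // Minimal sketch of steps 1-3; assumes non-empty arrays with one
    // measurement per calibration distance d_i, all values in dB.
    struct CalibrationRun {
        let refToRef: [Double]   // rxPower(to: REF, from: REF, distance: d_i)
        let refFromX: [Double]   // rxPower(to: REF, from: X,   distance: d_i)
        let xFromRef: [Double]   // rxPower(to: X,   from: REF, distance: d_i)
    }

    // Both adjustments are distance-independent, so averaging the per-distance
    // differences only serves to reduce measurement noise.
    func adjustments(for run: CalibrationRun) -> (txPowerAdj: Double, rxPowerAdj: Double) {
        let txDiffs = zip(run.refToRef, run.refFromX).map { $0.0 - $0.1 }
        let rxDiffs = zip(run.refToRef, run.xFromRef).map { $0.0 - $0.1 }
        return (txDiffs.reduce(0, +) / Double(txDiffs.count),
                rxDiffs.reduce(0, +) / Double(rxDiffs.count))
    }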

I believe that the author of [1] is already working on a test specification, where I assume effects such as TX and RX antenna directivity can be averaged out.

gbounch commented 4 years ago

I know that your proposed model is rigorous, but it requires a large number of measurements for each device, and it is simply impossible for someone to start testing all devices. So even if perfect, it doesn't help. Is there any way to simplify it if we decide that we just need to know how 2 models of smartphones perceive each other at 2 meters? If I could have a large number of measurements of the signals perceived between two models (even in non-laboratory environments), could I infer the 2-meter threshold for this pair of device models with a good approximation?

helme commented 4 years ago

Hi all, to be honest I didn't read all posts here yet, but I've noticed that there is a major concern about using BLE only. Besides leveraging other approaches (like ultrasonic sound beacons), I'm still convinced that BLE is not as bad as you might think.

I'm a co-author of this paper where we analysed the data from an early version of PEPP-PT with experiments done in German barracks. In this paper we propose a simple yet powerful model for contact event detection. For this we proposed several metrics (functions of risk depending on distance) and a reference sequence (e.g. being closer than 1.5m for 15 minutes). Together with labelled data (distances and RSSI measurements), we were able to train a very simple linear regression model on three features (mean, max and length) and already achieved quite convincing true and false positive rates.
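For readers who want to see the shape of such a model, here is a minimal Swift sketch of a linear score over the three aggregate features (mean, max and number of samples). The weights and bias are placeholders, not the values fitted in the paper.

    // Placeholder sketch of a linear score over three aggregate RSSI features.
    // The weights and bias are illustrative, not the fitted values from the paper.
    struct ContactFeatures {
        let meanRssi: Double
        let maxRssi: Double
        let sampleCount: Double
    }

    struct LinearRiskModel {
        let wMean: Double
        let wMax: Double
        let wCount: Double
        let bias: Double

        func score(_ f: ContactFeatures) -> Double {
            wMean * f.meanRssi + wMax * f.maxRssi + wCount * f.sampleCount + bias
        }
    }

    func features(from rssiSamples: [Double]) -> ContactFeatures {
        ContactFeatures(
            meanRssi: rssiSamples.reduce(0, +) / Double(max(rssiSamples.count, 1)),
            maxRssi: rssiSamples.max() ?? -100.0,
            sampleCount: Double(rssiSamples.count))
    }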

As a disclaimer: these experiments were done with identical devices, which is of course not enough. But we also tested with different devices and observed only minor effects (once calibration is computed).

Please let me know if this is useful, I could share our code in case of necessity ;)

gbounch commented 4 years ago

Hi helme, for me this is very interesting. Your paper says that if we collect RSSI from all devices and the exposure time, asking a user (someone) to label their distance, we can have more data about more devices, letting this model react better. Do you think it is right that we can label only at droplet distance to feed the model?

helme commented 4 years ago

Do you think it is right that we can label only at droplet distance to feed the model?

Yes, sure, I think there are a lot of possibilities to scale the data-labeling process. Can someone point me towards the data-collection process in DP3T, especially the test done with the Swiss army? I think they already got labeled data using cameras.

Matioupi commented 4 years ago

@helme : When you did the study reported in your paper, were the phones actually carried by people or laid on a table or another type of stand? If carried by actual human bodies, did you take care that they could have very different relative orientations and phone holding methods? From what I read, the human body absorption at 2.4 GHz will be about 7 to 9 dB, which in RSSI ranging corresponds to something in the range of 2-2.5 m.

From what I read in non-COVID-related indoor positioning papers (where the phone is usually hand-held, as they address location-finding processes where you need real-time interaction with the app), non-fingerprinted RSSI-only ranging techniques (usually ranging to fixed calibrated beacons in those papers) yield about 1 to 1.5 m average error when sampled for long enough. The min-to-max range seems to be more like 2.5-3 m.

Do you have an idea of the sensitivity of your model training to the distance accuracy? (In other words, if real-world data collection conditions had 2-2.5 m errors, how would your model behave?)

helme commented 4 years ago

When you did the study reported in your paper, were the phones actually carried by people or laid on a table or another type of stand? If carried by actual human bodies, did you take care that they could have very different relative orientations and phone holding methods? From what I read, the human body absorption at 2.4 GHz will be about 7 to 9 dB, which in RSSI ranging corresponds to something in the range of 2-2.5 m.

The phones were carried in the hands. In later studies we also tested with different holding positions (head, pocket, hand) and observed no major effects. Nevertheless this is a valid concern. But we observed that these effects cancel out when collecting data over a longer period of time.

Do you have an idea of the sensitivity of your model training to the distance accuracy? (In other words, if real-world data collection conditions had 2-2.5 m errors, how would your model behave?)

This has not been studied in detail yet, but the model will just behave somewhat "linearly", i.e. the risk increases with higher aggregated RSSI values. This can be overcome by calibration (as described in #204).

pzboyz commented 4 years ago

@helme wrote:

Hi all, to be honest I didn't read all posts here yet, but I've noticed that there is a major concern about using BLE only. Besides leveraging other approaches (like ultrasonic sound beacons), I'm still convinced that BLE is not as bad as you might think.

For me personally, my comments are aiming to guide this group toward not putting too much effort into absolute accuracy, as there are just too many variables. The structure of the buildings plays a big factor. A lot of thought and energy can be put into this subject for realistically very little reward.

The problem with any calibration is that you have to rely on the user to estimate what 2 m / 6 ft looks like. Can you trust them to do that? And would a malicious person perform the calibration such that it convinced the app that 6 m / 18 ft should be seen as 2 m / 6 ft, thereby creating many false positives if they later do test COVID-19 positive?

@helme wrote:

Please let me know if this is useful, I could share our code in case of necessity ;)

I looked at the paper. I did not see the scanning and advertising rates used. If you used something like the NrfConnect app to do these tests and have set rates to high values, this is not realistic for long term use as it would drastically reduce battery life. At advertising rates of 250 to 300ms and scanning duty cycle of less than 5% (it is the scanning duty cycle which reduces the battery life more than the advertising rate, but you do not want to crank up the advertising rate and pollute the airspace with advert packets) a device may detect packets from each device at about 2s intervals. So how many of those packets did you process and average? No scheme should be based on single detected BLE adverts.

gbounch commented 4 years ago

Thank you @pzboyz. When you say that accurate ranging gives little reward, you are saying that it is out of scope to notify the user that they are too near to someone else. In the use cases I am facing, in enterprises, this requirement is mandatory because entrepreneurs want to use applications like this to enforce security, and the lack of accuracy leads to a number of false positives that could make the application less effective. Do you think that with the current approximations it can still be effective?

Furthermore, if the user is asked to place himself at "droplet" distance, we may not know exactly how far away he actually is (2 or 6 m), but a machine learning model can still give a good approximation and remove measurements that are far too large.

Moreover, if we decide that this approach can work, then we can proceed with user support using ultrasound or the camera during the "labeling" phase, to be sure that the distance between the two devices is correct.

helme commented 4 years ago

For me personally, my comments are hoping to guide this group into not putting too much effort into aiming for absolute accuracy as there are just too many variables. The structure of the buildings play a big factor. A lot of thought and energy can be put into this subject for realistically very little reward.

I totally agree on that! Nevertheless we need to come up with a reasonable statistic on received BLE beacons and their RSSI values. This is why we propose simple linear models, for which we already observe convincing results.

I looked at the paper. I did not see the scanning and advertising rates used. If you used something like the NrfConnect app to do these tests and have set rates to high values, this is not realistic for long term use as it would drastically reduce battery life. At advertising rates of 250 to 300ms and scanning duty cycle of less than 5% (it is the scanning duty cycle which reduces the battery life more than the advertising rate, but you do not want to crank up the advertising rate and pollute the airspace with advert packets) a device may detect packets from each device at about 2s intervals. So how many of those packets did you process and average? No scheme should be based on single detected BLE adverts.

To be honest, I don't know about the exact technical details here, but we observed frequencies according to this plot: [plot of received beacon rates, not reproduced here]

i.e. most samples were received at 3 Hz, some more, some less. Sometimes we also observed multiple beacons in a very short time period (maybe effects of reflection?). But I think this can and should be downsampled to save battery. Depending on the experimental setup, we received samples of different lengths (at least two minutes, at most 30 minutes), so a different number of RSSI values per sample. In order to build reliable labels we resampled the data to 1 Hz. But of course this can be adapted to your needs.

Does this answer your questions?
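As an illustration of that resampling step, here is a minimal Swift sketch that buckets raw (timestamp, RSSI) samples into 1-second bins; aggregating each bin by its mean is my assumption, max or median would also be reasonable.

    import Foundation

    // Sketch of the resampling step: bucket raw (timestamp, RSSI) samples into
    // 1-second bins and keep one value per bin (here: the bin mean).
    func resampleTo1Hz(_ samples: [(time: TimeInterval, rssi: Double)]) -> [Double] {
        var bins: [Int: [Double]] = [:]
        for sample in samples {
            bins[Int(sample.time), default: []].append(sample.rssi)
        }
        return bins.keys.sorted().map { second in
            let values = bins[second]!
            return values.reduce(0, +) / Double(values.count)
        }
    }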

gbounch commented 4 years ago

Hi, we are planning to create some test data of the BT behavior in handshakes to train a model. From this dataset we plan to extract these features:

Later we will think about labeling data with something about location (inside/outside, town, open space) using the GPS info. @helme could you share your datasets and your code? Could we share our results with you?

kholtman commented 4 years ago

@pzboyz @helme I have not been following all posts in this thread in the last 6 days, but I want to say I agree with and echo the warning

my comments are hoping to guide this group into not putting too much effort into aiming for absolute accuracy as there are just too many variables

See also my first post above, where I recommend a sensor fusion approach, to improve the accuracy of infection risk prediction given how bad I expect Bluetooth to be, and also given my knowledge that Bluetooth accuracy can depend a lot on the location you are in.

But for those who are not discouraged by this:

@gbounch I have a lot of experience with test design for the type of large-parameter-space RF test you are planning to conduct. Feel free to reach out to me if you would like me to spend an hour with you in a telco to interactively discuss the test setup you are considering. There is an option for you to use wifi packet unicast/broadcast and sniffing on Android (not sure if this is also possible on iOS) to greatly accelerate exploring the test space for worst-case outliers. This option exists because almost all smart phones use the same tx/rx antenna for both Bluetooth and WiFi.

kholtman commented 4 years ago

(This post uses radio technology jargon I usually try to avoid when posting here, so the intended audience of this post is RF technology experts only.)

@helme thanks for clarifying some details about the design of the German range tests. You write

The phones were carried in the hands. In later studies we also tested with different holding position (head, pocket, hand) and observed no major effects. Nevertheless this is a valid concern from you. But we observed that these effects cancel out when collecting data over a longer period of time.

It is a relief to know you also did pocket tests; I had been wondering about that, but I still have open questions:

My own gut feeling and experience:

For ranges under 4 m, the existence or non-existence of local multipath reflections will play a major role in determining RSSI-to-distance accuracy, with the floor (carpet-covered concrete? grass?) the measurement is done on and the presence or absence of a ceiling being the major multipath drivers. For longer distances the path loss exponent matters more, but at the shorter distances that are mainly of interest here, you are in a 1x1 antenna multipath fading regime that may or may not have a dominant line-of-sight path.

The hopping between the 3 Bluetooth advertising channels could however be said to give you 3x1 antenna diversity, even if the phones are fully stationary, e.g. on desks. This compensates.

From personal experience measuring this, people moving even slightly may cause completely new multipath fading effects, even with phones on desks but especially if the phones are in their pockets. This gives interesting opportunities, if you have enough samples, to filter RSSI looking for peaks and not medians or averages, to compensate for both a) the possible LoS blocking (more accurately, dampening) by the phone-adjacent human body/bodies if phones are in pockets (pants front/back or shirt), and b) the inherent unevenness of the directional field of the antenna, which is often large in a phone, much more so than in stationary devices that are usually tested. Typically the phone Bluetooth/WiFi antenna is a single ceramic antenna on the edge of one of the phone PCBs, acting against a nearby ground plane on the PCB. I believe the middle of the bottom side of the phone is the most popular position; if you know what these antennas look like, you can look at pictures posted on the web by tear-down enthusiasts to locate the antenna for a particular phone model.
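To illustrate the peak-oriented filtering idea in code, here is a minimal Swift sketch that takes an upper percentile of a contact event's RSSI samples instead of the mean or median; the 90th percentile is an arbitrary illustrative choice.

    // Sketch of peak-oriented filtering: take an upper percentile of the RSSI
    // samples of a contact event, on the assumption that the strongest samples
    // are least affected by body blocking and antenna nulls.
    func peakRssi(_ samples: [Double], percentile: Double = 0.9) -> Double? {
        guard !samples.isEmpty else { return nil }
        let sorted = samples.sorted()
        let index = min(sorted.count - 1, Int(Double(sorted.count - 1) * percentile))
        return sorted[index]
    }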

I have seen some speculation on whether tx/rx adjustment factors for a particular phone model vary even between different serial-number devices of the model, but from my experience, cross-serial-number variation should be minor, less than 3 decibels or so. If cross-serial-number measurements are done without a test stand in an anechoic chamber, I expect that noise from the factors described above might easily be mistaken for a much larger cross-serial-number variation.

All the above suggests that you need to have test subjects role-play chatting together standing, or sitting at an office table (or an adjacent table in a restaurant setting), for time periods of several minutes, not tens of seconds, if you are looking to establish best-case estimates. And you need to test all 4 possible orientations of a phone in a pocket.

I made some comments on channel saturation in #170.

[May 1: updated to add: It just occurred to me that one might be able to do filtering that first uses the correlation or non-correlation of RSSI values from the different advertising channels as a detector to find out if you have a dominant line of sight path between the two antennas, or whether line of sight is blocked. This yes/no bit could then be used to select one of two different filters tuned for that particular case, which would improve overall accuracy. This is fairly speculative, but worth a try.]

helme commented 4 years ago

@helme could you share your datasets ad your codes? Could we share our results with you?

@gbounch I'm still asking people about this, but I haven't received an answer yet. Until then I'm not allowed to share, but I'll keep you updated about this. Nevertheless, I could share some model parameters of our linear regression model (it's just a 3-dimensional weight vector and a bias term, since we observed that this is already enough given so much noise).

Anyway, I'm interested in the experiments you are going to do and I could help you analyze the data. Maybe together in a telco with @kholtman and @gbounch?

helme commented 4 years ago
  • What I can't figure out (also not when looking at the more detailed figures in the longer reports in the PEPP-PT document repository) is the freedom or non-freedom the subjects were given on the direction they were facing, not just position occupied. Was the direct line of sight between the phones in pocket ever blocked by human bodies? Did you separate out the blocked/non-blocked cases in your analysis?

Unfortunately I don't know, but I'm pretty sure that they had some freedom in direction (at least, from the signal I would expect this issue). To be sure, I could ask some people for the video recordings made during the experiments in order to reverse-engineer this.

  • How many minutes (and RSSI samples) did the persons stay in the position at each stop in the floor pattern?

The same experiment was done for four different stopping times, i.e. 2, 4, 6 and 10 minutes.

  • Can you quantify the 'longer period of time'? (See remarks on my own experience below.)

Depending on the experimental setup (2, 4, 6 or 10 minutes of stopping time), we recorded peer-to-peer Bluetooth connections of up to 30-45 minutes (in the case of 10 minutes stopping time). But I can deliver more precise statistics if needed.

  • Can you tell us the type of floor (wood, carpet covered concrete, ??) used and presence of other nearby reflectors? (See remarks on my own experience below.)

The setup was duplicated five times, where three were indoor (conference room with concrete floor) and two outdoor (probably concrete floor too, because they needed to fix the markings on the floor). Please see #235 for a summary-plot of one day of experiment.

For ranges under 4 m, the existence or non-existence of local multipath reflections will play a major role in determining RSSI-to-distance accuracy, with the floor (carpet-covered concrete? grass?) the measurement is done on and the presence or absence of a ceiling being the major multipath drivers.

@kholtman you are far more experienced in this topic than me, because I'm just a computer scientist (data science and machine learning in general, no experience with these kinds of measurements). We also thought about the same issues, but haven't investigated them more deeply yet. For this we need more experiments.

This gives interesting opportunities, if you have enough samples, to filter RSSI looking for peaks and not medians or averages, to compensate for both a) the possible LoS blocking (more accurately, dampening) [...] and b) the inherent unevenness of the directional field of the antenna, which is often large in a phone, much more so than in stationary devices that are usually tested.

I agree on that; this is why we use the "maximum", besides "length" (i.e. the number of beacons received before or after resampling) and "mean", as input features to our linear regression model for risk estimation.
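For illustration, a minimal sketch of such a linear risk model over one contact window is shown below; the three features mirror the ones named above, but the weights and bias are invented placeholders, not the fitted parameters.

```python
import numpy as np

def risk_score(rssi_samples, w=(0.02, 0.03, 0.01), b=-0.5):
    """Linear risk estimate from one contact window. The features are the
    maximum RSSI, the mean RSSI and the number of received beacons; the
    weights w and bias b are placeholders, not the fitted model parameters."""
    rssi = np.asarray(rssi_samples, dtype=float)
    features = np.array([rssi.max(), rssi.mean(), len(rssi)])
    return float(features @ np.asarray(w) + b)

print(risk_score([-72, -68, -75, -70, -66]))
```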

I have seen some speculation on whether tx/rx adjustment factors for a particular phone model vary even between different serial-number devices of the model, but from my experience, cross-serial-number variation should be minor, less than 3 decibels or so. If cross-serial-number measurements are done without a test stand in an anechoic chamber, I expect that noise from the factors described above might easily be mistaken for a much larger cross-serial-number variation.

We didn't look at cross-serial-number variation but rather at cross-device variation (between different models). There we observed major deviations (up to 10-20 dB). For this we computed calibrations for each combination of devices, which is mainly an additive term that adjusts for the inter-device differences.
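As a sketch of what such an additive per-device-pair calibration might look like (the model names and offsets below are invented placeholders, not measured values):

```python
# Lookup table of pairwise offsets in dB, e.g. derived from measurements at a
# reference distance; entries here are placeholders for illustration only.
CALIBRATION_DB = {
    ("PhoneModelA", "PhoneModelB"): 7.0,
    ("PhoneModelA", "PhoneModelC"): -3.5,
}

def calibrated_rssi(raw_rssi_db: float, tx_model: str, rx_model: str) -> float:
    """Apply the pairwise offset before any distance / risk estimation."""
    offset = CALIBRATION_DB.get((tx_model, rx_model), 0.0)
    return raw_rssi_db + offset
```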

All the above suggests that you need to have test subjects role-play chatting together while standing, or sitting at an office table (or an adjacent table in a restaurant setting), for time periods of several minutes, not tens of seconds, if you are looking to establish best-case estimates. And you need to test all 4 possible orientations of a phone in a pocket.

Yes, you're right, we should think about all these scenarios. Is the data from the experiments with the Swiss Army already accessible? I think they have already done quite interesting experiments. Does anybody know more about this?

kholtman commented 4 years ago

@helme You write

I'm still asking the people responsible about this, but I haven't received an answer yet. Until then I'm not allowed to share, but I'll keep you updated.

I can understand their reluctance here: publishing an open data set requires a lot of time and effort to document the experimental conditions (e.g. the experimental details I was asking about above), and this time cannot be spent on other pending work. In software project management terms, you have a Brooks's law problem.

That being said, I want to encourage you to share my radio technology expert post above with the RF specialists you have been working with -- I feel that this may be useful to them if they are planning a setup for additional experiments.

helme commented 4 years ago

That being said, I want to encourage you to share my radio technology expert post above with the RF specialists you have been working with -- I feel that this may be useful to them if they are planning a setup for additional experiments.

@kholtman indeed I have a Brooks's law problem. Unfortunately, due to the total mess of the last two weeks, I stopped working on this topic and therefore don't know which RF specialists are currently working on it. According to @gannimo, DP-3T will also release their data soon.

merv1000 commented 4 years ago

I'm very out of my depth here, but I've been trying to do some research for a project and came across GitHub. Does anybody think this tech could be part of a solution to your problem? https://wyldnetworks.com/wyld-mesh-1-0

merv1000 commented 4 years ago

https://wyldnetworks.com/wyld-mesh-covid-19-proximity-contact-tracing/

kbobrowski commented 4 years ago

@helme could you share your datasets and your code? Could we share our results with you?

@gbounch I'm still asking the people responsible about this, but I haven't received an answer yet. Until then I'm not allowed to share, but I'll keep you updated. Nevertheless, I could share some parameters of our linear regression model (it's just a 3-dimensional weight vector and a bias term, since we observed that this is already enough given how much noise there is).

@helme your paper looks very interesting - maybe we would be able to estimate risk scores accurately even if the distance measurement at a single time point is not accurate. I'd also be interested in the raw data; in the meantime it'd be great if you could share your model parameters!

I've done some experiments with RSSI measurements and distance estimation depending on the way the phone is worn; indeed the human body blocks the signal quite well, and wearing the phone in a back pocket (with the receiver in front) results in significant distance measurement errors (true distance ~1 m, measured distance >10 m). But perhaps this averages out over a relevant contact period.

Recording of the experiment: https://www.youtube.com/watch?v=SAi24ctpyZQ

It got me thinking, though, about relatively stationary but close situations, e.g. public transport, where two people might sit next to each other with relatively little movement, and with a lot of body blocking the signal depending on where they carry their phones. Perhaps we could vary the RSSI threshold with the number of devices detected nearby - more devices would mean a weaker signal strength is required for a "contact event".
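A minimal sketch of such an adaptive threshold; all constants here are placeholder assumptions for illustration, not calibrated values.

```python
def contact_rssi_threshold(n_devices_nearby: int,
                           base_threshold_db: float = -70.0,
                           step_db: float = 2.0,
                           max_relax_db: float = 10.0) -> float:
    """Lower the RSSI threshold when many devices are visible, on the
    assumption that a crowded bus or train implies more body blocking."""
    relax = min(step_db * n_devices_nearby, max_relax_db)
    return base_threshold_db - relax

def is_contact_event(rssi_db: float, n_devices_nearby: int) -> bool:
    return rssi_db >= contact_rssi_threshold(n_devices_nearby)
```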

pdehaye commented 4 years ago

@kbobrowski you can find the same assessment here: the shadowing effect due to a body is about 20 dB, which amounts to a factor of 10 in the measured distance.
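For reference, the factor of 10 follows from a log-distance path-loss model with exponent n = 2 (a free-space-like assumption); a small sketch of the arithmetic:

```python
def apparent_distance_factor(shadow_db: float, path_loss_exponent: float = 2.0) -> float:
    """Under a log-distance path-loss model,
    RSSI(d) = RSSI(d0) - 10 * n * log10(d / d0),
    an extra attenuation of shadow_db inflates the apparent distance
    by a factor of 10 ** (shadow_db / (10 * n))."""
    return 10 ** (shadow_db / (10 * path_loss_exponent))

print(apparent_distance_factor(20.0))  # 10.0 for n = 2
```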

kbobrowski commented 4 years ago

@pdehaye thanks, interesting link!

sslHello commented 4 years ago

@peterboncz, @kaepora, @zukunft: reduce false positives (referencing one of the earlier comments, e.g. https://github.com/DP-3T/documents/issues/188#issuecomment-619414523)

* rather than measuring distance, the app should measure danger, and compute a **danger score** for BLE logged contacts. Estimated distance is of course part of the score; interaction time is another.

May I suggest a relatively simple way to reduce false positives, see "Use TOTP (RFC-6238) to generate EphIDs from daily changing SK(t)" (#303): collect more than one EphID within the contact time, ideally as many as possible. If the EphID changes with every beacon that is sent, you can later recognize them as having been sent by the same person once you have received the daily secret key of a person who tested positive. You need to store at least all EphIDs in a time frame before and after you have received strong signals, so that you can raise the risk if the exposure duration exceeds a threshold given by the virus experts. This also helps reduce false positives caused by collisions of EphIDs, which may occur because the EphID size is restricted to 128 bits by the data size of BLE beacons. So, e.g., at least two successive EphIDs, or 2 out of 3 EphIDs, could be required to match the contact, provided at least one of them was received with a strong signal. To further reduce the risk of collisions, and additionally to reduce the data volume and increase performance, check whether some additional bits could be used to roughly indicate regions, e.g. https://github.com/AltBeacon/spec (-> MFG RESERVED or REFERENCE RSSI).
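A rough sketch of how such a matching rule might look; the HMAC-SHA256 derivation and every parameter below are assumptions made for this sketch only, not the construction specified in #303 or in the DP-3T design.

```python
import hashlib
import hmac

def eph_id(daily_sk: bytes, beacon_counter: int) -> bytes:
    """Derive a 128-bit EphID from the daily secret key and a per-beacon
    counter (TOTP-like, with the counter playing the role of the time step)."""
    mac = hmac.new(daily_sk, beacon_counter.to_bytes(8, "big"), hashlib.sha256)
    return mac.digest()[:16]

def matches_positive_key(received, daily_sk: bytes, max_counter: int,
                         strong_db: float = -60.0) -> bool:
    """`received` is a list of (eph_id, rssi_db) tuples in beacon order.
    Flag a contact only if at least 2 of 3 successive beacons match EphIDs
    derivable from `daily_sk`, and at least one of those matches was strong."""
    derivable = {eph_id(daily_sk, c) for c in range(max_counter)}
    hits = [(eid in derivable, rssi) for eid, rssi in received]
    for i in range(max(len(hits) - 2, 0)):
        window = hits[i:i + 3]
        n_match = sum(1 for ok, _ in window if ok)
        strong = any(ok and rssi >= strong_db for ok, rssi in window)
        if n_match >= 2 and strong:
            return True
    return False
```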