App/server overview is not sufficient to address privacy conerns

gh-andre commented 4 years ago

The README for this project is very generic and not useful at all in determining how the app works. The linked video is too deep in the weeds of how to run Docker, displays functions, hours rounding, etc, but fails to identify how exactly device proximity data to other diagnosis keys is stored. Something does not add up here.

Consider this database schema in the video at this time stamp:

https://youtu.be/5GNJo1hEj5I?t=885

It shows the diagnosis key in key_data and transmission_risk_level that clearly is computed elsewhere, but nothing says where that other data is stored and what is the retention policy for that data.

Specifically, since diagnosis for any diagnosis key always comes in later (i.e. after the interaction was already recorded), all diagnosis keys and their proximity to other diagnosis keys must be stored somewhere until one of them receives a positive diagnosis.

So, where exactly this information is stored? You need to describe it better, including on the government page for the app, instead of just saying a bunch of words that sound secure, like "random codes".

obrien-j commented 4 years ago

Thanks the questions @gh-andre, we've had a comment on this previously on the app's repo as well.

I'm going to move this issue up into the 'covid-alert-documentation' repository as well, as that'll help us centralize efforts on improving the clarity and consistency of our documentation.

Do you have a specific set of questions that you'd like clarity on to help us focus our efforts?

obrien-j commented 4 years ago

It seems like you're curious about how Rolling Proximity Identifiers (aka 'random codes', or RPI's) and Diagnosis keys (aka: different 'random codes') are stored?

If so, the diagnosis keys, which are just re-named Temporary Exposure keys (randomly generated 16-byte values), are stored in the national servers database without any linkage to other pieces of information. RPI's that your phone broadcast are stored by other phones in their exposure notification subsystem. This information is never presented to the application itself, and is only ever used by the subsystem to match against diagnosis keys that the app downloads from the national server itself. This matching process leverage several factors to determine if there's an exposure, one of which as you mentioned 'proximity', is actually represented by a signal attentuation, measured by a ratio of transmit power and received signal strength, which are stored with the RPI's locally.

gh-andre commented 4 years ago

The point is, the key for deploying this app is to gain some community trust and vague descriptions are not helping. If somebody doesn't understand how it works, they will not read Google's contact tracing scanning flow sequence diagrams, but simply will shrug off this app.

You need to present a diagram of interactions in human-readable terms, describe what databases are used, where data is stored in general (i.e. mentioned databases, owner's device, other people's devices), as well as to assure people that there is no retention of data in any form, such as backups (I saw an issue here talking about one-day backups and a system like this shouldn't have any, but should only rely on redundant storage; there must be no backups).

Specifically, you need to describe what happens in any of the following scenarios:

if any of the databases is compromised, what will attacker gain from that information?
if somebody gets their copy on random keys for other people on their own device, what can they gain?
if somebody steals my device, what they will gain?
can somebody fake positive COVID-19 diagnosis?

As an example, you would describe that Rolling Proximity Identifier is generated every 15 minutes, so if somebody followed a person with the same device in Bluetooth signal proximity, they wouldn't track the person because RPI's are generated every 15 minutes. Same for all bits - daily tracing keys, diagnosis keys, etc.

If you don't describe those, no amount of doctors saying they love this app will help installing it on as many devices as is needed for contact tracing.

This information is never presented to the application itself

That's security by obscurity and is irrelevant. You should assume that any user can hack their own device to gain full access to this data, so your narrative should be that this data, if obtained by the device holder, cannot be used to retrieve diagnosis data for these RPI's.

Lastly, if it is all anonymous, how would a health professional contact a person if they are sick? None of the descriptions are clear on that, which suggests that there's something in the database that tracks a person for this contact. You need to describe that a health professional contacts a person whether they have an app or not, gives them the diagnosis and if the person has the app, then they also get the one time code to notify others. Clarity is important.

kidesign commented 4 years ago

These are very legitimate concerns. I echo these questions:

how does public health generate the secret key to provide to person infected with COVID, is it a pool of numbers
if I ran a script on a phone emulator could I get a list of “working codes” using brute force.
if a device is deemed rogue how can we disable it.

Sent from my iPhone

On Aug 2, 2020, at 5:24 PM, Andre notifications@github.com wrote:

The point is, the key for deploying this app is to gain some community trust and vague descriptions are not helping. If somebody doesn't understand how it works, they will not read Google's contact tracing scanning flow sequence diagrams, but simply will shrug off this app.

You need to present a diagram of interactions in human-readable terms, describe what databases are used, where data is stored in general (i.e. mentioned databases, owner's device, other people's devices), as well as to assure people that there is no retention of data in any form, such as backups (I saw an issue here talking about one-day backups and a system like this shouldn't have any, but should only rely on redundant storage; there must be no backups).

Specifically, you need to describe what happens in any of the following scenarios:

if any of the databases is compromised, what will attacker gain from that information?
if somebody gets their copy on random keys for other people on their own device, what can they gain?
if somebody steals my device, what they will gain?
can somebody fake positive COVID-19 diagnosis?

As an example, you would describe that Rolling Proximity Identifier is generated every 15 minutes, so if somebody followed a person with the same device in Bluetooth signal proximity, they wouldn't track the person because RPI's are generated every 15 minutes. Same for all bits - daily tracing keys, diagnosis keys, etc.

If you don't describe those, no amount of doctors saying they love this app will help installing it on as many devices as is needed for contact tracing.

This information is never presented to the application itself

That's security by obscurity and is irrelevant. You should assume that any user can hack their own device to gain full access to this data, so your narrative should be that this data, if obtained by the device holder, cannot be used to retrieve diagnosis data for these RPI's.

Lastly, if it is all anonymous, how would a health professional contact a person if they are sick? None of the descriptions are clear on that, which suggests that there's something in the database that tracks a person for this contact. You need to describe that a health professional contacts a person whether they have an app or not, gives them the diagnosis and if the person has the app, then they also get the one time code to notify others. Clarity is important.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/cds-snc/covid-alert-documentation/issues/10#issuecomment-667726486, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGAFWJOUM4LCBSWPBL3OBYDR6XKPDANCNFSM4PSWRVUA.

gh-andre commented 4 years ago

The more I browse through changes from the original design, the less confident I get from this forked development. There is a reason Google and Apple developed specific guidelines and changing them to "adapt" to local health care needs is only making it more vulnerable to various privacy issues.

For example,

I see now that the lifetime of one time codes has been extended from 10 minutes to 24 hours, so now somebody getting their phone call in a restaurant and writing it down on a napkin leaves this code available to others for 24 hours.
There is apparently some logging in the app and no description of log retention policy or what's in the log.
There's a new hashID introduced to allow people issue their own claim codes, without any documentation of possible security implications and measures to mitigate them.
There are some AWS SNS alerts added without clear description of what is being sent and how this data is retained.

These are just things that popped up on the surface, without looking at the code much. You should be running Google's original server and app with transparent practices on log retention, backups, etc. This way we at least would know the code went through some security experts at Google and Apple.

If you want to change how it behaves, you should submit your changes to Google, so they adapt their code under their security practices and guidelines. Taking a secure server and changing it up to make it more "convenient" for various local parties is how security and privacy breaches are created.

At the very least you should commission Google folks to review your changes and provide their guidance in terms of security and privacy.

burke commented 4 years ago

For what it's worth, it wasn't Google that developed the upstream pre-fork codebase. That was built, as a reference implementation of Apple/Google's Exposure Notification frameworks, prospectively for an ultimate handoff to CDS or some other Canadian government entity, and in as simple and privacy-preserving a manner as possible. Changes and improvements are being made here, post-handoff, rather than in the upstream repo, because the project also took on a life of its own outside of Canada and it's much simpler for CDS to not worry about conflicting with those other users while prioritizing getting COVID Alert working for Canada. We do actually intend to take most or all of these changes upstream, but just haven't got around to it yet. Speaking as a maintainer of one part of that upstream project, I can safely say that I haven't come across a change in this fork that I haven't found completely compatible with our original privacy stance and vision.

gh-andre commented 4 years ago

@burke

There is a point being missed here. Developers are terrible at security, in general, and things like bumping up expiry for one-time codes from 10 minutes to 24 hours are very questionable and should be confirmed by those with security skills, whether it's the original designers or an outside security consulting company.

The app will be used only if people have confidence in privacy features of the app and so far statements like "random identifiers" are misleading because if they were random, you wouldn't be able to receive a diagnosis key, which is a daily key and match it up against rolling proximity identifiers.

You need to come up with a good diagram that shows how the app interacts with other phones, how it receives diagnostic keys and matches them up against RPE's, whether a tracing key, which uniquely identifies the device, ever gets into the cloud database, what the device public key in the database is used for, and so on. This is just to show people what the app does.

On top of that, you need to describe how data can be exploited (or not) if a database gets compromised, if person's device gets compromised or if a remote device gets compromised with my keys on it. This description should be done by a security-aware person and not a developer.

Lastly, you need to describe retention policies for logs, database backups, encryption at rest policy for AWS database, and any communication with AWS, like SNS messages.

One important point in all of this, I'm not asking this for myself to entertain my curiosity about the app. If you want the app to be used, this is what needs to be done to convince the layer of the population who are not completely against it, but not quite sure what the app does. You can ignore this, but it will not help you to install the app in numbers we all need to make contact tracing useful.

burke commented 4 years ago

Of course security audits are important (and, if you peruse the closed issues on these repositories, you'll see quite a bit of activity of that nature).

To respond to a couple of specific points here though:

"Random identifiers" is—unfortunately—the user-facing language used by Apple and Google in consent prompts, which paints application implementors into the corner of having to mirror this same language.

Most of what you're asking for is actually the domain of the Exposure Notification frameworks in Android and iOS. While it's true that COVID Shield and COVID Alert would do well to explain this integrated system of app and framework holistically, the reason the early documentation doesn't have the information you're looking to find is that the framework parts are extensively documented in a more global way (example, example). The truly novel parts of the app and server are few and far between; the mobile application is, to a larger extent than one would initially guess, just a pretty wrapper around a small handful of framework calls. The One-Time-Code system for authenticating diagnoses is really the only major concept not defined by Apple and Google.

But, if confirmation from the original designers of this protocol matters, consider this it: Ten minute expiry was my choice and I'm confident that bumping it to 24 hours alongside the substantial increase in keyspace size is an improvement.

cds-snc / covid-alert-documentation

App/server overview is not sufficient to address privacy conerns #10