Contact Tracing Using Semantic Graphs and Darknet Anonymization

Karla-Kolumna-333 commented 3 years ago

During a pandemic, contact tracing forms the basis for the work of health authorities. I suggest a method for using the Corona-Warn-App for efficient contact tracing without compromising the protection of personal data. For this we can use semantic graphs, which are currently experiencing a revival in Artificial Intelligence research. Technically, the core of the proposal is simple to implement. The greater challenge is data protection, which can be achieved by combining strict pseudonymization with encrypted communication through an anonymization network.

Semantic Graphs

The Corona-Warn-App uses Bluetooth to detect whether another mobile device is in its immediate vicinity. Each detection yields an atomic, isolated fragment of information of the form 'device A' 'approaches' 'device B'. A further encounter may then yield another fragment, 'device B' 'approaches' 'device C'. Together, devices A, B and C form a contact chain.

At first, these information fragments exist only in isolation within the individual apps. Provided that each app sends its fragments to a central system, we can combine them, reconstruct contact chains and thus determine potential infection chains.
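
A minimal sketch of this merging step in plain Python, not an actual triplestore. The names `fragments`, `merge_fragments` and `contact_chains` are made up for illustration, and treating 'approaches' as a symmetric relation is an assumption of the sketch.

```python
from collections import defaultdict

# Each fragment is a (subject, predicate, object) triple as one app would report it.
fragments = [
    ("device A", "approaches", "device B"),
    ("device B", "approaches", "device C"),
    ("device D", "approaches", "device E"),
]

def merge_fragments(triples):
    """Merge isolated triples into one undirected contact graph."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate != "approaches":
            continue  # ignore any other kind of statement
        graph[subject].add(obj)
        graph[obj].add(subject)  # assumption: proximity is symmetric
    return graph

def contact_chains(graph):
    """Return the connected groups of devices, each as a sorted list."""
    seen = set()
    chains = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        chains.append(sorted(component))
    return chains

chains = contact_chains(merge_fragments(fragments))
print(chains)
```

Here A, B and C end up in one chain although no single app ever saw all three devices, while D and E form a separate chain.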

As I noted at the beginning, this poses a serious threat to the protection of personal data. However, I believe that this problem can be overcome with pseudonymization and anonymized communication, as I will elaborate later.

Specialized databases are available for linking these information fragments: they record the fragments and automatically merge them into coherent networks (graphs). These are so-called "triplestores", also known as "RDF stores", which are one kind of graph database.

With a simple query, such a database can tell which other persons someone had direct or indirect contact with.
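
To illustrate what such a query computes, here is a small sketch in plain Python using a breadth-first search. The function name `contacts_of` and the sample data are hypothetical. In a real RDF store the same question would more likely be expressed as a SPARQL query, for example with a property path.

```python
from collections import deque

def contacts_of(triples, person):
    """Return {other: hops} for everyone reachable from `person`.

    hops == 1 means direct contact, hops > 1 means indirect contact.
    Assumption: 'approaches' is treated as a symmetric relation.
    """
    neighbours = {}
    for subject, _predicate, obj in triples:
        neighbours.setdefault(subject, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subject)

    hops = {person: 0}
    queue = deque([person])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    del hops[person]  # the person is not their own contact
    return hops

triples = [
    ("A", "approaches", "B"),
    ("B", "approaches", "C"),
    ("D", "approaches", "E"),
]
result = contacts_of(triples, "A")
print(result)
```

For A this reports B as a direct contact (one hop) and C as an indirect contact (two hops). D and E do not appear because no chain connects them to A.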

An alternative to RDF stores are "labeled property graphs", as implemented for example by Neo4j. However, RDF stores also allow for automated reasoning, which is a core concept in symbolic AI for performing complex inferences on knowledge graphs. For example, GraphDB by Ontotext is an RDF store which integrates various reasoning options.

While a labeled property graph store is good for applying graph algorithms, using an RDF store will enable the implementation of AI applications in the long term.

Pseudonymization

We do not want government stakeholders or other third parties to abuse such a system for surveillance. Personal data can be protected by the following design.

1) Every app user creates a unique pseudonym.

2) Apps exchange these pseudonyms and send information fragments of the form 'pseudonym A' 'approaches' 'pseudonym B' to the graph store.

3) If a user receives a positive corona test result, she authorizes the app to report her pseudonym to the graph store.

4) Following 3), the pseudonym of the user who tested positive is classified as 'reported' in the graph store.

5) Following 4), the other pseudonyms in the contact chain are classified as 'affected'. In an RDF store this can be achieved with inference rules and automatic reasoning.

6) Every app regularly asks the system whether its pseudonym is classified as 'affected' in the graph store. The answer is simply 'yes' or 'no'.

7) When a user is told that he is classified as 'affected', he may proactively share his pseudonym with the responsible health authority.

8) The responsible authorities have read-only access to the graph store and may only look up pseudonyms.
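
Steps 1) to 6) can be sketched as follows. This is only an illustration under stated assumptions: `GraphStore` is a hypothetical in-memory stand-in for the central graph store, `new_pseudonym` is a made-up helper, and the 'affected' classification is computed by a graph search on demand rather than by RDF inference rules. The sketch also ignores the time and order of contacts and places no limit on chain length.

```python
import secrets

def new_pseudonym():
    """Step 1: a random pseudonym (32 hex characters) created on the device."""
    return secrets.token_hex(16)

class GraphStore:
    """Hypothetical in-memory stand-in for the central graph store."""

    def __init__(self):
        self.edges = {}        # pseudonym -> set of pseudonyms it approached
        self.reported = set()  # pseudonyms reported after a positive test

    def add_fragment(self, a, b):
        """Step 2: record 'pseudonym a' 'approaches' 'pseudonym b'."""
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def report(self, pseudonym):
        """Steps 3 and 4: classify a pseudonym as 'reported'."""
        self.reported.add(pseudonym)

    def is_affected(self, pseudonym):
        """Steps 5 and 6: yes/no answer for the asking app.

        Assumption: a pseudonym is 'affected' if it is not itself
        'reported' and a contact chain links it to a reported pseudonym.
        """
        if pseudonym in self.reported:
            return False
        seen = {pseudonym}
        stack = [pseudonym]
        while stack:
            current = stack.pop()
            for other in self.edges.get(current, ()):
                if other in self.reported:
                    return True
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        return False

store = GraphStore()
alice, bob, carol, dave = (new_pseudonym() for _ in range(4))
store.add_fragment(alice, bob)
store.add_fragment(bob, carol)
store.report(alice)

for name, p in [("bob", bob), ("carol", carol), ("dave", dave)]:
    print(name, "affected:", store.is_affected(p))
```

The store only ever sees random pseudonyms, and each app learns nothing beyond its own yes/no answer. Whether a pseudonym can be linked back to a device then depends on how the communication is protected.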

Anonymous Communication

Finally, we have to avoid at all costs that pseudonyms can be associated with mobile devices on the graph store side. In this sense it is a great advantage that, following the description above, any communication between the app and the central graph store is only ever initiated by the app. Based on this, we can protect the communication by passing it through an anonymization network. Of course, any communication must be end-to-end encrypted so that it cannot be read, either from inside or from outside the anonymization network.

An ideal candidate for an anonymization network is 'Tor' (often associated with the 'darknet') because it is not under the control of any single government stakeholder. Consequently, any authority depends on the users to share their pseudonyms voluntarily, and the users may or may not do so, depending on whether they trust their government.

Internal Tracking ID: EXPOSUREAPP-4658

Ein-Tim commented 3 years ago

I think you know this but (also for everybody else reading this):

It is not possible to introduce something like this with the current use of the GAEN (Google/Apple Exposure Notification) framework. Corona-Warn-App relies on the Exposure Notification Framework/System provided by Apple/Google. With this approach, no encounters are stored on a central server; the storage happens only on the user's smartphone (= decentralized).
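
A strongly simplified sketch of the decentralized idea, to make the contrast with a central graph store concrete. This is not the GAEN API: the classes `Phone` and `Server` and the fixed identifiers are invented for illustration, and the real framework uses rotating identifiers derived from cryptographic keys, which the sketch leaves out.

```python
class Phone:
    """Hypothetical device: encounters are stored locally only."""

    def __init__(self, own_id):
        self.own_id = own_id
        self.seen_ids = set()  # never leaves the device

    def observe(self, other_id):
        """Record an identifier received via Bluetooth."""
        self.seen_ids.add(other_id)

    def check_exposure(self, published_ids):
        """Match the downloaded identifiers against the local encounter log."""
        return bool(self.seen_ids & set(published_ids))

class Server:
    """Hypothetical server: it only distributes identifiers of positive users."""

    def __init__(self):
        self.published_ids = []

    def upload_positive(self, phone_id):
        self.published_ids.append(phone_id)

server = Server()
a, b, c = Phone("id-a"), Phone("id-b"), Phone("id-c")

# a and b meet; c meets nobody.
a.observe(b.own_id)
b.observe(a.own_id)

# b tests positive and voluntarily uploads its identifier.
server.upload_positive(b.own_id)

a_exposed = a.check_exposure(server.published_ids)
c_exposed = c.check_exposure(server.published_ids)
print(a_exposed, c_exposed)
```

The server never learns who met whom, so it has no edges from which a contact graph could be built. Each phone can only find out whether it met a positive user directly.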

To learn more about how exactly the current implementation works, go to coronawarn.app and scroll to the section "How does the app work?". If you want a more detailed explanation, take a look at this blog post and its linked lecture.

MikeMcC399 commented 3 years ago

Die Geschichte der Corona-Warn-App (The History of the Corona-Warn-App) with Lars Roemheld from the German Ministry of Health is a very interesting presentation from rC3, including a Q&A at the end. The question of central versus decentralized data storage, and the decision for the decentralized solution, is also discussed.

See https://pretalx.rc3.studio/rc3-channels-2020/talk/MKBK7C/ for details.

geisslet commented 3 years ago

FYI: there is a nice approach targeting this idea that is already live: https://www.novid.org/ (US only so far, and they don't do open source because of IP and cloning concerns).

What stunned me is the basic idea of "see the virus coming": a graph/network view that calculates the distance of impact based on your behavior, shown as a simplified visualization. Combined with different roles (User|Positive User|Symptomatic User|Vaccinated User|Exposed User) this is even more powerful, but harder within today's limits. If it's transparent, though, people will accept it (pretty sure). NOVID also uses no GPS (just Bluetooth, ultrasound, Wi-Fi), and I guess grouping (family|friends|work ..) is done by detecting contacts that repeat at certain times and for certain durations. I understand the dependency on GAEN and how hard it is to develop a stable app targeting different generations of phones and OSs, but I don't see why this should not work with a decentralized approach, as long as you can sort server-notified contacts into your local network of contacts. Plus visualization (there are data scientists at Bundesdruckerei who would love to join).

This could be a game changer for the acceptance of the app overall:

- there is some gamification in it (roles, times of higher caution (one could also think of badges)), which could increase acceptance and bring in some "fun" (there is no fun today, and the 3 people in their early 20s I know uninstalled it for lack of info and visible usefulness - what a pity)
- there is some proactivity here, because if the hits get closer, behavior could be adapted. Even areas of higher risk could be introduced (if GPS were a voluntary option), and not getting infected despite high risk could earn a weekly badge for good/effective self-protection (or luck) ...
- it doesn't clash with the idea of who owns the data, and it embraces the primary objective of the app

I guess it's not about lacking ideas and you didn't wait for my input, but I would really love to see the app become a real tool for the people (and a real, wide success), not only for Corona times (flu, malaria, upcoming pandemics ..). (Some) gamification and a proactive information benefit would help a lot .. but I guess you know .. (can't you start an EU-funded project with some university? //sigh)
