arkhn / dsPrivacy

A DataShield Differential Privacy Library
Apache License 2.0

Cross operations privacy accounting #11

Open LaRiffle opened 2 years ago

LaRiffle commented 2 years ago

Problem

The differential privacy budget is actually tied to the data of an individual, which may be used by several analysts across several studies, each involving several functions. Hence, a global accounting of the DP budget must be put in place. In hospitals, pseudonymized datasets should refer to patients using a unique set of pseudonym ids. We can therefore rely on this pseudonym to account for the total budget spent for a patient.

Description

A basic yet acceptable accounting method, when using simple DP functions based on Laplacian noise, is to sum the individual budgets of every query that used the data of a particular individual/patient, to compute the total budget spent for this individual. While we should probably not sum the budgets of 2 queries made on distinct attributes of an individual (say the age and the weight), it is probably simpler not to consider this distinction at first, because those attributes are not independent, and measuring correlations would raise quite a few issues.
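A minimal sketch of this summation rule, assuming each executed query is logged as a (pseudonym, epsilon) pair. The function name `total_budget_spent` and the sample values are hypothetical and not part of dsPrivacy; the sketch deliberately ignores which attribute was queried, as proposed above.

```python
from collections import defaultdict

def total_budget_spent(queries):
    """Sum the epsilon of every query per pseudonym (simple additive accounting)."""
    spent = defaultdict(float)
    for pseudonym, epsilon in queries:
        spent[pseudonym] += epsilon
    return dict(spent)

# Hypothetical query log: (pseudonym, epsilon spent by the query)
queries = [
    ("patient_1", 0.5),   # e.g. a query touching the age
    ("patient_1", 0.25),  # e.g. a query touching the weight, still summed
    ("patient_2", 0.5),
]
spent = total_budget_spent(queries)
print(spent)
```

Summing across attributes is conservative: it may overestimate the budget actually consumed, but it avoids having to reason about correlations between attributes.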

As a consequence, we should have, per hospital, a single accounting server which records for each pseudonym the global DP budget spent, as the sum of the budgets of the individual queries. For traceability purposes, it would be useful to store not only the budget, but also the query name (e.g. "sumDP"), the analyst id, the date, and the workspace id (each analyst can theoretically have several workspaces if they work simultaneously on several studies). This would also make it possible to set and track a budget not only per patient but also per analyst and/or per study, to prevent one analyst from spending the entire privacy budget of a patient.

For each query execution, the server-side function should request permission from the accounting server, providing all the information above, and should receive a per-patient approval/denial (because some patients may have already spent all their budget while others have not). There should be an option to either 1) abort if even a single patient was opted out by the accounting server, or 2) run the query only on the approved patients (which could lead to some bias).
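The permission step could be sketched as follows. This is only an illustration under assumptions: the function `authorize`, its parameters, and the single shared `limit` are hypothetical, and a real accounting server would also record the analyst, study, and operation before approving.

```python
def authorize(requested_epsilon, patient_ids, spent, limit, abort_on_denial=True):
    """Return (approved, denied) patient ids for a query costing requested_epsilon.

    spent maps pseudonym -> budget already spent; limit is the per-patient cap.
    With abort_on_denial=True, a single denial rejects the whole query.
    """
    approved, denied = [], []
    for pid in patient_ids:
        if spent.get(pid, 0.0) + requested_epsilon <= limit:
            approved.append(pid)
        else:
            denied.append(pid)
    if denied and abort_on_denial:
        return [], denied  # option 1: abort, nobody is queried
    return approved, denied  # option 2: run on approved patients only

# Hypothetical state of the accounting server
spent = {"p1": 0.5, "p2": 1.0}
patients = ["p1", "p2", "p3"]

strict = authorize(0.25, patients, spent, limit=1.0, abort_on_denial=True)
lenient = authorize(0.25, patients, spent, limit=1.0, abort_on_denial=False)
print(strict, lenient)
```

In the second mode the caller receives the list of denied patients as well, so the analyst can at least be told how many individuals were excluded and judge the potential bias.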

LaRiffle commented 2 years ago

@Jasopaum @naudinlo I've added some details to this issue, what do you think?

naudinlo commented 2 years ago

To formalise: if we want to keep track of how much budget was spent for each patient, we should maintain a table with the budget spent for each query for each patient. Analysts could only run queries on patients whose budget is not exhausted.

For instance in the case of a total budget of 100, here patient_3 could not be a part of a query anymore:

[Image: Screenshot 2022-09-14 at 15 14 20]

We could go further by differentiating the budget per required attribute; that way, when new data arrives for a patient, we can take it into account even if the budget for this patient has been spent.

Regarding your question of whether to run the query without the opted-out patients or to abort the query, I feel it is more of a medical decision, since it could indeed introduce some bias.

LaRiffle commented 2 years ago

Hmmm, if we're talking about a data model, we could opt for something like this:

TRANSACTION (TABLE)
id
patient_id
analyst_id
study_id
date
epsilon
operation_name
operation_args   # all extra information from the serialized call, depending on how it's available with DS

where actually study_id might not exist for the moment, in the sense that I guess we will create a separate DS account for each study, even for the same analyst, to make sure analysts don't have a workspace with several studies.

Some tables for the specific limits:

PATIENT_LIMIT
patient_id
limit # the maximum budget that should ever be spent

same with ANALYST and STUDY.

And perhaps a view for a global overview of the remaining budget per patient

PATIENT_BALANCE (VIEW)
patient_id
budget_spent AS SELECT SUM(epsilon) FROM TRANSACTION GROUP BY patient_id # or something like this

And possibly a direct lookup on TRANSACTION to know, for a given study or analyst, how much budget was spent
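As a rough sketch, the data model above could be tried out with Python's built-in `sqlite3` module. Some names here are assumptions made for the example: the table is called `dp_transaction` and the limit column `epsilon_limit` because `TRANSACTION` and `LIMIT` are SQL keywords, the column types are guesses, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dp_transaction (
    id INTEGER PRIMARY KEY,
    patient_id TEXT NOT NULL,
    analyst_id TEXT NOT NULL,
    study_id TEXT,              -- may not exist yet, see above
    date TEXT NOT NULL,
    epsilon REAL NOT NULL,
    operation_name TEXT NOT NULL,
    operation_args TEXT         -- serialized extra call information
);
CREATE TABLE patient_limit (
    patient_id TEXT PRIMARY KEY,
    epsilon_limit REAL NOT NULL -- maximum budget that should ever be spent
);
-- Budget spent so far, one row per patient
CREATE VIEW patient_balance AS
SELECT patient_id, SUM(epsilon) AS budget_spent
FROM dp_transaction
GROUP BY patient_id;
""")

# Hypothetical transactions, one row per (query, patient)
rows = [
    ("patient_1", "analyst_a", None, "2022-09-14", 0.5, "sumDP", "{}"),
    ("patient_1", "analyst_b", None, "2022-09-15", 0.25, "sumDP", "{}"),
    ("patient_2", "analyst_a", None, "2022-09-14", 0.5, "sumDP", "{}"),
]
conn.executemany(
    "INSERT INTO dp_transaction "
    "(patient_id, analyst_id, study_id, date, epsilon, operation_name, operation_args) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

per_patient = conn.execute(
    "SELECT patient_id, budget_spent FROM patient_balance ORDER BY patient_id"
).fetchall()

# Direct lookup on the transaction table: budget spent per analyst
per_analyst = conn.execute(
    "SELECT analyst_id, SUM(epsilon) FROM dp_transaction "
    "GROUP BY analyst_id ORDER BY analyst_id"
).fetchall()
print(per_patient, per_analyst)
```

Storing one row per query and per patient keeps the view trivial, at the cost of a table that grows with the number of patients touched by each query.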

RonanMorgan commented 2 years ago

Each of the following assertions should be checked :)

Main Constraints :

Main functional features :

Optional features :

Technical notes :

Workflow:

LaRiffle commented 2 years ago

Discussion with @Jasopaum

We accept that we set the privacy budget globally per dataset per study.

We need a dataset with one line per patient. => is it realistic?? So a dataset is a table.

Analyst :

1 opal account determines the dataset used and the study => we can account privacy directly on the opal account.

Task division:

Using R allows us to directly catch the main functions called, while Python might only give access to sub-functions when they exist. => depends on how hard it is to do in R.

RonanMorgan commented 2 years ago

I am really surprised by these constraints, what is the link with your conclusions?

LaRiffle commented 2 years ago

@RonanMorgan this is a simplification that we have chosen to help us, it's not a constraint, it helps us to have "1 opal account determines the dataset used and the study => we can account privacy directly on the opal account."

MiskoG commented 2 years ago

✋ on hold issue, I'll check with the darah project team (cc @ogirardot)