Cross operations privacy accounting

LaRiffle commented 2 years ago

Problem

The differential privacy budget is tied to an individual's data, which may be used by several analysts across several studies, each involving several functions. Hence, global accounting of the DP budget must be put in place. In hospitals, pseudonymized datasets should refer to patients using a unique set of pseudonym ids. We can therefore rely on this pseudonym to account for the total budget spent for a patient.

Description

A basic yet acceptable accounting method, when using simple DP functions based on Laplace noise, is to sum the budgets of all queries that used the data of a particular individual/patient, which gives the total budget spent for this individual. While we should arguably not sum the budgets of two queries made on distinct attributes of an individual (say the age and the weight), it is probably simpler not to make this distinction at first, because those attributes are not independent, and measuring correlations would raise quite a few issues.
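As a rough illustration of the summation described above, here is a minimal Python sketch. The function name, the `(epsilon, patient_ids)` input shape and the example values are hypothetical, chosen only for the illustration; nothing here is part of dsPrivacy.

```python
from collections import defaultdict

def total_budget_per_patient(queries):
    """Sum epsilon over every query that touched each patient.

    `queries` is an iterable of (epsilon, patient_ids) pairs. Attributes are
    deliberately ignored: any query touching a patient counts in full.
    """
    spent = defaultdict(float)
    for epsilon, patient_ids in queries:
        for pid in set(patient_ids):  # a patient is charged once per query
            spent[pid] += epsilon
    return dict(spent)

# Hypothetical example: two queries with overlapping patients.
queries = [
    (0.5, ["p1", "p2"]),   # e.g. a DP sum over age
    (0.25, ["p2", "p3"]),  # e.g. a DP mean over weight
]
spent = total_budget_per_patient(queries)
```

Here `p2` appears in both queries, so it is charged for both even though they target different attributes.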

As a consequence, each hospital should have a single accounting server, which records for each pseudonym the global DP budget spent, as the sum of the budgets of the individual queries. For traceability purposes, it would be useful to store not only the budget but also the query name (e.g. "sumDP"), the analyst id, the date, and the workspace id (each analyst can theoretically have several workspaces if they work simultaneously on several studies). This would also make it possible to set and track a budget not only per patient but also per analyst and/or per study, to prevent one analyst from spending a patient's entire privacy budget.

For each query execution, the server-side function should request permission from the accounting server, providing all the information above, and should receive a per-patient approval/denial (because some patients may already have spent all their budget while others have not). There should be an option to either 1) abort if even a single patient was opted out by the accounting server, or 2) run the query only on the approved patients (which could lead to some bias).
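The permission step and its two modes could look roughly like the sketch below. It assumes the accounting server keeps in-memory dictionaries of limits and spent budget per pseudonym; `Decision`, `request_permission` and `select_patients` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    approved: list  # patients with enough budget left for this query
    denied: list    # patients opted out by the accounting server

def request_permission(spent, limits, patient_ids, epsilon):
    """Per-patient approval/denial for a query costing `epsilon`."""
    approved, denied = [], []
    for pid in patient_ids:
        remaining = limits[pid] - spent.get(pid, 0.0)
        (approved if epsilon <= remaining else denied).append(pid)
    return Decision(approved, denied)

def select_patients(decision, abort_on_denial=True):
    """Option 1: abort on any denial. Option 2: keep approved patients only."""
    if decision.denied and abort_on_denial:
        raise PermissionError(
            f"budget exhausted for {len(decision.denied)} patient(s)"
        )
    return decision.approved
```

With `abort_on_denial=False` the query would run on a reduced cohort, which is where the bias mentioned above comes from.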

LaRiffle commented 2 years ago

@Jasopaum @naudinlo I've added some details on this issue, what do you think?

naudinlo commented 2 years ago

To formalise: if we want to keep track of how much budget was spent for each patient, we should maintain a table with the budget spent by each query for each patient. Analysts could then only query patients whose budget is not exhausted.

For instance in the case of a total budget of 100, here patient_3 could not be a part of a query anymore:

We could go further by differentiating the budget per required attribute; that way, when new data arrives for a patient, we can take it into account even if the budget for this patient has been spent.

Regarding your question whether to run the query with patients opted out or abort the query, I feel it is more of a medical opinion, as indeed it could introduce some bias.

LaRiffle commented 2 years ago

Hmmm if we're talking about data model, we could opt for something like this:

TRANSACTION (TABLE)
id
patient_id
analyst_id
study_id
date
epsilon
operation_name
operation_args   # all extra information from the serialized call, depending on how it's available with DS

where study_id might actually not exist for the moment, in the sense that I guess we will create a separate DS account for each study, even for the same analyst, to make sure analysts don't have a workspace with several studies.

Some tables for the specific limits:

PATIENT_LIMIT
patient_id
limit # the maximum budget that should ever be spent

same with ANALYST and STUDY.

And perhaps a view for a global overview of the remaining budget per patient

PATIENT_BALANCE (VIEW)
patient_id
budget_spent AS SELECT SUM(epsilon) FROM TRANSACTION GROUP BY patient_id # or something like this

And possibly a direct lookup on TRANSACTION to know, for a given study or analyst, how much budget was spent
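To make the data model above concrete, here is a sketch using Python's built-in `sqlite3` as a stand-in for whatever database ends up being used. Two names are changed on purpose, as assumptions of this sketch: the table is called `dp_transaction` and the limit column `epsilon_limit`, because `TRANSACTION` and `LIMIT` are SQL keywords. The column types and the sample rows are also made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dp_transaction (
    id             INTEGER PRIMARY KEY,
    patient_id     TEXT NOT NULL,
    analyst_id     TEXT NOT NULL,
    study_id       TEXT,            -- may not exist at first
    date           TEXT NOT NULL,
    epsilon        REAL NOT NULL,
    operation_name TEXT NOT NULL,
    operation_args TEXT             -- serialized call arguments
);
CREATE TABLE patient_limit (
    patient_id    TEXT PRIMARY KEY,
    epsilon_limit REAL NOT NULL     -- maximum budget that should ever be spent
);
CREATE VIEW patient_balance AS
SELECT l.patient_id,
       COALESCE(SUM(t.epsilon), 0)                   AS budget_spent,
       l.epsilon_limit - COALESCE(SUM(t.epsilon), 0) AS budget_remaining
FROM patient_limit AS l
LEFT JOIN dp_transaction AS t ON t.patient_id = l.patient_id
GROUP BY l.patient_id, l.epsilon_limit;
""")

# Made-up sample data: a limit of 100 per patient and three queries.
conn.executemany("INSERT INTO patient_limit VALUES (?, ?)",
                 [("p1", 100.0), ("p2", 100.0)])
conn.executemany(
    "INSERT INTO dp_transaction "
    "(patient_id, analyst_id, study_id, date, epsilon, operation_name) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [("p1", "alice", "s1", "2021-01-01", 0.5, "sumDP"),
     ("p1", "bob", "s2", "2021-01-02", 0.25, "sumDP"),
     ("p2", "alice", "s1", "2021-01-01", 0.5, "sumDP")],
)

balances = conn.execute(
    "SELECT * FROM patient_balance ORDER BY patient_id").fetchall()
# Direct lookup on the transaction table: budget spent per analyst.
per_analyst = conn.execute(
    "SELECT analyst_id, SUM(epsilon) FROM dp_transaction "
    "GROUP BY analyst_id ORDER BY analyst_id").fetchall()
```

The same `GROUP BY` lookup works per study by swapping `analyst_id` for `study_id`.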

RonanMorgan commented 2 years ago

Each of the following assertions should be checked :)

Main Constraints :

a query cannot be done if the budget used for this query is greater than the "global budget" remaining
a query cannot be done if the budget used for this query is greater than the "study budget" remaining
a query cannot be done if the budget used for this query is greater than the remaining "personal budget" of any patient included in the dataset
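Reading these constraints as "a query is refused when its cost exceeds a remaining budget", the check could be sketched as follows. The function name and argument shapes are hypothetical; this only illustrates the three comparisons.

```python
def can_run(epsilon, global_remaining, study_remaining, patient_remaining):
    """Check a query costing `epsilon` against the three budget levels.

    `patient_remaining` maps each patient id in the targeted dataset to the
    personal budget that patient has left. Returns (ok, reasons).
    """
    reasons = []
    if epsilon > global_remaining:
        reasons.append("global budget")
    if epsilon > study_remaining:
        reasons.append("study budget")
    blocked = sorted(pid for pid, rem in patient_remaining.items()
                     if epsilon > rem)
    if blocked:
        reasons.append("personal budget: " + ", ".join(blocked))
    return (not reasons, reasons)
```

Returning the list of reasons rather than a bare boolean lets the server send a meaningful error message to the data scientist.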

Main functional features :

the data scientists should be able to access the remaining "global budget" and "study budgets"

Optional features :

the data scientist should be able to access the level of noise produced by the budget used for a query (or even ask only for the noise level without running the query at all)
if the exact same query is launched twice, should we return the exact same result? (if yes, let's keep the queries somewhere ...)
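On the first optional feature (exposing the noise level), for functions based on Laplace noise the noise can be described without touching the data at all, since it only depends on the query sensitivity and on epsilon. A small sketch, assuming the standard Laplace mechanism with scale b = sensitivity / epsilon; the function names are hypothetical.

```python
import math

def laplace_noise_level(sensitivity, epsilon):
    """Scale and standard deviation of Laplace noise for a given budget."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon          # b parameter of Laplace(0, b)
    return {"scale": scale, "std": math.sqrt(2) * scale}

def laplace_error_bound(sensitivity, epsilon, confidence=0.95):
    """Bound t such that |noise| <= t with probability `confidence`.

    For Laplace(0, b), P(|X| > t) = exp(-t / b), hence t = -b * ln(1 - confidence).
    """
    scale = sensitivity / epsilon
    return -scale * math.log(1 - confidence)
```

For example, `laplace_noise_level(1.0, 0.5)` gives a scale of 2.0, so halving epsilon doubles the expected noise.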

Technical notes :

each query has to be linked to a study
a study is equal to a "project" in datashield
a dataset is attached to a study
the dataset targeted by a query can be a subset of the "study dataset", which is itself a subset of the "Darah dataset": should we keep track of each patientId targeted by each query, as suggested by @LaRiffle? Is having the query + the "studyId" not enough?
in order to load data we have used "view" in opal's projects. We cannot load data automatically into opal. We could try a complex architecture with an external postgresql, but I think we should try to load the "Darah dataset" into opal as a first step, to see if it's easy to create subsets attached to the project. If so, we should be able to use the Variables and Entities to manage the state thanks to the library (we also need to check whether we can easily access these variables and data through the library).
queries can be run in parallel. Should we change the isolation level?
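On the parallel-queries point: the risk is that two queries both read the same remaining budget and both get accepted. One way to avoid it is to make "check the balance, then record the spend" a single atomic step. The sketch below uses Python's `sqlite3` purely as a stand-in (the `spend` table and `try_spend` function are hypothetical); with PostgreSQL one would presumably look at a stricter isolation level or row locking instead.

```python
import sqlite3

def open_ledger(path=":memory:"):
    # isolation_level=None: autocommit mode, transactions are managed by hand.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spend (study_id TEXT, epsilon REAL)")
    return conn

def try_spend(conn, study_id, epsilon, limit):
    """Atomically check the remaining budget and record the spend."""
    # BEGIN IMMEDIATE takes the write lock before the balance is read, so a
    # concurrent writer cannot slip in between the check and the insert.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (spent,) = conn.execute(
            "SELECT COALESCE(SUM(epsilon), 0) FROM spend WHERE study_id = ?",
            (study_id,),
        ).fetchone()
        if spent + epsilon > limit:
            conn.execute("ROLLBACK")
            return False
        conn.execute("INSERT INTO spend VALUES (?, ?)", (study_id, epsilon))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise

conn = open_ledger()
results = [try_spend(conn, "s1", 0.5, limit=1.0) for _ in range(3)]
```

The third call is refused because the first two already used the whole limit of 1.0.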

Workflow:

should the datascientist have one notebook per study / project ?
do they need versioning ?

LaRiffle commented 2 years ago

Discussion with @Jasopaum

We accept that we set the privacy budget globally per dataset per study.

We need a dataset with one line per patient. => is it realistic?? So a dataset is a table.

Analyst :

1 VPN
1 Rstudio account (+ linux) per study
1 Opal Account per study => one project with one dataset=table

1 opal account determines the dataset used and the study => we can account privacy directly on the opal account.

Task division:

Can we find, on the server part of the package, the opal account that made the request, and whether it is an admin? If not, a custom user_id should be added to the arguments. (This needs a correspondence table; who manages it?)
Budget checks on request: Make the management of the budget directly through a dedicated DS package (or part of dsPrivacy actually)
- Leverage Opal DB to store our transaction / privacy limits
- For each function, make a permission check, and log the transaction when accepted, send an error msg when no budget left
- have a function to ask how much budget left
budget allocation: Write operations to the DB are done by a user, the hospital IT manager, who creates a study through the app (i.e. a project with one dataset=table, and the associated accounts), and then uses the client side of the package to allocate some budget for this study, i.e. for some user accounts. If the server part can tell that the request comes from an Opal "admin" account, then Opal directly provides the permission layer. If not, we need a correspondence table for admins; who manages it?
Admin traceability: Write operations should be traced, and traces should be non-modifiable (allocating budget can be seen as a positive transaction, whereas queries are negative transactions)
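The "positive / negative transaction" idea with non-modifiable traces could be sketched as an append-only ledger. Again `sqlite3` is only a stand-in here, and the `ledger` table, its columns and the sample rows are hypothetical. The triggers reject any UPDATE or DELETE, so the balance is always the sum of all recorded amounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ledger (
    id      INTEGER PRIMARY KEY,
    account TEXT NOT NULL,   -- the opal account, i.e. one study
    actor   TEXT NOT NULL,   -- who performed the write
    amount  REAL NOT NULL,   -- > 0: allocation, < 0: query spend
    note    TEXT
);
CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger
BEGIN SELECT RAISE(ABORT, 'ledger is append-only'); END;
CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger
BEGIN SELECT RAISE(ABORT, 'ledger is append-only'); END;
""")

rows = [
    ("study_a", "it_manager", 1.0, "initial allocation"),
    ("study_a", "analyst_1", -0.25, "sumDP"),
    ("study_a", "analyst_1", -0.5, "sumDP"),
]
conn.executemany(
    "INSERT INTO ledger (account, actor, amount, note) VALUES (?, ?, ?, ?)",
    rows)

def balance(conn, account):
    """Remaining budget = allocations minus spends."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = ?",
        (account,)).fetchone()
    return total

try:
    conn.execute("UPDATE ledger SET amount = 0 WHERE id = 2")
    tamper_blocked = False
except sqlite3.DatabaseError:
    tamper_blocked = True
```

Triggers only protect against ordinary writes; someone with full rights on the database could still drop them, so this is not a substitute for proper access control.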

Using R allows us to intercept the main functions called directly, while Python might only give access to sub-functions when they exist. => depends on how hard it is to do in R.

RonanMorgan commented 2 years ago

1 Rstudio account (+ linux) per study
1 Opal Account per study => one project with one dataset=table

I am really surprised by these constraints, what is the link with your conclusions ?

LaRiffle commented 2 years ago

@RonanMorgan this is a simplification that we have chosen to help us, it's not a constraint, it helps us to have "1 opal account determines the dataset used and the study => we can account privacy directly on the opal account."

MiskoG commented 2 years ago

✋ on hold issue, I'll check with the darah project team (cc @ogirardot)

arkhn / dsPrivacy

Cross operations privacy accounting #11
