Data Anonymization and Pseudonymization

robertmand commented 3 years ago

What topic do you wish to add? This page gives definitions of these terms and suggestions on how to achieve anonymization and pseudonymization of data.

Are there existing pages in the RDM toolkit website related to the requested page? Pages around human sensitive data and GDPR.

Resources If there are resources that could be utilised for writing the new page, please list them below:

Context If this request is coming from a particular project, domain, or use-case please list them below: A couple of us wrote this at a previous contentathon in Google Docs, and forgot to tell people it was there. So I'm putting it in now.

Here is the text:

Description Data anonymization is the process of irreversibly modifying personal data in such a way that subjects cannot be identified directly or indirectly by anyone, including the study team. If data are anonymized, no one can link data back to the subject.

Pseudonymization is a process where identifying fields within data records are replaced by artificial identifiers called pseudonyms or pseudonymized IDs. Pseudonymization ensures no one can link data back to the subject, apart from nominated members of the study team, who are able to link pseudonyms to identifying records such as name and address.

Data anonymization involves modifying a dataset so that it is impossible to identify a subject from their data. Pseudonymization involves replacing identifying data with artificial IDs, for example, replacing a healthcare record ID with an internal participant ID only known to a named clinician working in the study.
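The replacement of identifying data with artificial IDs could be sketched roughly as follows. This is a minimal illustration only, assuming records are plain dictionaries; the function name `pseudonymize`, the field names, and the `participant_id` key are hypothetical and not taken from any particular tool.

```python
import secrets


def pseudonymize(records, identifying_fields):
    """Split records into shareable rows and a separately held key table.

    `identifying_fields` and the "participant_id" key are illustrative
    names chosen for this sketch, not part of any standard.
    """
    key_table = {}  # pseudonym -> identifying values; store apart from the shared data
    shareable = []
    for record in records:
        # Draw a random pseudonym that reveals nothing about the subject.
        pseudonym = "P-" + secrets.token_hex(8)
        while pseudonym in key_table:  # extremely unlikely, but avoid reuse
            pseudonym = "P-" + secrets.token_hex(8)
        key_table[pseudonym] = {f: record[f] for f in identifying_fields}
        # Keep only the non-identifying fields in the row that will be shared.
        row = {k: v for k, v in record.items() if k not in identifying_fields}
        row["participant_id"] = pseudonym
        shareable.append(row)
    return shareable, key_table


# Made-up example records.
records = [
    {"healthcare_id": "H-1001", "name": "Ada Example", "diagnosis": "asthma"},
    {"healthcare_id": "H-1002", "name": "Ben Example", "diagnosis": "flu"},
]
shared, key_table = pseudonymize(records, ["healthcare_id", "name"])
```

In this sketch only `shared` would be released, while `key_table` stays with the nominated member of the study team. The remaining fields may still identify someone indirectly, so this step alone does not anonymize the data.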

Considerations

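Both anonymization and pseudonymization are approaches recognised by the GDPR, but they are treated differently. Truly anonymized data fall outside the scope of the GDPR, whereas pseudonymized data are still personal data and remain subject to it.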
Simply removing identifiers cannot guarantee data anonymity. A dataset may contain unique traits or patterns that could identify individuals. For example, two seemingly unrelated attributes, such as the occurrence of a rare disease and the country of residence, can identify a person if there is only a single case of that disease in that country.
Data that are anonymous today may not remain anonymous in the future. Later datasets about the same individuals may disclose their identity when combined with the existing data.
Anonymization techniques can damage the statistical properties of the data, for example when each participant's exact age is replaced by an age range.
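Two of the considerations above can be illustrated with a short sketch: finding attribute combinations that occur only once in a dataset, and coarsening an exact age into a range. The function names, the choice of attributes, and the 10-year width are assumptions made for this example.

```python
from collections import Counter


def unique_combinations(records, attributes):
    """Return attribute-value combinations that occur exactly once."""
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return [combo for combo, n in counts.items() if n == 1]


def age_to_range(age, width=10):
    """Replace an exact age with a range label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Made-up records with no direct identifiers left in them.
records = [
    {"disease": "rare disease X", "country": "Country A", "age": 37},
    {"disease": "flu", "country": "Country A", "age": 41},
    {"disease": "flu", "country": "Country A", "age": 45},
]

risky = unique_combinations(records, ["disease", "country"])
ranges = [age_to_range(r["age"]) for r in records]
```

Here `risky` contains the single rare-disease case, which could point to one person even though no name is present. The age ranges lower that risk at the cost of precision: the two flu cases become indistinguishable by age, and statistics such as the exact mean age can no longer be computed.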

Solutions

An example of pseudonymization is where participants in a study are assigned a non-identifying ID and all identifying data (such as name and address) are removed from the metadata to be shared. The mapping of this ID to personal data is held separately and securely by a named researcher who will not share this data.
There are well-established data anonymization approaches, such as k-anonymity, l-diversity, and differential privacy.
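As a rough illustration of two of the approaches named above, the following sketch measures k-anonymity and the simplest ("distinct") variant of l-diversity for a small table. It only checks these properties and does not transform the data; the function names and the choice of quasi-identifiers are assumptions made for this example.

```python
from collections import defaultdict


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("records must not be empty")
    groups = defaultdict(int)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)] += 1
    return min(groups.values())


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values within any group."""
    if not records:
        raise ValueError("records must not be empty")
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())


# Made-up, already generalized records.
table = [
    {"age_range": "30-39", "country": "Country A", "disease": "flu"},
    {"age_range": "30-39", "country": "Country A", "disease": "asthma"},
    {"age_range": "40-49", "country": "Country A", "disease": "flu"},
    {"age_range": "40-49", "country": "Country A", "disease": "flu"},
]

k = k_anonymity(table, ["age_range", "country"])
l = distinct_l_diversity(table, ["age_range", "country"], "disease")
```

In this made-up table every combination of age range and country is shared by two records, so `k` is 2. However, `l` is 1 because everyone in the 40-49 group has the same disease, so knowing that a person is in that group reveals their diagnosis. This is the kind of gap that l-diversity is meant to expose.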

Relevant tools and resources

Amnesia

Thanasis Vergoulis vergoulis@athenarc.gr Robert Andrews andrewsr9@cardiff.ac.uk

pinarpink commented 3 years ago

IMO this content can initially go to the Data Classification page. Perhaps we might amend the page title to 'Data Classification and De-identification'. What say you @bedroesb @floradanna ?

floradanna commented 3 years ago

Yes, it could make sense. Data Classification so far has only 1 sub-problem (how to figure out if your data are sensitive or not). Maybe a second sub-problem could be "how to achieve anonymization and pseudonymization of sensitive data".

bedroesb commented 3 years ago

do we need a new / different tag ?

floradanna commented 3 years ago

if the page is the same, I would not use an additional tag. It could complicate things. We better make use of keywords in this case.

jmenglund commented 3 years ago

I agree with @pinarpink that the Data Classification page is currently the best place for the text. When adding the problem to that page, it is probably a good idea to also take a look at the other problem on that page, "Is my data sensitive?". Some of the bullets under considerations touch upon the same topic.

elixir-europe / rdmkit

Data Anonymization and Pseudonymization #339