epiverse-trace / linelist

R package for handling linelist data
https://epiverse-trace.github.io/linelist/

Anonymisation and anonymity testing #75

Open sbfnk opened 1 year ago

sbfnk commented 1 year ago

A fundamental barrier to sharing linelist data for further analysis/processing is the risk of identifying individuals, which has substantial ethical and, potentially, legal implications. I wonder if linelist could help mitigate this risk by providing tools that help users ensure that none of the data is identifiable.

I can see two potential functions that linelist could provide:

  1. A function to assess the re-identification risk, e.g. by calculating the k-anonymity of the dataset
  2. Some support to reduce re-identification risk, e.g. by replacing a column or set of columns with a unique identifier.
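
To make the two ideas above concrete, here is a minimal, language-agnostic sketch written in Python with only the standard library. It is not part of linelist or any existing package: the function names `k_anonymity()` and `pseudonymise()`, the column names, and the records are all made up for illustration. It assumes k-anonymity is taken as the size of the smallest group of records sharing the same combination of quasi-identifier values.

```python
import secrets
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share the
    same combination of values in the quasi-identifier columns."""
    if not records:
        raise ValueError("k-anonymity is undefined for an empty dataset")
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in records
    )
    return min(groups.values())


def pseudonymise(records, columns, id_name="pseudo_id"):
    """Drop `columns` and add one random identifier per distinct
    combination of their values (hypothetical helper)."""
    ids = {}
    result = []
    for row in records:
        key = tuple(row[col] for col in columns)
        if key not in ids:
            # Random token, so the identifier cannot be derived from the values
            ids[key] = secrets.token_hex(8)
        new_row = {k: v for k, v in row.items() if k not in columns}
        new_row[id_name] = ids[key]
        result.append(new_row)
    return result


# Made-up records, for illustration only
cases = [
    {"name": "case A", "age_group": "20-29", "district": "North"},
    {"name": "case B", "age_group": "20-29", "district": "North"},
    {"name": "case C", "age_group": "20-29", "district": "North"},
    {"name": "case D", "age_group": "30-39", "district": "South"},
    {"name": "case E", "age_group": "30-39", "district": "South"},
]

k = k_anonymity(cases, ["age_group", "district"])  # smallest group has 2 records
shareable = pseudonymise(cases, ["name"])
```

Note that replacing a column with an identifier removes the direct identifier but leaves the k-anonymity of the remaining quasi-identifiers unchanged, so the two functions would probably need to be used together.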

Bisaloo commented 1 year ago

Thanks for starting this conversation!

I think this will be partly addressed by the upcoming Privacy Enhancing Techniques challenge. The data type will be slightly different but the methods can probably be used here as well.

In terms of scope, I believe this should live in a different package. The linelist package should only define the linelist object format, and the basic methods to manipulate it. Any other complex operation on linelist objects should likely live in a separate package.

Bisaloo commented 1 year ago

Useful related resource: https://osf.io/xpj38/

Bisaloo commented 9 months ago

Thanks again for the suggestion, but I've thought about this more and I'm convinced this is outside the scope of linelist. I'm happy to collaborate on a separate package that would be interoperable with linelist and focus on anonymisation.

Some existing resources have been shared in https://github.com/WHO-Collaboratory/collaboratory-epipipeline-community/discussions/12 but please do open a new thread in the discussion board if you believe there are still gaps in the ecosystem or that it would be worthwhile to provide an alternative.

Bisaloo commented 1 month ago

Thinking again about this, and based on the feedback during the DPGA submission, I think there would be value in adding a couple of extra lines to make_linelist() to warn users if they are working with data that carry a re-identification risk (via k-anonymity testing).

I still think anonymisation is out of scope, but a warning would be nice.
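
For illustration, the kind of check described above could look roughly like the sketch below. It is written in Python with the standard library only, to show the logic rather than the actual make_linelist() code, which is R. The function name `check_reidentification_risk()`, the `k_min` threshold of 5, and the example records are assumptions, not something the package defines.

```python
import warnings
from collections import Counter


def check_reidentification_risk(records, quasi_identifiers, k_min=5):
    """Warn if any combination of quasi-identifier values is shared by
    fewer than `k_min` records. Returns the observed k (0 if no records)."""
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in records
    )
    k = min(groups.values(), default=0)
    if records and k < k_min:
        # Only warn: the data are left untouched, anonymisation stays out of scope
        warnings.warn(
            f"Some records share their {quasi_identifiers} values with fewer "
            f"than {k_min} records (k = {k}); they may be re-identifiable.",
            UserWarning,
            stacklevel=2,
        )
    return k


# Made-up records: the single "South" case is unique, so k is 1
cases = [
    {"age_group": "20-29", "district": "North"},
    {"age_group": "20-29", "district": "North"},
    {"age_group": "30-39", "district": "South"},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    observed_k = check_reidentification_risk(cases, ["age_group", "district"])
```

The check only reports the risk and returns the observed k; what to do about it is left to the user or to a separate anonymisation package.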