arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks and it supports well-known privacy models, such as k-anonymity, l-diversity, t-closeness and differential privacy.
http://arx.deidentifier.org/
Apache License 2.0

Parallelised or distributed version #26

Open · Scarlethue opened this issue 9 years ago

Scarlethue commented 9 years ago

I am looking at searching and anonymising data with a large number of records (at least 10 million). One of the use cases is horizontally integrating results from multiple locations without sharing the raw data. While the Flash implementation is very fast, at the moment it does not appear to be parallelised for large local datasets or distributable across partitioned datasets.

prasser commented 9 years ago

(1) Anonymizing distributed datasets in a privacy-preserving manner

You might want to take a look at the approach that we developed based on ARX:

Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Klaus A. Kuhn. A Flexible Approach to Distributed Data Anonymization. Journal of Biomedical Informatics, December 2013. http://dx.doi.org/10.1016/j.jbi.2013.12.002 (* Both authors contributed equally to this work.)

In this paper you will also find an overview of other potential solutions to this problem.

(2) Parallelizing ARX itself

We do have a private fork that prototypically parallelizes ARX to better exploit modern multicore architectures. We might add this functionality to ARX in a future release, depending on demand. We currently have no plans for developing a version of ARX that supports scale-out in a cluster. You should be able to run ARX with datasets consisting of tens of millions of records on current server hardware. If you experience any limitations, please let us know.

lordlinus commented 4 years ago

@prasser I am trying to use ARX in my existing Spark data ingestion pipelines and am looking for guidance. My original idea was to take the DataFrame, convert it into an ARX Data object and run the anonymizer, but I am not sure whether this approach would work for large datasets.
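
A minimal sketch of that idea, assuming the DataFrame fits into driver memory; the column name "age", the hierarchy file "ageHierarchy.csv", the k=2 setting and the helper class name SparkToArx are illustrative assumptions rather than details from this thread, and the ARX calls follow the API of recent releases:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.deidentifier.arx.ARXAnonymizer;
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.ARXResult;
import org.deidentifier.arx.AttributeType.Hierarchy;
import org.deidentifier.arx.Data;
import org.deidentifier.arx.criteria.KAnonymity;

public class SparkToArx {

    // Collect the DataFrame to the driver and copy it row by row into an ARX Data object.
    // collect() pulls everything into driver memory, so this only works for datasets
    // that fit on a single node -- which is exactly the concern for large inputs.
    static Data toArxData(Dataset<Row> df) {
        Data.DefaultData data = Data.create();
        String[] columns = df.columns();
        data.add(columns);                      // header row
        List<Row> rows = df.collectAsList();
        for (Row row : rows) {
            String[] values = new String[columns.length];
            for (int i = 0; i < columns.length; i++) {
                Object v = row.get(i);
                values[i] = (v == null) ? "" : v.toString();
            }
            data.add(values);
        }
        return data;
    }

    // Run a simple k-anonymity anonymization; the attribute type, the hierarchy file
    // and the parameter values are placeholders for a real configuration.
    static ARXResult anonymize(Data data) throws Exception {
        data.getDefinition().setAttributeType("age",
                Hierarchy.create("ageHierarchy.csv", StandardCharsets.UTF_8, ';'));
        ARXConfiguration config = ARXConfiguration.create();
        config.addPrivacyModel(new KAnonymity(2));
        config.setSuppressionLimit(0.05d);
        return new ARXAnonymizer().anonymize(data, config);
    }
}
```

Collecting to the driver is the main limitation for large datasets, which is what the partition-based approach sketched after the next comment avoids.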

prasser commented 4 years ago

I'm not very familiar with Spark, so it's hard for me to help without further details. In general, you need to create horizontal partitions, process the partitions independently and then merge the results. If a DataFrame allows you to implement this, then that is the right way to go.
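
A minimal sketch of that wiring in Spark, reusing the hypothetical SparkToArx.anonymize helper from the sketch above (the hierarchy file it references would need to be available on every executor). Whether the merged output of independently anonymized partitions still satisfies the chosen privacy model globally is a separate question, addressed by the paper cited earlier; this only shows the Spark side:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.deidentifier.arx.ARXResult;
import org.deidentifier.arx.Data;
import org.deidentifier.arx.DataHandle;

public class PartitionedAnonymization {

    // Anonymize each Spark partition independently with ARX and return the
    // transformed records as semicolon-joined strings; the union of the
    // per-partition outputs is the merged result.
    static Dataset<String> anonymizePartitions(Dataset<Row> df) {
        String[] columns = df.columns();
        return df.mapPartitions((MapPartitionsFunction<Row, String>) rows -> {
            // Materialize this partition and hand it to ARX (one ARX run per partition).
            Data.DefaultData data = Data.create();
            data.add(columns);
            while (rows.hasNext()) {
                Row row = rows.next();
                String[] values = new String[columns.length];
                for (int i = 0; i < columns.length; i++) {
                    Object v = row.get(i);
                    values[i] = (v == null) ? "" : v.toString();
                }
                data.add(values);
            }
            // Assumes ARX finds a solution for this partition.
            ARXResult result = SparkToArx.anonymize(data);
            DataHandle output = result.getOutput();
            List<String> anonymized = new ArrayList<>();
            for (int r = 0; r < output.getNumRows(); r++) {
                String[] values = new String[columns.length];
                for (int c = 0; c < columns.length; c++) {
                    values[c] = output.getValue(r, c);
                }
                anonymized.add(String.join(";", values));
            }
            return anonymized.iterator();
        }, Encoders.STRING());
    }
}
```

Each partition is materialized in executor memory and anonymized in a single ARX run, so partition sizes have to be chosen so that one partition fits on one executor.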