18F / rdbms-subsetter

Generates a subset of a relational database that respects foreign key constraints
Creative Commons Zero v1.0 Universal

Optionally obscure sensitive information in subset #1

Open catherinedevlin opened 9 years ago

catherinedevlin commented 9 years ago

For protecting PII and similar sensitive data. The subsetter should be able to integrate with an existing library to obscure data while preserving its overall "flavor".

twekberg commented 9 years ago

This would be useful in my group (the Laboratory Medicine department at the University of Washington Medical Center), which has databases containing PHI (http://en.wikipedia.org/wiki/Protected_health_information). After test data is extracted from such a database, the PHI must be mangled before the data is stored in a repository.

Perhaps by specifying the PHI columns and their data types, the program could generate random data for that data type. This could work well for scalar types.
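To make the suggestion above concrete, here is a minimal sketch of type-driven replacement. It is not part of rdbms-subsetter: the names `random_value` and `obscure_rows`, the type labels, and the sample rows are all hypothetical, and rows are assumed to arrive as plain dictionaries.

```python
import random
import string
from datetime import date, timedelta


def random_value(col_type, rng):
    """Return a random value for a declared scalar type (hypothetical type labels)."""
    if col_type == "integer":
        return rng.randint(0, 999_999)
    if col_type == "float":
        return round(rng.uniform(0, 1000), 2)
    if col_type == "text":
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    if col_type == "date":
        return date(1970, 1, 1) + timedelta(days=rng.randint(0, 18_000))
    raise ValueError(f"unsupported column type: {col_type!r}")


def obscure_rows(rows, sensitive_columns, seed=None):
    """Replace every listed column with random data; leave other columns untouched."""
    rng = random.Random(seed)
    return [
        {
            col: random_value(sensitive_columns[col], rng)
            if col in sensitive_columns
            else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Made-up sample data, for illustration only.
rows = [
    {"id": 1, "patient_name": "Alice Example", "birth_date": date(1980, 5, 17), "result": 4.2},
    {"id": 2, "patient_name": "Bob Example", "birth_date": date(1975, 1, 3), "result": 5.1},
]
sensitive = {"patient_name": "text", "birth_date": "date"}
obscured = obscure_rows(rows, sensitive, seed=0)
```

Random replacement like this only suits columns that are not primary or foreign keys. Randomizing a key column independently in each table would break the referential integrity the subset is meant to preserve.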

dstufft commented 9 years ago

I would find this useful too.

dstufft commented 9 years ago

To be specific, my use case is that I'm the primary developer of Warehouse, which will replace the software that powers PyPI. One of the challenges there (Warehouse being an OSS project itself) is creating a public dataset that is representative of the real data without being the entire set of real data and without exposing anything sensitive. Currently I do this by manually copying some data and then manually sanitizing it. It would be great to be able to rely on rdbms-subsetter to automate this for me.

brki commented 8 years ago

I agree that this would be a great feature.

For the moment, I'm first extracting a subset, then using another tool to do the anonymization.
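As an illustration of that two-step workflow, the sketch below runs an anonymization pass over an already-extracted subset. Everything in it is an assumption made for the example: the `users` table, its columns, the `pseudonym` helper, and the placeholder key. It uses an in-memory SQLite database in place of a real subset and does not represent any particular anonymization tool.

```python
import hashlib
import hmac
import sqlite3

# Placeholder key for the sketch; a real run would load a secret from outside the repo.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonym(value):
    """Map a value to a stable, keyed pseudonym so equal inputs stay equal."""
    if value is None:
        return None
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:10]}"


# Stand-in for the subset produced in the first step (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Alice Example", "alice@example.com"),
        (2, "Bob Example", "bob@example.com"),
    ],
)

# Second step: expose the helper to SQL and overwrite the sensitive columns in place.
conn.create_function("pseudonym", 1, pseudonym)
conn.execute(
    "UPDATE users SET name = pseudonym(name), "
    "email = pseudonym(email) || '@example.invalid'"
)
conn.commit()

anonymized = conn.execute("SELECT id, name, email FROM users ORDER BY id").fetchall()
```

A keyed hash keeps the mapping consistent, so the same original value yields the same pseudonym wherever it appears. Primary keys are left untouched, so foreign key relationships in the subset stay valid.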