WFU-TLC / flc_discussion_board

A repository for discussing questions and issues in the Data Analysis with R (FLC)
https://wfu-tlc.github.io/

Anonymize data #21

Open medewitt opened 5 years ago

medewitt commented 5 years ago

Hi all,

EJ and I were having a conversation about anonymizing data. There are some easy tricks, like using the row number as a fake id. Another, more robust way is to hash the identifiers using a random seed. This employs an algorithm that makes the new ids harder to crack. The important thing is to set the random seed so that the result is reproducible.
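To make the two ideas concrete, here is a minimal sketch using the digest package directly. The `people` data frame, the `salted_hash()` helper, and the 8-letter salt are all hypothetical and only for illustration; this is not necessarily how any particular package does it internally.

```r
library(digest)

# Toy data standing in for real identifiers (hypothetical).
people <- data.frame(
  name  = c("Ada", "Grace", "Linus"),
  score = c(90, 85, 77),
  stringsAsFactors = FALSE
)

# Approach 1: use the row number as a fake id.
people$id_row <- seq_len(nrow(people))

# Approach 2: salted SHA-256 hash.
# set.seed() makes the randomly drawn salt reproducible between runs.
set.seed(1834)
salt <- paste(sample(letters, 8, replace = TRUE), collapse = "")

# Hypothetical helper: append the salt to each value, then hash the string.
salted_hash <- function(x, salt) {
  vapply(
    paste0(x, salt),
    digest,
    character(1),
    algo = "sha256",
    serialize = FALSE,
    USE.NAMES = FALSE
  )
}

people$id_hash <- salted_hash(people$name, salt)
```

The row-number id is simple but depends on row order, while the hashed id depends only on the value and the salt. Anyone who knows the salt can hash a guessed name and compare, so the seed or salt should be kept private.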

I like to use the anonymizer package with the sha256 hashing algorithm. As you can see, I create a new column that holds the hashed name.

library(anonymizer)
library(tidyverse)

# Set seed for reproducibility
set.seed(1834)

# Make an anon field that is reproducible.
# Move the row names into a column before as_tibble(),
# because current versions of as_tibble() drop row names.
my_cars <- mtcars %>%
  rownames_to_column(var = "car") %>%
  as_tibble() %>%
  mutate(car_anon = anonymize(car, .algo = "sha256"))

medewitt commented 5 years ago

And this is what it looks like:

car                car_anon
Mazda RX4          7abdc134e82e67321959fc4a43837295248e41f8063c43fd3b68c1b676bee261
Mazda RX4 Wag      9f7b2a362f12a1b2d80db1ea50780c9fab682de64667ff2de9caf550bbb0e852
Datsun 710         1669a3901b84e1abe1d3a672951608e145a2147560cf9f1a05b47ef5cc10394f
Hornet 4 Drive     79f036a540bd5749185c4faa18a2b2357cf5fd40b2e3e1b92961554d304bf1ee
Hornet Sportabout  1aa0f96674d523365c79a7e0d077aabf1692f59a853c637a0cfa27f7abcf567b
Valiant            2927c30caeceee049a533fc612615ed2d7125cf7f5c860db10ea8fd9643beed9