HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

Add the possibility of fixing the seed when anonymizing data #21

Open lorenz-gorini opened 4 years ago

lorenz-gorini commented 4 years ago

The function pd_extras.anonymize_database.anonymize_data splits a DataFrame containing private information into two DataFrames: one holding only the private information and one holding all the other data. The two resulting DataFrames are linked to each other by an ID_OWNER column. The values of this column are created by adding nonces (a random prefix and suffix) to each string that contains all the private information, and then hashing the resulting strings with SHA256. Because the prefix and suffix are random, the ID_OWNER values differ from one run to the next, so it would be useful to be able to fix the random seed and make the output reproducible.
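To make the request concrete, here is a minimal sketch of how a seed could be threaded through the nonce generation. It is not the PyTrousse implementation: the names make_owner_id, anonymize_rows, random_seed and nonce_length are hypothetical, and plain lists of dicts stand in for the DataFrames so that the example needs only the standard library.

```python
import hashlib
import random
import string


def make_owner_id(private_values, rng, nonce_length=12):
    """Hash the private values wrapped in a random prefix and suffix (nonces).

    Hypothetical helper: `rng` is a random.Random instance, so the caller
    decides whether the nonces are reproducible.
    """
    alphabet = string.ascii_letters + string.digits
    prefix = "".join(rng.choices(alphabet, k=nonce_length))
    suffix = "".join(rng.choices(alphabet, k=nonce_length))
    payload = prefix + "".join(str(v) for v in private_values) + suffix
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def anonymize_rows(rows, private_keys, random_seed=None):
    """Split rows into private and public parts linked by ID_OWNER.

    Hypothetical stand-in for anonymize_data. With random_seed=None the
    generator is seeded from the operating system, so IDs differ per run.
    """
    rng = random.Random(random_seed)  # local generator, global state untouched
    private_rows, public_rows = [], []
    for row in rows:
        owner_id = make_owner_id([row[k] for k in private_keys], rng)
        private_part = {k: row[k] for k in private_keys}
        public_part = {k: v for k, v in row.items() if k not in private_keys}
        private_rows.append({"ID_OWNER": owner_id, **private_part})
        public_rows.append({"ID_OWNER": owner_id, **public_part})
    return private_rows, public_rows


if __name__ == "__main__":
    data = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 41}]
    first = anonymize_rows(data, ["name"], random_seed=42)
    second = anonymize_rows(data, ["name"], random_seed=42)
    print(first == second)  # same seed, same ID_OWNER values
```

Using a local random.Random instance keeps the seed from affecting other code that relies on the global generator. Note that a fixed seed makes the nonces reproducible for anyone who knows it, so it is probably best kept as an opt-in for tests and reproducible pipelines rather than the default.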