cdaller / multi_anonymizer

Anonymize connected data in multiple csv or xml files
Apache License 2.0

auto anonymisation by regex or dictionary #5

Open pax opened 5 months ago

pax commented 5 months ago

It would be awesome to have an auto / multi-column anonymisation feature, using regexes where possible (IBAN, card numbers, email, national personal identification code) and dictionaries [1] for names / geo names, per country.

In a lot of cases (some bank statements) variables/attributes are not stand-alone, but bundled in one cell.

[1] name-dataset, forebears.io, firstname-database, topics/surnames.

cdaller commented 4 months ago

You mean that you do not have separate CSV columns or JSON/XML properties, but a mix of name/email/IBAN in one column, and you still want to anonymize the data?

The library used to anonymize can handle country specific anonymization, just use the --locale parameter. Then the names/addresses/etc. will be country specific.

pax commented 4 months ago

> have a mix of name/email/iban in one column

yes, annoyingly so

also, I would imagine other cases of columns with long text content that might contain strings that need anonymisation
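
To illustrate the kind of cell I mean, here is a rough sketch in plain Python (standard library only). Everything in it is an assumption for illustration: the patterns, the `EMAIL_1` / `IBAN_1` placeholder format and the function names `anonymize_cell` and `anonymize_csv` are hypothetical and not something multi_anonymizer provides. The IBAN pattern is deliberately loose: it only catches the compact form without spaces and does no checksum validation.

```python
import csv
import io
import re

# Hypothetical, deliberately loose patterns (illustration only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def make_replacer(prefix, seen):
    """Return a re.sub callback that maps each distinct match to a stable placeholder."""
    def replace(match):
        original = match.group(0)
        if original not in seen:
            seen[original] = f"{prefix}_{len(seen) + 1}"
        return seen[original]
    return replace


def anonymize_cell(text, mapping):
    """Replace every email and IBAN-like string found inside one free-text cell."""
    text = EMAIL_RE.sub(make_replacer("EMAIL", mapping.setdefault("email", {})), text)
    text = IBAN_RE.sub(make_replacer("IBAN", mapping.setdefault("iban", {})), text)
    return text


def anonymize_csv(source, column, mapping=None):
    """Anonymize one column of CSV text; the same value always gets the same placeholder."""
    mapping = {} if mapping is None else mapping
    reader = csv.DictReader(io.StringIO(source))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = anonymize_cell(row[column], mapping)
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "date,details\n"
        "2024-03-01,Transfer from john.doe@example.com to RO49AAAA1B31007593840000\n"
        "2024-03-02,Refund john.doe@example.com\n"
    )
    print(anonymize_csv(sample, "details"))
```

The shared `mapping` dictionary is what keeps the data "connected": the same email or IBAN gets the same placeholder in every row, and passing the same dictionary to several calls would extend that across files.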

[Screenshot 2024-03-28 at 15 08 01]

cdaller commented 4 months ago

Hello Alex,

I added a feature that allows matching text in CSV cells using regular expressions. See readme.md for details. I hope this resolves your issue!

Have fun.... Christof