bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Updated apply_regex_anonymization #404

Closed ianyu93 closed 2 years ago

ianyu93 commented 2 years ago

anonymization.apply_regex_anonymization previously used only the ID key from muliwai.pii_regexes.regex_rulebase, which resulted in faulty anonymization. With this update, all keys (NER types such as NORP, AGE, etc.) are included.
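
To illustrate the difference, here is a minimal sketch that contrasts applying only the ID rules with applying the rules under every key. The rulebase here is a hypothetical, simplified stand-in (a flat dict mapping an NER type to a list of compiled patterns), and the names TOY_RULEBASE, anonymize_id_only, and anonymize_all_keys are assumptions for illustration. This is not the actual structure of muliwai.pii_regexes.regex_rulebase or the real signature of apply_regex_anonymization.

```python
import re

# Hypothetical, simplified rulebase: NER type -> list of compiled patterns.
# The real muliwai rulebase may be structured differently.
TOY_RULEBASE = {
    "ID": [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")],
    "AGE": [re.compile(r"\b\d{1,3} years old\b")],
}


def anonymize_id_only(text, rulebase):
    """Old behaviour (as described above): only the ID rules are applied."""
    for pattern in rulebase.get("ID", []):
        text = pattern.sub("<ID>", text)
    return text


def anonymize_all_keys(text, rulebase):
    """Updated behaviour (as described above): rules under every key are applied."""
    for ner_type, patterns in rulebase.items():
        for pattern in patterns:
            # Replace each match with a tag named after its NER type.
            text = pattern.sub(f"<{ner_type}>", text)
    return text


sample = "She is 42 years old and her ID is 123-45-6789."
id_only_result = anonymize_id_only(sample, TOY_RULEBASE)
all_keys_result = anonymize_all_keys(sample, TOY_RULEBASE)
```

In this toy setup, the ID-only version leaves the age untouched, while the all-keys version replaces both the age and the ID.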

For the test cases in test_anonymization.py, fake.address() is changed to fake.street_address(), because fake.address() returns a full address that contains \n.