LeapBeyond / scrubadub

Clean personally identifiable information from dirty dirty text.
http://scrubadub.readthedocs.io/
Apache License 2.0

Fuzzy regex matching #80

Open thomasbird opened 3 years ago

thomasbird commented 3 years ago

Due to typos or OCR errors, regex patterns may fail to match text that they probably should, e.g. a capital O typed instead of a zero in a British postcode, where letters and digits are not usually interchangeable.

It might be interesting to allow regexes to be matched fuzzily, and the regex package allows this! https://pypi.org/project/regex/#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109

We should investigate using it instead of the built-in re module.
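
To make the postcode case concrete, here is a minimal sketch of what the regex package's fuzzy syntax could look like. The postcode pattern is a simplified shape invented for illustration, not scrubadub's actual pattern, and the variable names are hypothetical.

```python
import regex  # third-party package: pip install regex

# Simplified UK postcode shape, for illustration only (not scrubadub's pattern).
POSTCODE = r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}"

exact = regex.compile(POSTCODE)
# {s<=1} permits at most one substituted character anywhere in the group.
fuzzy = regex.compile(r"(?:" + POSTCODE + r"){s<=1}")

typo = "SW1A OAA"  # capital O typed where the digit 0 belongs

exact_match = exact.fullmatch(typo)
fuzzy_match = fuzzy.fullmatch(typo)

print(exact_match)               # None: the exact pattern rejects the typo
print(fuzzy_match.fuzzy_counts)  # (substitutions, insertions, deletions)
```

Restricting the fuzziness to substitutions with {s<=1} keeps the match the same length as a real postcode; {e<=1} would also allow one inserted or deleted character.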

aCampello commented 3 years ago

Yes, that should be a really good approach. It seems regex is backwards compatible with re, so we can swap it in!

We have to figure out exactly how many errors we will allow, and perhaps default to 0 to stay backwards compatible, but I can envisage that every detector that detects RegexFilth should be able to have an 'exact' regex and its approximate counterpart.
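
One possible shape for this, as a rough sketch only: the helper name compile_with_errors and its max_errors parameter are hypothetical and not part of scrubadub. With the default of 0 the exact pattern is compiled unchanged, and a positive value wraps it in a fuzzy group.

```python
import regex  # third-party package: pip install regex


def compile_with_errors(pattern, max_errors=0):
    """Hypothetical helper: compile `pattern`, tolerating up to `max_errors` edits."""
    if max_errors < 0:
        raise ValueError("max_errors must be non-negative")
    if max_errors == 0:
        # Default: behaves like the ordinary exact pattern.
        return regex.compile(pattern)
    # {e<=N} caps the combined number of substitutions, insertions and deletions.
    return regex.compile(r"(?:%s){e<=%d}" % (pattern, max_errors))


# Toy email pattern, for illustration only.
EMAIL = r"[a-z]+@[a-z]+\.com"
strict = compile_with_errors(EMAIL)
lenient = compile_with_errors(EMAIL, max_errors=1)

sample = "alice@example,com"  # comma typed instead of a dot
strict_result = strict.fullmatch(sample)
lenient_result = lenient.fullmatch(sample)
```

Here strict_result is None while lenient_result is a match. Wrapping the whole pattern in one group is the simplest option; patterns that begin with global inline flags would need those flags moved outside the group.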