J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Data Corruptors a la GeCO #175

Open aflaxman opened 2 years ago

aflaxman commented 2 years ago

I've been developing some data corruption algorithms, inspired by the documentation at https://dmm.anu.edu.au/geco/flex-data-gen-manual.pdf. I have not looked at the source code, since it has an unusual license. I wonder if your excellent project would be interested in some pull requests to incorporate Python implementations into your recordlinkage.datasets submodule.

I'm imagining methods such as corrupt.ocr_noise(s: str) -> str. If this is of interest, I can put together a PR or use this ticket to discuss the design further. And if this is beyond the scope of what you want for your module, I understand!
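
To make the idea concrete, here is a minimal sketch of what an OCR-style corruptor with that signature might look like. Everything beyond the `ocr_noise(s: str) -> str` shape is an assumption for illustration: the `OCR_CONFUSIONS` table, its specific character pairs, and the extra `p` and `rng` parameters are hypothetical, and none of it is taken from GeCo or from recordlinkage.

```python
import random
from typing import Optional

# Hypothetical confusion table (an assumption, not taken from GeCo):
# each key is a character sequence that OCR might misread, mapped to
# visually similar replacements.
OCR_CONFUSIONS = {
    "0": ["O"],
    "O": ["0"],
    "1": ["l", "I"],
    "l": ["1", "I"],
    "I": ["1", "l"],
    "5": ["S"],
    "S": ["5"],
    "m": ["rn"],
    "rn": ["m"],
    "d": ["cl"],
    "cl": ["d"],
}

# Longest key length, so multi-character patterns like "rn" are tried first.
_MAX_KEY_LEN = max(len(key) for key in OCR_CONFUSIONS)


def ocr_noise(s: str, p: float = 0.1, rng: Optional[random.Random] = None) -> str:
    """Return a copy of `s` with OCR-like substitutions.

    Each position that matches a key in OCR_CONFUSIONS is replaced with
    probability `p`. Pass a seeded `rng` for reproducible corruption.
    """
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(s):
        # Find the longest confusable pattern starting at position i.
        match = None
        for length in range(_MAX_KEY_LEN, 0, -1):
            candidate = s[i:i + length]
            if candidate in OCR_CONFUSIONS:
                match = candidate
                break
        if match is not None and rng.random() < p:
            # Substitute the whole pattern and skip past it.
            out.append(rng.choice(OCR_CONFUSIONS[match]))
            i += len(match)
        else:
            # Keep the original character unchanged.
            out.append(s[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    seeded = random.Random(12345)
    print(ocr_noise("Smith, 1050 Main Road", p=0.5, rng=seeded))
```

Taking an explicit `rng` keeps the corruption reproducible, which seems useful when generating benchmark datasets, and a per-match probability `p` lets the caller control the noise level. Both are design suggestions open to discussion.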