chansooligans / oagdedupe

Developed for Use by NY Office of the Attorney General: A Python library for scalable entity resolution, using active learning to learn blocking configurations, generate comparison pairs, then clasify matches
https://oagdedupe.readthedocs.io/en/latest/
MIT License
2 stars 1 forks source link

decoupling compute: distance #111

Closed chansooligans closed 2 years ago

chansooligans commented 2 years ago

The distance submodule computes distances between attributes. For string-type attributes, it computes string distances using metrics like jaro-winkler similarity or levenshtein. For numeric-type, the absolute difference. And so on...

So to apply the bridge pattern, the abstraction would be an abstraction of this distance class. And it has an Engine abstraction. An engine can use any of postgresql, sqlite, pandas, spark for computations