Developed for use by the NY Office of the Attorney General: a Python library for scalable entity resolution that uses active learning to learn blocking configurations, generate comparison pairs, and then classify matches.
Heavily inspired by the Architecture Patterns with Python book, I wondered what the domain layer might look like if it were completely unencumbered by database implementation logic. The oagdedupe/simple module is an attempt at that.
The primary use case here is n-gram schemes: a signature is a set of strings, and two signatures count as equal when their sets intersect.
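To make the intersection idea concrete, here is a minimal sketch. The names `char_ngrams`, `NgramSignature`, and `matches` are hypothetical and are not taken from oagdedupe/simple; the sketch also assumes character trigrams, which the description above does not specify.

```python
from dataclasses import dataclass
from typing import FrozenSet


def char_ngrams(text: str, n: int = 3) -> FrozenSet[str]:
    """Return the set of character n-grams of `text` (empty if text is shorter than n)."""
    return frozenset(text[i : i + n] for i in range(len(text) - n + 1))


@dataclass(frozen=True)
class NgramSignature:
    """Hypothetical signature type: a set of strings compared by overlap."""

    grams: FrozenSet[str]

    def matches(self, other: "NgramSignature") -> bool:
        # Two signatures "match" when they share at least one n-gram.
        return not self.grams.isdisjoint(other.grams)


a = NgramSignature(char_ngrams("jonathan"))
b = NgramSignature(char_ngrams("nathaniel"))
c = NgramSignature(char_ngrams("smith"))

print(a.matches(b))  # the two names share trigrams such as "nat"
print(a.matches(c))  # no trigram in common
```

The sketch uses a separate `matches` method rather than overriding `__eq__`, because overlap is not transitive (A can overlap B and B overlap C without A overlapping C), which would break the usual expectations for equality and hashing.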
The top-level API has a get_entities method that finds the best conjunctions (not implemented), gets pairs, classifies pairs, and clusters records based on those classifications.
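As a rough illustration of that flow, the sketch below chains pair generation, classification, and clustering. Only the name `get_entities` comes from the description; the helper names, the signatures, and the brute-force pairing are assumptions made for illustration, and the conjunction-finding step is left out because it is not implemented.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Record = str
Pair = Tuple[int, int]


def get_pairs(
    records: List[Record], signature: Callable[[Record], FrozenSet[str]]
) -> Set[Pair]:
    """Candidate pairs: records whose signatures share at least one element."""
    sigs = [signature(r) for r in records]
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if not sigs[i].isdisjoint(sigs[j])
    }


def cluster(n: int, matches: Set[Pair]) -> List[Set[int]]:
    """Connected components over matched pairs, via a small union-find."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in matches:
        parent[find(i)] = find(j)

    groups: Dict[int, Set[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


def get_entities(
    records: List[Record],
    signature: Callable[[Record], FrozenSet[str]],
    is_match: Callable[[Record, Record], bool],
) -> List[Set[int]]:
    """Get pairs, classify them, then cluster records by the matches."""
    pairs = get_pairs(records, signature)
    matches = {(i, j) for i, j in pairs if is_match(records[i], records[j])}
    return cluster(len(records), matches)


# Fakes standing in for a learned blocking scheme and a trained classifier.
records = ["jon smith", "john smith", "mary jones"]
entities = get_entities(
    records,
    signature=lambda r: frozenset(r.split()),
    is_match=lambda a, b: a[0] == b[0],
)
print(entities)
```

Because each step is a plain function over in-memory values, the fakes can be swapped for real implementations without touching `get_entities`, which is the kind of separation from database logic the module is aiming for.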
There's a lot here that isn't implemented yet, but basic tests using fakes pass. Interested in your thoughts, @chansooligans. I have some ideas on how this could be useful beyond being an interesting toy model.
Merging, as the changes are purely additive. (I just rebased onto master. I hope this doesn't mess anything up, but I will keep the commit history around in case it does.)