chansooligans / oagdedupe

Developed for use by the NY Office of the Attorney General: a Python library for scalable entity resolution, using active learning to learn blocking configurations, generate comparison pairs, then classify matches
https://oagdedupe.readthedocs.io/en/latest/
MIT License

split get_inverted_index_stats() and add_new_comparisons() to smaller functions #119

Closed chansooligans closed 1 year ago

chansooligans commented 1 year ago

As an example of making the business logic clearer, consider the get_inverted_index_stats() function. It would be nice to separate it into "build inverted index" and "get inverted index stats" in the logic and not just in the repository. The problem is that get_inverted_index_stats() is a single query: it uses temp tables to build the inverted indices, obtains comparison pairs from them, then computes stats using these pairs.

I can fix this by splitting it into two functions. One function builds the inverted index and saves it in its own table instead of a temp table. A second function computes the stats from that table.
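
A minimal sketch of that split, using SQLite from the standard library so it runs anywhere. The `blocks` table, the `first_letter` scheme, and the `inverted_index_<scheme>` naming are assumptions made for illustration, not the library's actual schema or queries.

```python
import sqlite3


def build_inverted_index(conn, scheme):
    """Persist (signature, id) rows for one blocking scheme in a real table."""
    # `scheme` is interpolated as an identifier, so it must come from trusted code.
    table = f"inverted_index_{scheme}"  # hypothetical naming convention
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(
        f"CREATE TABLE {table} AS SELECT {scheme} AS signature, id FROM blocks"
    )
    return table


def get_inverted_index_stats(conn, table):
    """Count the comparison pairs implied by an already-built inverted index."""
    # A signature shared by n records yields n * (n - 1) / 2 candidate pairs.
    (n_pairs,) = conn.execute(
        f"SELECT COALESCE(SUM(n * (n - 1) / 2), 0) "
        f"FROM (SELECT COUNT(*) AS n FROM {table} GROUP BY signature)"
    ).fetchone()
    return {"table": table, "n_pairs": n_pairs}


def demo():
    """Build a toy `blocks` table, then run the two steps separately."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocks (id INTEGER, first_letter TEXT)")
    conn.executemany(
        "INSERT INTO blocks VALUES (?, ?)",
        [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "b"), (6, "c")],
    )
    table = build_inverted_index(conn, "first_letter")
    return get_inverted_index_stats(conn, table)
```

Because the index now lives in its own table, the stats step can be rerun or tested without rebuilding the index.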

Can I do insert statements in parallel?

chansooligans commented 1 year ago

need to solve this first: https://github.com/chansooligans/oagdedupe/issues/120

(To build the inverted index in a separate function, that function needs to store it. Creating tables in parallel is okay, but I need to make sure there are no conflicts, i.e. two processes trying to create the same table. This should not happen with proper caching, but proper caching is not in place in the current multiprocessing implementation.)
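
One way to rule out the conflict is to deduplicate the work before dispatching it, so each table name is handed to exactly one worker. The sketch below is an assumption about how that could look, not the current implementation: `unique_schemes` and `build_all` are hypothetical helpers, and a thread pool stands in for the process pool to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def unique_schemes(conjunctions):
    """Return each blocking scheme once, in first-seen order."""
    return list(dict.fromkeys(s for conj in conjunctions for s in conj))


def build_all(conjunctions, build_fn, max_workers=4):
    """Run build_fn once per unique scheme, so no table is created twice."""
    schemes = unique_schemes(conjunctions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `schemes`.
        tables = list(pool.map(build_fn, schemes))
    return dict(zip(schemes, tables))
```

Here `build_fn` would be whatever function creates the table for a single scheme; conjunctions that share a scheme then reuse the same stored index instead of racing to create it.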

chansooligans commented 1 year ago

I can just add "build_inverted_index" to the abstract repo, even if the postgres implementation simply returns query strings for it.
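
Roughly, that could look like the sketch below. The class names, method signatures, and SQL are illustrative assumptions, not the repository's actual interface.

```python
from abc import ABC, abstractmethod


class BaseRepository(ABC):
    """Hypothetical abstract repo exposing the two steps separately."""

    @abstractmethod
    def build_inverted_index(self, scheme, table):
        """Build and store the inverted index for one blocking scheme."""

    @abstractmethod
    def get_inverted_index_stats(self, table):
        """Compute stats from a previously stored inverted index."""


class PostgresRepository(BaseRepository):
    """Hypothetical postgres repo: each step just returns a query string."""

    def build_inverted_index(self, scheme, table):
        # Assumed source table `blocks` with an `id` column.
        return (
            f"CREATE TABLE {table} AS "
            f"SELECT {scheme} AS signature, id FROM blocks"
        )

    def get_inverted_index_stats(self, table):
        return f"SELECT signature, COUNT(*) AS n FROM {table} GROUP BY signature"
```

The business logic can then call the two steps in order through the abstract interface, regardless of whether a backend executes the work itself or only supplies SQL.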