chansooligans / oagdedupe

Developed for use by the NY Office of the Attorney General: a Python library for scalable entity resolution that uses active learning to learn blocking configurations, generate comparison pairs, and then classify matches
https://oagdedupe.readthedocs.io/en/latest/
MIT License

91 in database option for clustering connected components #129

Open chansooligans opened 1 year ago

chansooligans commented 1 year ago

The purpose of this PR is to abstract out the clustering logic and move the "get_connected_components" function into the repository. For the Postgres repository, "get_connected_components" can be implemented with the pgRouting extension, which adds network analysis functions to Postgres.
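For reference, the computation that a "get_connected_components" step performs can be sketched in plain Python. This is an illustrative union-find over an edge list, not the code in this PR; the function name `connected_components` and its edge-list input are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def connected_components(
    edges: Iterable[Tuple[Hashable, Hashable]]
) -> List[Set[Hashable]]:
    """Group nodes into connected components with union-find (hypothetical helper)."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Register unseen nodes as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components joined by each edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect nodes under their final root.
    groups: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())
```

Each returned set corresponds to one cluster of matched records. An in-database version trades this in-memory pass for a single query, which is the motivation for moving the logic behind the repository.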

chansooligans commented 1 year ago

Need to work on query for record linkage

"left" dataframe entities should only link to "right" dataframe entities. Also debug to make sure the test script run_rl.py works as expected.

TRUNCATE TABLE {settings.db.db_schema}.clusters;
INSERT INTO {settings.db.db_schema}.clusters (cluster, _index, _type)
SELECT 
    component as cluster, 
    ABS(node) as _index, -- right-hand nodes were negated in the edge query, so take the magnitude
    CASE 
        WHEN node >= 0 THEN True
        ELSE False
    END as _type 
FROM pgr_connectedComponents(
        'SELECT
            ROW_NUMBER() OVER (ORDER BY _index_l,_index_r) as id,
            _index_l as source,
            -1*_index_r as target,
            score as cost
        FROM {settings.db.db_schema}.scores'
    );
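The query puts "left" and "right" records into one graph by negating the right-hand index. A minimal Python sketch of that encoding follows; the helper names are hypothetical and not part of the library. It also shows one thing to check while debugging: if `_index` values can be 0, left record 0 and right record 0 map to the same node. The shifted variant is only one possible workaround, not something this PR implements.

```python
from typing import Tuple


def encode(index: int, is_left: bool) -> int:
    """Signed encoding as in the query: right-hand indices are negated."""
    return index if is_left else -index


def decode(node: int) -> Tuple[int, bool]:
    """Recover (index, is_left) from a signed node id."""
    return abs(node), node >= 0


def encode_shifted(index: int, is_left: bool) -> int:
    """Hypothetical alternative: shift right-hand ids by one so 0 never collides."""
    return index if is_left else -(index + 1)


def decode_shifted(node: int) -> Tuple[int, bool]:
    """Inverse of encode_shifted."""
    return (node, True) if node >= 0 else (-node - 1, False)


# The plain signed encoding cannot tell left 0 from right 0.
collision = encode(0, True) == encode(0, False)
```

With the shifted variant, `decode_shifted(encode_shifted(0, False))` gives back `(0, False)`, so every record keeps its side after clustering.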