J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Other options to store record pairs #54

Open J535D165 opened 6 years ago

J535D165 commented 6 years ago

Record pairs are stored in pandas.MultiIndex objects. For several users, this object is hard to understand. It would be nice to add an option to store record pairs in other formats, like numpy arrays or even Python sets.
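For context, a minimal sketch of what such conversions might look like using only standard pandas and numpy calls. The index names and example pairs are made up for illustration, and none of this is an existing recordlinkage option.

```python
import numpy as np
import pandas as pd

# Hypothetical record pairs, in the MultiIndex form used for candidate links.
pairs = pd.MultiIndex.from_tuples(
    [(0, 1), (0, 2), (3, 4)], names=["rec_id_1", "rec_id_2"]
)

# 2-D numpy array with one row per pair and one column per index level.
pairs_array = np.array(pairs.tolist())

# Python set of (rec_id_1, rec_id_2) tuples, handy for membership tests.
pairs_set = set(pairs)

# Plain two-column DataFrame, which some users may find easier to read.
pairs_frame = pairs.to_frame(index=False)

# Converting back from the array to a MultiIndex.
restored = pd.MultiIndex.from_arrays(
    [pairs_array[:, 0], pairs_array[:, 1]], names=pairs.names
)
```

The array and set forms drop the level names, so a round trip has to supply them again, as in the last statement.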

mayerantoine commented 5 years ago

How do you see this change: (1) as an option in the classifier's init, or (2) as a miscellaneous function that converts record pairs to numpy arrays or Python sets? Please provide more details.

I am currently working on a function that receives the match_index (a pandas.MultiIndex) and returns a list of tuples, each grouping all the matched record_ids. My next step would be to assign a unique id to each group. The idea is that de-duplicating a dataframe would automatically generate a unique id for all matches. Would this be useful for PRLT? Where would you see this integrated in the API?
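A rough sketch of the grouping idea, assuming matches are transitive (if A matches B and B matches C, all three share a group) and using a small union-find. The function names `group_matches` and `assign_group_ids` and the `group_id` column are hypothetical and not part of the recordlinkage API.

```python
import pandas as pd


def group_matches(match_index):
    """Group matched record ids into tuples (connected components)."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_index:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for rec in parent:
        groups.setdefault(find(rec), []).append(rec)
    return [tuple(sorted(members)) for members in groups.values()]


def assign_group_ids(df, match_index, column="group_id"):
    """Return a copy of df with one id per group of matched records."""
    record_to_group = {}
    for group_id, members in enumerate(group_matches(match_index)):
        for rec in members:
            record_to_group[rec] = group_id

    # Records without any match get their own fresh id.
    next_id = len(set(record_to_group.values()))
    ids = []
    for rec in df.index:
        if rec not in record_to_group:
            record_to_group[rec] = next_id
            next_id += 1
        ids.append(record_to_group[rec])

    out = df.copy()
    out[column] = ids
    return out


# Made-up example: records 0, 1, 2 are duplicates, as are 3 and 4.
matches = pd.MultiIndex.from_tuples([(0, 1), (1, 2), (3, 4)])
people = pd.DataFrame({"name": ["ann", "anne", "an", "bob", "bobby", "cy"]})
deduplicated = assign_group_ids(people, matches)
```

Whether transitive grouping is the right behaviour depends on the use case, since one weak link can chain otherwise unrelated records into the same group.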