brighthive / master-client-index

BrightHive's Master Client Index framework.
MIT License

Initial plan for deduplication #11

Open reginafcompton opened 5 years ago

reginafcompton commented 5 years ago

This document describes a solution using DataMade's dedupe. It covers deduplication as a batch process only: it does not describe a solution for deduplicating on a per-user, per-POST-request basis.
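To make the batch idea concrete, here is a minimal sketch of a batch pass: compare every pair of records once, then group the records that look alike. It is an illustration only and does not use dedupe's API. It assumes a hypothetical `name` field, uses the standard library's `difflib` as a stand-in for a trained model, and the `0.85` threshold is an arbitrary choice.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0..1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def batch_cluster(records, threshold=0.85):
    """Group record ids whose names score at or above `threshold`.

    `records` maps a record id to a dict with a hypothetical "name" field.
    The threshold is an assumption, not a value from the planning document.
    """
    # Union-find structure: every record starts in its own cluster.
    parent = {rid: rid for rid in records}

    def find(rid):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    # Batch step: score all pairs and join the ones that match.
    for a, b in combinations(records, 2):
        if similarity(records[a]["name"], records[b]["name"]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(ids) for ids in clusters.values())


# Made-up sample data.
records = {
    1: {"name": "Jane Smith"},
    2: {"name": "Jane Smyth"},
    3: {"name": "Carlos Rivera"},
}
clusters = batch_cluster(records)
```

Comparing all pairs is quadratic, so this only works for small batches. A real tool narrows the candidate pairs before scoring them.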

reginafcompton commented 5 years ago

@gregmundy and I talked about the above document. We identified a few immediate new steps:

robinsonkwame commented 5 years ago

Dedupe may be useful here for several of your requirements.

reginafcompton commented 5 years ago

@gregmundy see my research results at the end of the planning document: https://docs.google.com/document/d/12K9p7RgLwmAHXKM0lNG_kmsNGHbAzOhN90AU_rtn5C4/edit#heading=h.kt0y9act2nxn

If you think of other resources, please send them my way.

reginafcompton commented 5 years ago

@robinsonkwame read my document, friend! That's the tool I recommend...though admittedly, I am a little relieved that you recommend it, too.

robinsonkwame commented 5 years ago

Ah, I just read through the write-up now, apologies for not doing so earlier. Regarding retraining and unsupervised machine learning, I can easily envision a process that:

For learning probability distributions, I would recommend looking into TGAN first. Although it does not appear to support online or out-of-core learning, it is designed for large-scale data.

reginafcompton commented 5 years ago

We have a strong case for never merging records and instead only linking them (in a separate table). See notes here: https://docs.google.com/document/d/1A3_zQxccHxuK6RMPvE562EosqYNF3E6Xk-wOBa0dGd0/edit#
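A rough sketch of what linking instead of merging could look like, using SQLite from the Python standard library. The table and column names (`client`, `client_link`, `cluster_id`) and the helper `linked_clients` are hypothetical, not taken from the notes. The point is that the original rows are never modified and a link can be undone by deleting a row in the link table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Source records: never updated or deleted by deduplication.
    CREATE TABLE client (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Hypothetical link table: records sharing a cluster_id are
    -- considered the same person.
    CREATE TABLE client_link (
        client_id  INTEGER PRIMARY KEY REFERENCES client(id),
        cluster_id INTEGER NOT NULL
    );
    """
)

# Made-up sample data.
conn.executemany(
    "INSERT INTO client (id, name) VALUES (?, ?)",
    [(1, "Jane Smith"), (2, "Jane Smyth"), (3, "Carlos Rivera")],
)
conn.executemany(
    "INSERT INTO client_link (client_id, cluster_id) VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 101)],
)


def linked_clients(conn, client_id):
    """Return (id, name) for every record linked to `client_id`, itself included."""
    return conn.execute(
        """
        SELECT c.id, c.name
        FROM client_link AS mine
        JOIN client_link AS other ON other.cluster_id = mine.cluster_id
        JOIN client AS c ON c.id = other.client_id
        WHERE mine.client_id = ?
        ORDER BY c.id
        """,
        (client_id,),
    ).fetchall()


def unlink(conn, client_id):
    """Undo a link without touching the source record."""
    conn.execute("DELETE FROM client_link WHERE client_id = ?", (client_id,))
```

Calling `linked_clients(conn, 1)` returns both Jane records, and `unlink(conn, 2)` reverses the match while leaving all three `client` rows in place.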