Open reginafcompton opened 5 years ago
@gregmundy and I talked about the above document. We identified a few immediate next steps:
[x] Research. Research solutions for master patient indexes: how do they manage deduplication and merging records, and how do they handle large, diverse datasets? Medical industry: https://www.gao.gov/assets/700/696426.pdf Census: https://census.gov/library/working-papers/2014/adrm/carra-wp-2014-01.html
[ ] Task. Experiment with clustering using dedupe on some sample data. See ar_app_indv.csv in the DSS Sample Data Google Drive directory.
[ ] Research. For the memory question: can we run dedupe in multiple threads, e.g., in multiple containers?
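As a concrete starting point for the clustering experiment, here is a minimal, library-free sketch of the steps dedupe automates: block records on a cheap key, score candidate pairs within each block with a string similarity, and collapse matching pairs into clusters with union-find. The field names, toy records, and 0.8 threshold are all illustrative assumptions, not taken from ar_app_indv.csv (where dedupe would learn the weights and cutoff from labeled training pairs instead).

```python
from difflib import SequenceMatcher
from itertools import combinations
from collections import defaultdict

# Toy records; the real field names in ar_app_indv.csv will differ (assumption).
records = {
    1: {"name": "Jon Smith",  "dob": "1980-01-02"},
    2: {"name": "John Smith", "dob": "1980-01-02"},
    3: {"name": "Mary Jones", "dob": "1975-06-30"},
}

def block_key(rec):
    # Cheap blocking key: first letter of name + birth year. Only records
    # sharing a block are ever compared, which is what keeps memory bounded.
    return (rec["name"][0].upper(), rec["dob"][:4])

def similarity(a, b):
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

# Union-find so matched pairs collapse transitively into clusters.
parent = {rid: rid for rid in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
def union(a, b):
    parent[find(a)] = find(b)

blocks = defaultdict(list)
for rid, rec in records.items():
    blocks[block_key(rec)].append(rid)

THRESHOLD = 0.8  # assumed cutoff; dedupe learns this from training data
for ids in blocks.values():
    for a, b in combinations(ids, 2):
        if similarity(records[a], records[b]) >= THRESHOLD:
            union(a, b)

clusters = defaultdict(list)
for rid in records:
    clusters[find(rid)].append(rid)
print(sorted(clusters.values()))  # [[1, 2], [3]]
```

Because blocks are independent, this structure also hints at an answer to the memory question above: each block (or group of blocks) could in principle be scored in a separate process or container, with only the pair results gathered centrally.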
Dedupe may be useful here for several of your requirements.
@gregmundy see my research results at the end of the planning document: https://docs.google.com/document/d/12K9p7RgLwmAHXKM0lNG_kmsNGHbAzOhN90AU_rtn5C4/edit#heading=h.kt0y9act2nxn
If you think of other resources, please send them my way.
@robinsonkwame read my document, friend! That's the tool I recommend...though admittedly, I am a little relieved that you recommend it, too.
Ah, I just read through the write-up now; apologies for not doing so earlier. Regarding retraining with unsupervised machine learning, I can easily envision a process that:
For learning probability distributions, I would recommend looking into TGAN first; although it does not appear to support online or out-of-core learning, it is designed for large-scale data.
We have a strong case for never merging records and instead only linking them (in a separate table). See notes here: https://docs.google.com/document/d/1A3_zQxccHxuK6RMPvE562EosqYNF3E6Xk-wOBa0dGd0/edit#
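To make the "link, don't merge" idea concrete, here is a sketch using an in-memory SQLite link table. The table and column names are assumptions for illustration, not from the planning doc. The point is that source rows stay untouched: each discovered match only adds a row to a separate table, so links can be audited, re-scored, or reversed without data loss.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT);
    -- Links live in their own table; the records themselves are never merged.
    CREATE TABLE record_links (
        record_id  INTEGER REFERENCES records(id),
        cluster_id INTEGER,          -- cluster label from a dedupe run
        score      REAL              -- confidence of the match
    );
""")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(1, "Jon Smith"), (2, "John Smith"), (3, "Mary Jones")])
# Pretend a dedupe run decided records 1 and 2 are the same person.
conn.executemany("INSERT INTO record_links VALUES (?, ?, ?)",
                 [(1, 100, 0.95), (2, 100, 0.95), (3, 101, 1.0)])

# All raw rows belonging to cluster 100 -- nothing was overwritten.
rows = conn.execute("""
    SELECT r.id, r.name FROM records r
    JOIN record_links l ON l.record_id = r.id
    WHERE l.cluster_id = 100
    ORDER BY r.id
""").fetchall()
print(rows)  # [(1, 'Jon Smith'), (2, 'John Smith')]
```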
This document describes a solution using DataMade's dedupe. It only covers deduplication as a batch process; it does not describe a solution for deduplicating on a per-user, per-POST-request basis.