Open byndcivilization opened 8 years ago
Leaving this open for now. Once we are done with the initial deduplication and reconciliation processes and have some experience with them, we can decide whether we need to make improvements (ML, etc.). Assigning @davidmihalyi as arbitrator of whether the process is working well.
See #147 for a nice use case showing how entities need to be (or should be) merged.
We have a good first workflow for this. Further improvement will require more thinking in a next phase.
We need a deduplication process to identify possible matches in incoming data.
Solution #1: A simple fuzzy matching algorithm that attempts to match on one or a few fields (name plus some additional info).
Solution #2: An ML-assisted entity reconciliation process. This would use an ML method to derive a matching score that identifies possible duplicates. There would then need to be a UI either to merge the matched entities or, at a minimum, to display possible links on an entity page. The merge UI would show matches and partial matches (e.g. an exact match on the first word) for both project names and company names, and let the user confirm all or only some of them.
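To make Solution #1 a bit more concrete, here is a rough sketch of name-only fuzzy matching using Python's standard-library `difflib`. It is only an illustration, not the project's implementation: the function names, the sample names, and the `0.85` threshold are all assumptions, and a real version would probably compare additional fields as well. It also flags the "exact match of first word" case mentioned under Solution #2 as a partial match for a reviewer to confirm.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def first_word_matches(a: str, b: str) -> bool:
    """True when both names are non-empty and share the same first word."""
    words_a, words_b = normalize(a).split(), normalize(b).split()
    return bool(words_a) and bool(words_b) and words_a[0] == words_b[0]


def candidate_matches(names, threshold=0.85):
    """Return (name_a, name_b, kind, score) tuples for a human to review.

    kind is "match" when the score reaches the (assumed) threshold and
    "partial" when only the first word is identical.
    """
    results = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            results.append((a, b, "match", round(score, 2)))
        elif first_word_matches(a, b):
            results.append((a, b, "partial", round(score, 2)))
    return results


if __name__ == "__main__":
    # Made-up company names, purely for illustration.
    incoming = ["Acme Mining Ltd.", "ACME Mining Ltd", "Acme Energy", "Borealis Oil"]
    for row in candidate_matches(incoming):
        print(row)
```

Comparing every pair is quadratic in the number of names, so this would only suit small batches of incoming data; larger sets would likely need some kind of blocking (for example, only comparing names that share a first word) before scoring.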
Moved from #6