Duplicate Detection: Architecture + UI Layer

AN-BC commented 5 months ago

Building on top of the @memberjunction/ai-vectors package, we will want to implement a duplicate detection architecture that is generic for any entity in MJ and usable in a non-visual way with an object model, as well as a visual way.
Vector DB functionality allows us to automatically define Entity Documents for any entity that can be used for various purposes. The idea here is that a system admin would define an "Entity Document" for a given entity where duplicate checking is desired. In those entities, an entity document is a layout of information from the entity record that is optimized for similarity search. The Vector DB functionality will automatically synchronize embeddings/vector store with the entity data using the @memberjunction/ai-vector-sync package. We need to get this vector sync going on a recurring basis and need standardized infrastructure to run packages of any type on a schedule in MJ too (perhaps break this out as its own issue).
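To make the Entity Document idea concrete, here is a minimal TypeScript sketch. The `EntityDocumentTemplate` type, the `${Field}` placeholder syntax, and `renderEntityDocument` are all hypothetical and are not the API of @memberjunction/ai-vectors. The only point is that a record gets flattened into one text string, which is what would then be embedded.

```typescript
// Hypothetical shapes for illustration only; not the @memberjunction/ai-vectors API.
type EntityRecord = Record<string, string | number | null | undefined>;

interface EntityDocumentTemplate {
  entityName: string;
  // Text layout with ${FieldName} placeholders (the placeholder syntax is an assumption).
  template: string;
}

// Render a record into the text that would be sent to an embedding model.
// Missing or null fields become empty strings, and whitespace is collapsed
// so that trivial formatting differences do not change the text.
function renderEntityDocument(doc: EntityDocumentTemplate, record: EntityRecord): string {
  const text = doc.template.replace(/\$\{(\w+)\}/g, (_match, field: string) => {
    const value = record[field];
    return value === null || value === undefined ? "" : String(value);
  });
  return text.replace(/\s+/g, " ").trim();
}

const contactDoc: EntityDocumentTemplate = {
  entityName: "Contacts",
  template: "Name: ${FirstName} ${LastName}\nEmail: ${Email}\nCompany: ${Company}",
};

const renderedContact = renderEntityDocument(contactDoc, {
  FirstName: "Jane",
  LastName: "Doe",
  Email: "jane@example.com",
  Company: null,
});

console.log(renderedContact); // Name: Jane Doe Email: jane@example.com Company:
```

Which fields go into the template is the part a system admin would tune per entity, since it decides what "similar" means for that entity.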
Once the above is established (auto-sync of a given entity and all of its database records, with embeddings stored in a vector DB of choice; right now only Pinecone is supported, but anyone can create a vector DB plug-in for other vector storage providers), we need to implement logic that uses cosine similarity search in the vector DB to detect duplicates. I believe @JS-BC has already done some testing along these lines. We want to build non-visual functionality that does this in a tier-independent manner. I'm envisioning an object model where you pass in an Entity Name/ID and either a RecordID, an array of RecordIDs, or possibly a ListID or ViewID; that object then runs a dupe search for each record provided and comes back with results. We probably want to store the results in the database, because the process won't be as cut and dried as running the automated process and then merging the duplicates. Each "Candidate Duplicate" will have a confidence score between 0 and 1. Also, before even using the vector DB, we will want to run an "absolute duplicate" search. This should be a new piece of metadata for a given entity where you can define an SP, with an input/output shape that we define, that can do whatever it wants to look for absolute dupes. For example, for person-type records it might look for an exact email match, and for org-level records an exact domain match. This would be MJ-implementation specific, and part of implementing MJ would be defining these SPs for each entity. Records returned by those SPs would come back with a probability score of 1, reflecting that they are absolute dupes. Then we'd run the vector search and get back 0-1 rankings.
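A rough sketch of how the two passes could fit together, in TypeScript. Everything here is hypothetical: `CandidateDuplicate`, `DuplicateProviders`, and `detectDuplicates` are illustrative names, not existing MJ classes, and the in-memory providers stand in for the SP call and the vector DB query. The absolute pass runs first and always scores 1; the vector pass then adds any remaining matches with their cosine similarity.

```typescript
// Hypothetical sketch: none of these names come from the MJ packages.
interface CandidateDuplicate {
  recordId: string;       // record being checked
  matchRecordId: string;  // record it may duplicate
  score: number;          // confidence between 0 and 1
  source: "absolute" | "vector";
}

// Cosine similarity between two embeddings of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding lengths differ");
  let dot = 0, normA = 0, normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The two lookups are injected so the core logic stays tier-independent.
interface DuplicateProviders {
  findAbsoluteMatches(recordId: string): string[]; // stands in for the per-entity SP
  findVectorMatches(recordId: string): { matchRecordId: string; score: number }[];
}

function detectDuplicates(recordIds: string[], providers: DuplicateProviders): CandidateDuplicate[] {
  const results: CandidateDuplicate[] = [];
  for (const recordId of recordIds) {
    const absoluteIds = providers.findAbsoluteMatches(recordId);
    const alreadyMatched = new Set(absoluteIds);
    for (const matchRecordId of absoluteIds) {
      results.push({ recordId, matchRecordId, score: 1, source: "absolute" });
    }
    for (const m of providers.findVectorMatches(recordId)) {
      // Skip self-matches and anything the absolute pass already reported.
      if (m.matchRecordId === recordId || alreadyMatched.has(m.matchRecordId)) continue;
      const score = Math.min(1, Math.max(0, m.score)); // clamp into [0, 1]
      results.push({ recordId, matchRecordId: m.matchRecordId, score, source: "vector" });
    }
  }
  return results;
}

// In-memory stand-ins for the SP and the vector DB (sample data is made up).
const sampleRecords = [
  { id: "a", email: "x@example.com", embedding: [1, 0] },
  { id: "b", email: "y@example.com", embedding: [1, 0.1] },
  { id: "c", email: "x@example.com", embedding: [0, 1] },
];

const inMemoryProviders: DuplicateProviders = {
  findAbsoluteMatches: (recordId) => {
    const current = sampleRecords.find((r) => r.id === recordId);
    if (!current) return [];
    return sampleRecords
      .filter((r) => r.id !== recordId && r.email === current.email)
      .map((r) => r.id);
  },
  findVectorMatches: (recordId) => {
    const current = sampleRecords.find((r) => r.id === recordId);
    if (!current) return [];
    return sampleRecords
      .filter((r) => r.id !== recordId)
      .map((r) => ({ matchRecordId: r.id, score: cosineSimilarity(current.embedding, r.embedding) }));
  },
};

const candidates = detectDuplicates(["a"], inMemoryProviders);
console.log(candidates);
```

For record `a`, this reports `c` as an absolute dupe (same email, score 1) and `b` as a vector candidate with a score just under 1. Resolving a ListID or ViewID into RecordIDs would happen before this call and is not shown.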
We would probably want a configurable threshold setting in each entity dupe setup that determines what score counts as a "potential dupe". Some orgs may want to cast a wider net and say 0.7 is good; others might say 0.8, 0.9, etc.
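As a small illustration of the threshold idea, in TypeScript: `EntityDupeSettings` and `filterPotentialDupes` are hypothetical names, and the scores are made-up sample values.

```typescript
// Hypothetical per-entity setting; the names are illustrative only.
interface EntityDupeSettings {
  entityName: string;
  potentialDupeThreshold: number; // e.g. 0.7, 0.8, 0.9
}

interface ScoredMatch {
  matchRecordId: string;
  score: number; // 0..1
}

// Keep only matches at or above the entity's threshold, best first.
function filterPotentialDupes(matches: ScoredMatch[], settings: EntityDupeSettings): ScoredMatch[] {
  const threshold = settings.potentialDupeThreshold;
  if (!(threshold >= 0 && threshold <= 1)) {
    throw new RangeError("Threshold must be between 0 and 1");
  }
  return matches
    .filter((m) => m.score >= threshold)
    .sort((x, y) => y.score - x.score);
}

const sampleMatches: ScoredMatch[] = [
  { matchRecordId: "m2", score: 0.75 },
  { matchRecordId: "m1", score: 0.95 },
  { matchRecordId: "m3", score: 0.6 },
];

const wideNet = filterPotentialDupes(sampleMatches, { entityName: "Contacts", potentialDupeThreshold: 0.7 });
const strict = filterPotentialDupes(sampleMatches, { entityName: "Contacts", potentialDupeThreshold: 0.9 });

console.log(wideNet.map((m) => m.matchRecordId)); // [ 'm1', 'm2' ]
console.log(strict.map((m) => m.matchRecordId));  // [ 'm1' ]
```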
The above process would populate a set of new entities that we need to build, "Potential Duplicate Runs" and "Potential Duplicates", to track the process noted above. The outcome could just be a manual review of each record, or you could have, on a per-run basis, the ability to "auto-merge" records that are above a certain confidence level, like 0.99. We will consider the "Potential Duplicate Run" record to be open so long as any "Potential Duplicate" records within it are marked as "Pending" rather than "Merged" or "Ignored". Once a record is marked as "Ready to Merge" or similar, a process will pick it up and run the existing merge functionality that we already have built into the system (https://memberjunction.github.io/MJ/classes/_memberjunction_core.Metadata.html#MergeRecords). @JS-BC started building out some of this already, but I don't recall exactly where it stands. I think it is single-record based, without logging data as described here, but @JS-BC can comment on the status.
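The run/record status rules could be sketched like this in TypeScript. The type and function names are hypothetical, and one detail is an assumption on my part: "Ready to Merge" is treated as still open, since the merge has not happened yet. The actual merge would go through the existing MergeRecords functionality linked above and is deliberately not shown.

```typescript
// Hypothetical status model; names mirror the wording above but are not real MJ entities.
type PotentialDuplicateStatus = "Pending" | "Ready to Merge" | "Merged" | "Ignored";

interface PotentialDuplicate {
  id: string;
  score: number; // 0..1
  status: PotentialDuplicateStatus;
}

// A run stays open until every record in it is resolved.
// Assumption: "Ready to Merge" still counts as open because the merge has not run yet.
function isRunOpen(items: PotentialDuplicate[]): boolean {
  return items.some((i) => i.status === "Pending" || i.status === "Ready to Merge");
}

// Optional per-run auto-merge: pending records at or above the threshold are
// flagged "Ready to Merge" so a separate process can pick them up.
// Returns new objects and leaves the input untouched.
function applyAutoMerge(items: PotentialDuplicate[], autoMergeThreshold?: number): PotentialDuplicate[] {
  return items.map((i): PotentialDuplicate =>
    autoMergeThreshold !== undefined && i.status === "Pending" && i.score >= autoMergeThreshold
      ? { ...i, status: "Ready to Merge" }
      : { ...i }
  );
}

const runItems: PotentialDuplicate[] = [
  { id: "pd1", score: 0.995, status: "Pending" },
  { id: "pd2", score: 0.8, status: "Pending" },
];

const afterAutoMerge = applyAutoMerge(runItems, 0.99);
console.log(afterAutoMerge.map((i) => i.status)); // [ 'Ready to Merge', 'Pending' ]
console.log(isRunOpen(afterAutoMerge));           // true

const resolvedItems: PotentialDuplicate[] = [
  { id: "pd1", score: 0.995, status: "Merged" },
  { id: "pd2", score: 0.8, status: "Ignored" },
];
console.log(isRunOpen(resolvedItems));            // false
```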
Once we have the non-visual layer built, doing the process completely end to end with logging, etc., we then need to build a UI on top. The UI will need the ability to invoke a duplicate search on an individual record, on a view of records, or on a selected group of records within a view. Kicking it off would create a duplicate record request that runs in the background and creates what is noted in the item above. When it is done, a user notification would be fired, as happens when Skip completes, and the user could review that record. That record would have a custom Entity Form that allows a user to review the dupe result and approve/decline merge candidates in bulk or individually. Of course, with the object model someone could write code to do all of this with whatever logic they wanted, but the UI is important to have as a default.
Finally, we need to implement a way to do this as part of data ingestion. Most users of MJ will use Azure Data Factory (ADF) to pipe in raw data. We need to be able to easily trigger a duplicate search after a given ADF run (or do so on a recurring basis) where we are looking for duplicates, but in a slightly different way than noted thus far. The idea for this requirement is to look for similar records across entities. For example, if we have a student table in an education schema and a member table in a crm schema, we want to match records between those tables and populate/maintain a new table called "PersonLink", or similar, in the common schema. This is the core concept for auto-linking records across schemas, which can then be used for analytics of various kinds, especially by Skip.
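A minimal TypeScript sketch of the cross-entity linking idea. All names are hypothetical (`LinkableRecord`, `PersonLink` as a row shape, `linkAcrossEntities`), and the scorer is a simple token-overlap (Jaccard) measure used purely as a stand-in for the vector similarity search; it keeps the example self-contained. Each source record is linked to its single best target at or above the threshold.

```typescript
// Hypothetical shapes; "PersonLink" follows the table name suggested above.
interface LinkableRecord {
  id: string;
  text: string; // flattened record text, e.g. name plus email
}

interface PersonLink {
  sourceEntity: string;
  sourceRecordId: string;
  targetEntity: string;
  targetRecordId: string;
  score: number;
}

// Stand-in scorer: Jaccard overlap of lowercased tokens (not the vector search).
function tokenOverlap(a: LinkableRecord, b: LinkableRecord): number {
  const tokensA = new Set(a.text.toLowerCase().split(/\s+/).filter(Boolean));
  const tokensB = new Set(b.text.toLowerCase().split(/\s+/).filter(Boolean));
  let shared = 0;
  tokensA.forEach((t) => { if (tokensB.has(t)) shared++; });
  const union = tokensA.size + tokensB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Link each source record to its best-scoring target at or above the threshold.
function linkAcrossEntities(
  sourceEntity: string,
  sources: LinkableRecord[],
  targetEntity: string,
  targets: LinkableRecord[],
  scorer: (a: LinkableRecord, b: LinkableRecord) => number,
  threshold: number
): PersonLink[] {
  const links: PersonLink[] = [];
  for (const source of sources) {
    let best: { targetId: string; score: number } | undefined;
    for (const target of targets) {
      const score = scorer(source, target);
      if (score >= threshold && (best === undefined || score > best.score)) {
        best = { targetId: target.id, score };
      }
    }
    if (best !== undefined) {
      links.push({
        sourceEntity,
        sourceRecordId: source.id,
        targetEntity,
        targetRecordId: best.targetId,
        score: best.score,
      });
    }
  }
  return links;
}

// Made-up sample data for the student/member example.
const students: LinkableRecord[] = [
  { id: "s1", text: "Jane Doe jane@example.com" },
  { id: "s2", text: "Bob Roe bob@example.com" },
];
const members: LinkableRecord[] = [
  { id: "m1", text: "jane doe jane@example.com" },
  { id: "m2", text: "Alice Poe alice@example.com" },
];

const personLinks = linkAcrossEntities("education.student", students, "crm.member", members, tokenOverlap, 0.8);
console.log(personLinks);
```

Here only `s1` is linked (to `m1`, score 1); `s2` has no target above the threshold and gets no row. Keeping the table maintained across repeated ADF runs (updating or retiring stale links) is a separate concern that this sketch does not cover.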

I was thinking this would be a great project for @cadam11 and @JS-BC to attack together post 1.0?

AN-BC commented 3 months ago

@cadam11 I see this is closed, but I think parts 5 and 6 above are not done. I think we need a review/demo of this functionality from @JS-BC to the team so we can all opine on the functionality in the UI and see if we need to enlist UX support to finish this.

As for part (6) - I think we need @hiltongr to opine on that as he will be leading efforts to use this on a variety of user projects.

Can we reopen - or perhaps split out a new issue with those two items to track them?

JS-BC commented 3 months ago

@AN-BC Correct, parts 5 and 6 are not yet complete

AN-BC commented 3 months ago

Let’s reopen the issue for now then

From: JS-BC
Sent: Thursday, May 23, 2024 9:58 AM
To: MemberJunction/MJ
Cc: Amith Nagarajan; Mention
Subject: Re: [MemberJunction/MJ] Duplicate Detection: Architecture + UI Layer (Issue #93)

@AN-BC Correct, parts 5 and 6 are not yet complete

