kfogel opened this issue 10 years ago
@cwebber, this might be an issue to check out if you have time and interest. This is definitely something that users across the subsites want.
Looking into it.
@kfogel To clarify: do you mean detecting duplicate issues on submission, or detecting while sorting through and curating existing lists of data?
It would be helpful to know:
@cwebber
There is actually already a rudimentary example of this process in the Bickerdike subsite. See bickerdike/users/all_users.php. Users are meant to search for an existing record, then either delete it or search for another record to merge it with. All the duplicate detection and matching is done by the user, in this example.
I hope that helps! Let me know if it isn't clear.
That helps a lot, @cecilia-donnelly! Thanks!
I talked about this with @kevinrak9 and @kfogel again today. The UI I discussed in this earlier comment (https://github.com/OpenTechStrategies/lisc-ttm/issues/36#issuecomment-67038230) can actually be simplified, at least for Enlace's needs. We discussed adding a checkbox next to each name, with a merge button at the bottom of the list, on the "Participants" page when search results show up. Most of the time, Kevin said, users will just see two identical names next to each other and will be able to choose to merge those. We'll create an interface for choosing which values to retain in the eventual combined record.
This would take:

- adding a column, something like `participant_id`, recording that one participant record has been merged with the other.
- finding the queries that `SELECT * FROM Participants` and editing each of them to exclude duplicates, based on that column.

Talked with @kfogel about this one and decided that the thing to do is to combine the first two bullet points above. We don't need to offer a UI for combining surveys, program/session membership, and attendance because the participants won't have conflicts there (they may have redundancy, but that can be managed from the "merged-to" participant profile). So, the workflow will be:
The other data will be assigned to the merge-to participant automatically. Later, we could add an interface for re-checking the metadata against the merged-from participant, if desired. The work for removing the "merged-from" / deprecated participants will be as described in https://github.com/OpenTechStrategies/lisc-ttm/issues/36#issuecomment-170679192.
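The automatic reassignment described above could look roughly like this. This is a minimal sketch in Python against an in-memory SQLite database; the table and column names (`Participants`, `merged_to`, `Attendance`, `participant_id`) are assumptions for illustration, not taken from the real lisc-ttm schema.

```python
import sqlite3

# Illustrative schema: a Participants table with a merged_to column, plus
# one related table (Attendance) keyed by participant_id. The real app
# would also have surveys and program/session membership tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Participants (id INTEGER PRIMARY KEY, name TEXT, merged_to INTEGER);
    CREATE TABLE Attendance (id INTEGER PRIMARY KEY, participant_id INTEGER, date TEXT);
    INSERT INTO Participants (id, name) VALUES (1, 'A. Smith'), (2, 'A Smith');
    INSERT INTO Attendance (participant_id, date) VALUES (2, '2016-01-15');
""")

def merge_participants(conn, merge_from, merge_to):
    """Deprecate merge_from and hand its related data to merge_to."""
    # Reassign related data to the merge-to participant; only Attendance
    # is modeled here, but surveys and membership would move the same way.
    conn.execute("UPDATE Attendance SET participant_id = ? WHERE participant_id = ?",
                 (merge_to, merge_from))
    # A non-null merged_to marks the merged-from record as deprecated.
    conn.execute("UPDATE Participants SET merged_to = ? WHERE id = ?",
                 (merge_to, merge_from))
    conn.commit()

merge_participants(conn, merge_from=2, merge_to=1)
```

The point of the sketch is that the user only picks which values to keep; the reassignment of related rows needs no UI at all.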
@cecilia-donnelly I think the process you described would be great for what we're looking for. Also, would deprecated participant records be deleted, or marked in a separate column? I wasn't sure.
Hey @kevinrak9!
Glad this process sounds right to you. I think we'd mark the deprecated records, not delete them. So, we'd add a new column to the `Participants` table called `merged_to` or similar. The value of that column would be null for non-deprecated (regular) participants. When two participants are merged, the merged-from participant's `merged_to` column would be updated with the `id` of the new master participant.
Say I have Cecilia Donnelly with ID 41 and Cecelia Donnelly with ID 77, and I want to mark them as duplicates and keep participant 41 as the master copy. I'd go through the workflow above, and when I was done, participant 77 would have a `merged_to` column with a value of 41. Participant 41 would still have null in her `merged_to` column.
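The bookkeeping in that example can be sketched directly (in-memory SQLite here; column names follow this thread but are not copied from the actual lisc-ttm schema):

```python
import sqlite3

# Two near-duplicate participants, as in the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Participants (id INTEGER PRIMARY KEY, name TEXT, merged_to INTEGER)")
conn.executemany("INSERT INTO Participants (id, name) VALUES (?, ?)",
                 [(41, "Cecilia Donnelly"), (77, "Cecelia Donnelly")])

# Merge participant 77 into participant 41: 77 becomes deprecated,
# 41 stays the master copy with merged_to still null.
conn.execute("UPDATE Participants SET merged_to = 41 WHERE id = 77")

rows = dict(conn.execute("SELECT id, merged_to FROM Participants"))
print(rows)  # {41: None, 77: 41}
```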
Once we have that in place, we'd update the various places where the code checks the `Participants` table so that the queries exclude the deprecated participants -- that is, those whose `merged_to` column has a non-null value. So, I'd only see Cecilia Donnelly, participant 41, in the list of search results, but participant 77 would still exist in the db -- we might think about adding a new "all deprecated participants" export, or similar.
Does that help?
Yes, that helps - thanks for explaining!
:+1:
Both TRP and LSNA have expressed a desire to be able to detect (and eliminate / merge) duplicate records. Usually these are records of participants; for example, Sueily at LSNA mentioned that duplicate detection would improve this page:
https://ttm.lisc-chicago.org/lsna/participants/new_participant.php
Filing this as one unified issue for now. If it turns out that very different duplicate-detection methods are needed for different places, then we can break this out into sub-issues, but I hope we can implement a largely shared mechanism.