kfogel opened this issue 10 years ago
@cwebber, this might be an issue to check out if you have time and interest. This is definitely something that users across the subsites want.
Looking into it.
@kfogel To clarify: do you mean detecting duplicate issues on submission, or detecting while sorting through and curating existing lists of data?
It would be helpful to know:
@cwebber
There is actually already a rudimentary example of this process in the Bickerdike subsite. See bickerdike/users/all_users.php. Users are meant to search for an existing record, then either delete it or search for another record to merge it with. All the duplicate detection and matching is done by the user, in this example.
I hope that helps! Let me know if it isn't clear.
That helps a lot, @cecilia-donnelly! Thanks!
I talked about this with @kevinrak9 and @kfogel again today. The UI I discussed in this earlier comment (https://github.com/OpenTechStrategies/lisc-ttm/issues/36#issuecomment-67038230) can actually be simplified, at least for Enlace's needs. We discussed adding a checkbox next to each name, with a merge button at the bottom of the list, on the "Participants" page when search results show up. Most of the time, Kevin said, users will just see two identical names next to each other and will be able to choose to merge those. We'll create an interface for choosing which values to retain in the eventual combined record.
This would take:

- adding a column, something like `participant_id`, recording that one participant record has been merged with the other.
- finding the queries that `SELECT * FROM Participants` and editing each of them to exclude duplicates, based on that column.

Talked with @kfogel about this one and decided that the thing to do is to combine the first two bullet points above. We don't need to offer a UI for combining surveys, program/session membership, and attendance because the participants won't have conflicts there (they may have redundancy, but that can be managed from the "merged-to" participant profile). So, the workflow will be:
The other data will be assigned to the merge-to participant automatically. Later, we could add an interface for re-checking the metadata against the merged-from participant, if desired. The work for removing the "merged-from" / deprecated participants will be as described in https://github.com/OpenTechStrategies/lisc-ttm/issues/36#issuecomment-170679192.
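The automatic reassignment described above could look roughly like this. This is a minimal sketch in Python against an in-memory SQLite database; the table and column names (`Participants`, `merged_to`, `Attendance`, `participant_id`) are assumptions for illustration, not taken from the real lisc-ttm schema.

```python
import sqlite3

# Illustrative schema: a Participants table with a merged_to column, plus
# one related table (Attendance) keyed by participant_id. The real app
# would also have surveys and program/session membership tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Participants (id INTEGER PRIMARY KEY, name TEXT, merged_to INTEGER);
    CREATE TABLE Attendance (id INTEGER PRIMARY KEY, participant_id INTEGER, date TEXT);
    INSERT INTO Participants (id, name) VALUES (1, 'A. Smith'), (2, 'A Smith');
    INSERT INTO Attendance (participant_id, date) VALUES (2, '2016-01-15');
""")

def merge_participants(conn, merge_from, merge_to):
    """Deprecate merge_from and hand its related data to merge_to."""
    # Reassign related data to the merge-to participant; only Attendance
    # is modeled here, but surveys and membership would move the same way.
    conn.execute("UPDATE Attendance SET participant_id = ? WHERE participant_id = ?",
                 (merge_to, merge_from))
    # A non-null merged_to marks the merged-from record as deprecated.
    conn.execute("UPDATE Participants SET merged_to = ? WHERE id = ?",
                 (merge_to, merge_from))
    conn.commit()

merge_participants(conn, merge_from=2, merge_to=1)
```

The point of the sketch is that the user only picks which values to keep; the reassignment of related rows needs no UI at all.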
@cecilia-donnelly I think the process you described would be great for what we're looking for. Also, would deprecated participant records be deleted, or marked in a separate column? I wasn't sure.
Hey @kevinrak9!
Glad this process sounds right to you. I think we'd mark the deprecated records, not delete them. So, we'd add a new column to the `Participants` table called `merged_to` or similar. The value of that column would be null for non-deprecated (regular) participants. When two participants are merged, the merged-from participant's `merged_to` column would be updated with the `id` of the new master participant.
Say I have Cecilia Donnelly with ID 41 and Cecelia Donnelly with ID 77, and I want to mark them as duplicates and keep participant 41 as the master copy. I'd go through the workflow above, and when I was done, participant 77 would have a `merged_to` column with a value of 41. Participant 41 would still have null in her `merged_to` column.
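The bookkeeping in that example can be sketched directly (in-memory SQLite here; column names follow this thread but are not copied from the actual lisc-ttm schema):

```python
import sqlite3

# Two near-duplicate participants, as in the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Participants (id INTEGER PRIMARY KEY, name TEXT, merged_to INTEGER)")
conn.executemany("INSERT INTO Participants (id, name) VALUES (?, ?)",
                 [(41, "Cecilia Donnelly"), (77, "Cecelia Donnelly")])

# Merge participant 77 into participant 41: 77 becomes deprecated,
# 41 stays the master copy with merged_to still null.
conn.execute("UPDATE Participants SET merged_to = 41 WHERE id = 77")

rows = dict(conn.execute("SELECT id, merged_to FROM Participants"))
print(rows)  # {41: None, 77: 41}
```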
Once we have that in place, we'd update the various places where the code checks the `Participants` table so that the queries exclude the deprecated participants -- that is, those whose `merged_to` column has a non-null value. So, I'd only see Cecilia Donnelly, participant 41, in the list of search results, but participant 77 would still exist in the db -- we might think about adding a new "all deprecated participants" export, or similar.
Does that help?
Yes, that helps - thanks for explaining!
:+1:
Both TRP and LSNA have expressed a desire to be able to detect (and eliminate / merge) duplicate records. Usually these are records of participants; for example, Sueily at LSNA mentioned that duplicate detection would improve this page:
https://ttm.lisc-chicago.org/lsna/participants/new_participant.php
Filing this as one unified issue for now. If it turns out that very different duplicate-detection methods are needed for different places, then we can break this out into sub-issues, but I hope we can implement a largely shared mechanism.