Closed: kitenetter closed this issue 2 months ago
The BSBI's database has an equivalent 'duplicate' validation concept and I think this could be a useful addition to iRecord, but with some caveats:
Guidance notes should make clear that 'Duplicate' should only apply to records that are literal duplicates of another copy of the same occurrence record. Particularly where verifiers are orientated toward grid-based atlas compilation, there is a temptation to mark any repeat observations in the same square as duplicates (even though the records are not identical); that needs to be discouraged.
For data from structured external databases such as RSPB I think the focus ought to be on avoiding import of subsets of duplicate data, rather than marking the records as duplicate afterwards.
In addition to formal duplicate annotations on records, the BSBI's system also has an optional filter that hides duplicates based on matching taxon, date and grid reference. That approach can help to reduce the workload for verifiers without rejecting records that, for other purposes, may be distinct (or include different metadata). We also often use this filter for reporting purposes when counting 'number of records', because, for historic reasons, we have a high rate of duplicates (~20%) that would otherwise distort reported statistics.
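The taxon-date-gridref filter described above can be sketched as follows. This is a minimal illustration, not BSBI's actual implementation: the dict keys (`taxon`, `date`, `gridref`, `id`) are assumed field names for the example only.

```python
def hide_duplicates(records):
    """Keep only the first record seen for each (taxon, date, gridref) key.

    Records sharing all three values are treated as duplicates for
    display/reporting purposes, even if other metadata differs.
    """
    seen = {}
    for rec in records:
        key = (rec["taxon"], rec["date"], rec["gridref"])
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

records = [
    {"id": 1, "taxon": "Quercus robur", "date": "2018-06-01", "gridref": "SP1234"},
    {"id": 2, "taxon": "Quercus robur", "date": "2018-06-01", "gridref": "SP1234"},
    {"id": 3, "taxon": "Quercus robur", "date": "2018-06-02", "gridref": "SP1234"},
]
print([r["id"] for r in hide_duplicates(records)])  # [1, 3]
```

Note this hides record 2 without rejecting it: the underlying data is untouched, which is why the same filter is safe to reuse for record counts.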
I just flag them as "Incorrect" and put a comment on the record along the lines of "Duplicate 1234567". If we are going to have essentially a "Duplicate" validation status (and a new button to go along with it) then we need the ability to cross-reference back to the original "correct" record.
@MNSmith I agree with your approach when a record is duplicated within iRecord itself, but to me it seems that there is a bigger issue of how to deal with records on iRecord that are duplicates of records already held in external recording scheme databases. An increasing number of people wish to upload records from spreadsheets to iRecord, and inevitably some of these uploaded records will have made their way to recording schemes via other routes in the past. So the issue is how we deal with records that are duplicates of offline records, and in that case it seems wrong to mark them as incorrect, for the reasons given above.
A few points from my perspective:

1. I'm not sure this is a verification record status as such, because a duplicate could be a duplicate of an accepted record or a rejected record. So I think "duplicated" is a flag that stands on its own.
2. If we had a field called "duplicate of", then this could store either an integer record ID (if the record is on iRecord), or a URL/external reference to the record (rather like an LSID), or perhaps just the name of the system holding the record master copy. The duplicated flag would effectively be whether this field has a value or not.
3. This field could then be set by verifiers, or preferably set by adding the value to the import file so the record is known to be duplicated the moment it arrives in the system (and therefore does not appear for verifiers unless they deliberately change their filter).
This would mean the change specification would be:

1. Add a field, occurrences.duplicate_of (varchar), so that it is available for imports.
2. Update any import guidance to explain the use of the field.
3. Update the verification form to make it easy to set this field's value.
4. Add this field to the reporting cache tables (cache_occurrences_functional).
5. Update the verification form so the default filter is always "duplicate_of is null".
6. Update the standard filter parameters so that the duplicate_of field can be searched on, with options to show only duplicates or to exclude them.
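Points 5 and 6 above amount to one configurable filter with three modes. A minimal sketch, assuming records are dicts with a `duplicate_of` key that is `None` when the record is not a duplicate (the mode names here are illustrative, not the actual filter parameter values):

```python
def filter_by_duplicate_status(records, mode="exclude"):
    """Filter records by duplicate status.

    mode='exclude' -> hide duplicates (the proposed default verifier view,
                      i.e. "duplicate_of is null")
    mode='only'    -> show only flagged duplicates
    mode='include' -> no filtering
    """
    if mode == "exclude":
        return [r for r in records if r.get("duplicate_of") is None]
    if mode == "only":
        return [r for r in records if r.get("duplicate_of") is not None]
    return list(records)
```

With this default, an imported record that arrives with a duplicate_of value never appears in the verifier's grid unless they deliberately switch mode.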
@kitenetter can I ask you to confirm if you agree with the above please.
@johnvanbreda those proposals look very good to me.
Question: if a recorder uploads a record and puts something in the occurrences.duplicate_of field, but a verifier subsequently decides that the record is in fact new to them, and thus wants to include it as part of a recording scheme dataset, would they need to edit the record to set occurrences.duplicate_of back to null? Or could they just add a normal verification flag to the record for it to be included in any subsequent scheme download?
And to generalise that, I think we are saying that any record that is flagged as a duplicate will likely remain in the "Not reviewed" category, but if a verifier specifically adds a verification status then that status will take precedence over the duplicate flag for reporting and download purposes.
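That precedence rule could look something like the sketch below. It is illustrative only, assuming dict records where `record_status` is absent/`None` until a verifier sets it; the status strings are placeholders, not the actual warehouse values.

```python
def effective_download_status(record):
    """A verifier-assigned status takes precedence over the duplicate flag.

    A record flagged as a duplicate but never reviewed stays out of
    scheme downloads; an explicit verification decision overrides that.
    """
    status = record.get("record_status")
    if status is not None:
        return status            # e.g. "accepted" or "not accepted"
    if record.get("duplicate_of") is not None:
        return "duplicate"       # flagged on import, not yet reviewed
    return "not reviewed"
```

Under this rule the verifier in the question above would not need to null out duplicate_of: accepting the record is enough for it to appear in subsequent scheme downloads.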
I think there are a couple of additional steps to add to the change specification:

- Extension to spec 3: allow verifiers to apply a duplicate value to sets of records (i.e. bulk marking of duplicates, e.g. if a set of records from a particular recorder all needed to be marked as duplicates).
- New spec 7: add the occurrences.duplicate_of field to the standard download format.
Would be good to have views on this approach from @MNSmith and others.
The Asian Hornet Watch app seems to be something that regularly duplicates records, with one coming through under the name of the recorder and one under "Asian Hornet Watch". In these instances I use the "not accepted: incorrect" category for the duplicate and make a note of the other record in the comments text field. I also get the impression some recorders manage to press the "add a record" button more than once using the iRecord app, so we get duplicates. It would be useful to have these clearly flagged with a new status of "Duplicate" so that, if needed, changes can be made to either edit the record to the correct details or delete it (quite a few recorders do this with iffy records). Whatever is decided, a duplicate record cannot remain in the "not reviewed" category, because it has been looked at and found to be a duplicate, and because we need it cleared off the "awaiting review" verification grid. Matt
-----Original Message----- From: kitenetter To: BiologicalRecordsCentre/iRecord Sent: Wed, Oct 10, 2018 3:03 pm Subject: Re: [BiologicalRecordsCentre/iRecord] Provide additional verification flag for dealing with duplicate records (#200)
Martin has asked me to comment: This is becoming very complicated. If 'Duplicate' is an independent flag, as JvB proposes, the existing states for a record are doubled. Only records where this flag is false should be included in most queries, e.g. for Species Maps. I'll leave it to the SQL experts to decide whether this will slow the system significantly.
How easy will it be to determine which record the record being marked is a duplicate of? Will it be possible to mark a dupe without specifying the preferred record?
In the case reported by @MNSmith of double-pressing 'add a record' would it not be more appropriate to ask the recorder or importer to Delete the dupe?
From the BDS point of view, the greatest risk of dupes comes from recorders sending the same data to a LERC and directly to the BDS. In this case I don't think I'd wish to mark my BDS records as dupes but neither would I be permitted to mark those of the LERC as dupes...
Ok, thanks for the input. I think there is clearly a need for a duplicate_of field, which can either be an ID (if the original record is on the same warehouse), an LSID or URL (if the original record has an online representation that we can cross-refer to), a description of the record (e.g. the system name followed by the record ID, if the record can't be linked to online), or just "unknown" or something like that.
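Since a single varchar field would carry all four kinds of cross-reference, anything consuming it needs to tell them apart. A rough sketch of that classification, under the assumption that warehouse IDs are plain integers and online references start with a URL or LSID prefix (the category names are made up for the example):

```python
def classify_duplicate_ref(value):
    """Classify what kind of cross-reference a duplicate_of value holds."""
    if value is None:
        return "not a duplicate"
    if value.isdigit():
        return "warehouse record ID"
    if value.startswith(("http://", "https://", "urn:lsid:")):
        return "external online reference"
    if value.lower() == "unknown":
        return "original not identified"
    return "free-text description"
```

For example, `classify_duplicate_ref("1234567")` can be resolved to a link within the warehouse, whereas a free-text value like "BSBI database record 99" can only be displayed as-is.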
What I am less clear about is whether we need to have the capability to store a verification status independently of the duplication status, or whether the status "Duplicate" goes in our existing record_status, so a record could not be both accepted and duplicate, for example. The advantages of an independent duplication status are:
The disadvantages of an independent duplication status field are:
It's clear that a duplicate record, however we record it, would not appear in the view of Unconfirmed: Not reviewed records that a verifier sees.
Thinking this through, I am coming to the conclusion that a good way forward is to provide both a duplicate_of field and a separate record substatus which flags a record as "not accepted: duplicate". This way a record can be flagged as a duplicate and the verifier can still fine-tune whether the record is included in their dataset. So, the specification becomes:
@kitenetter does this cover everything?
Review and identify smaller steps that can be implemented in the short/medium term
First step: add a duplicate comment field?
Not yet found time to address this further.
Revisit use cases to define more clearly:
This remains a complex issue without a clear way forward. Closing for now.
Proposals:
Background: As more data flows into the data warehouses we are getting increasing numbers of records that are duplicated, sometimes within the warehouses themselves, but perhaps more problematically when records are added to iRecord that have already been sent to recording schemes via other routes in the past.
This issue is becoming more pressing for various reasons, including:
At the moment, the only way that verifiers can remove duplicate records from their grid is to mark them as "Not accepted". This is not ideal: