BiologicalRecordsCentre / iRecord

Repository to store and track enhancements, issues and tasks regarding the iRecord website.
http://irecord.org.uk

Provide additional verification flag for dealing with duplicate records #200

Closed: kitenetter closed this issue 2 months ago

kitenetter commented 7 years ago

Proposals:

  1. We add an additional term to the verification terms, namely “duplicate record”, as a second-level term under the first-level “Unconfirmed” category.
  2. We add an additional button to the set of options on the verifiers’ grid.

Background: As more data flows into the data warehouses we are getting increasing numbers of records that are duplicated, sometimes within the warehouses themselves, but perhaps more problematically when records are added to iRecord that have already been sent to recording schemes via other routes in the past.

This issue is becoming more pressing for various reasons, including:

At the moment, the only way that verifiers can remove duplicate records from their grid is to mark them as "Not accepted". This is not ideal:

japonicus commented 7 years ago

The BSBI's database has an equivalent 'duplicate' validation concept and I think this could be a useful addition to iRecord, but with some caveats:

Guidance notes should make clear that 'Duplicate' should only apply to records that are 'literal' duplicates of another copy of the same occurrence record. Particularly where verifiers are orientated toward grid-based atlas compilation there is a temptation to mark any repeat observations in the same square as duplicate (even though the records are not identical) - that needs to be discouraged.

For data from structured external databases such as RSPB I think the focus ought to be on avoiding import of subsets of duplicate data, rather than marking the records as duplicate afterwards.

In addition to formal duplicate annotations on records, the BSBI's system also has an optional filter that hides duplicates based on taxon-date-gridref compatibility. That approach can help to reduce the workload for verifiers without rejecting records that, for other purposes, may be distinct (or include different metadata). We also often use this filter for reporting purposes when counting 'number of records', because, for historic reasons, we have a high rate of duplicates (~20%) that would otherwise distort reported statistics.
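For illustration, a filter of that kind can be expressed as a single query. This is only a sketch: the table and column names below are hypothetical (not BSBI's or iRecord's actual schema), and keeping the lowest record ID is just one possible tie-break rule.

```sql
-- Illustrative only: hypothetical table/column names, PostgreSQL syntax.
-- Keep one record per taxon/date/gridref combination and hide the rest,
-- without altering or rejecting the underlying records.
SELECT DISTINCT ON (taxon_key, record_date, gridref)
       occurrence_id, taxon_key, record_date, gridref
FROM   occurrences
ORDER  BY taxon_key, record_date, gridref, occurrence_id;  -- lowest ID wins
```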

MNSmith commented 7 years ago

I just flag them as "Incorrect" and put a comment on the record along the lines of "Duplicate 1234567". If we are going to have essentially a "Duplicate" validation status (and a new button to go along with it) then we need the ability to cross-reference back to the original "correct" record.

kitenetter commented 7 years ago

@MNSmith I agree with your approach when a record is duplicated within iRecord itself, but to me it seems that there is a bigger issue of how to deal with records on iRecord that are duplicates of records already held in external recording scheme databases. An increasing number of people wish to upload records from spreadsheets to iRecord, and inevitably some of these uploaded records will have made their way to recording schemes via other routes in the past. So the issue is how we deal with records that are duplicates of offline records, and in that case it seems wrong to mark them as incorrect, for the reasons given above.

johnvanbreda commented 6 years ago

A few points from my perspective:

  1. I'm not sure this is a verification record status as such, because a duplicate could be a duplicate of an accepted record or a rejected record. So I think "duplicated" is a flag that stands on its own.
  2. If we had a field called "duplicate of", then this could store either an integer record ID (if the record is on iRecord), or a URL/external reference to the record (rather like an LSID), or perhaps just the name of the system holding the master copy of the record. The duplicated flag would effectively be whether this field has a value or not.
  3. This field could then be set by verifiers, or preferably set by adding the value to the import file so the record is known to be duplicated the moment it arrives in the system (and therefore does not appear for verifiers unless they deliberately change their filter).

This would mean the change specification would be (see the sketch below):

  1. Add a field, occurrences.duplicate_of (varchar), so that it is available for imports.
  2. Update any import guidance to explain the use of the field.
  3. Update the verification form to make it easy to set this field's value.
  4. Add this field to the reporting cache tables (cache_occurrences_functional).
  5. Update the verification form so the default filter is always "duplicate_of is null".
  6. Update the standard filter parameters so that the duplicate_of flag can be searched, with options to show only duplicates or to exclude duplicates.
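A minimal sketch of what points 1, 4 and 5 might look like on the warehouse, assuming PostgreSQL syntax and the table names used in this thread; how the cache table is actually populated, and how the form wires up the filter, are implementation details not shown here.

```sql
-- Point 1: add the field to occurrences so it is available for imports.
ALTER TABLE occurrences
  ADD COLUMN duplicate_of varchar;  -- record ID, URL/LSID, or name of the master system

-- Point 4: make it available to reports via the cache table.
ALTER TABLE cache_occurrences_functional
  ADD COLUMN duplicate_of varchar;

-- Point 5: default verification filter shows only non-duplicates.
SELECT o.*
FROM   cache_occurrences_functional o
WHERE  o.duplicate_of IS NULL;
```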

@kitenetter can I ask you to confirm whether you agree with the above, please?

kitenetter commented 5 years ago

@johnvanbreda those proposals look very good to me.

Question: if a recorder uploads a record and puts something in the occurrences.duplicate_of field, but a verifier subsequently decides that the record is in fact new to them, and thus wants to include it as part of a recording scheme dataset, would they need to edit the record to set occurrences.duplicate_of back to null? Or could they just add a normal verification flag to the record for it to be included in any subsequent scheme download?

And to generalise that, I think we are saying that any record that is flagged as a duplicate will likely remain in the "Not reviewed" category, but if a verifier specifically adds a verification status then that status will take precedence over the duplicate flag for reporting and download purposes.
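For illustration, that precedence could be expressed in a download query roughly as follows. This is only a sketch of the intended behaviour, not the actual download implementation, and it assumes the record_status codes discussed later in this thread ('V' = accepted, 'R' = not accepted) with the column names treated as placeholders.

```sql
-- Include a record in a scheme download if a verifier has accepted it, even
-- when duplicate_of is set; otherwise exclude anything flagged as a duplicate.
-- Status codes and column names are assumptions for illustration only.
SELECT *
FROM   cache_occurrences_functional
WHERE  record_status = 'V'
   OR  (duplicate_of IS NULL AND record_status <> 'R');
```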

I think there are a couple of additional steps to add to the change specification:

  Extension to spec 3: Allow verifiers to apply a duplicate value for sets of records (i.e. bulk marking of duplicates, e.g. if a set of records from a particular recorder all needed to be marked as duplicates).

  New spec 7: Add the occurrences.duplicate_of field to the standard download format.

Would be good to have views on this approach from @MNSmith and others.

MNSmith commented 5 years ago

The Asian Hornet Watch app seems to be something that regularly duplicates records, with one coming through under the name of the recorder and one under "Asian Hornet Watch". In these instances I use the "not accepted: incorrect" category for the duplicate and make a note of the other record in the comments text field. I also get the impression some recorders manage to press the "add a record" button more than once using the iRecord app, so we get duplicates. It would be useful to have these clearly flagged with a new status of "Duplicate", so that if needed changes can be made to either edit the record to the correct details or delete it (quite a few recorders do this with iffy records). Whatever is decided, a duplicate record cannot remain in the "not reviewed" category, because it has been looked at and found to be a duplicate, and because we need it cleared off the "awaiting review" verification grid. Matt

DavidHepper commented 5 years ago

Martin has asked me to comment: this is becoming very complicated. If 'Duplicate' is an independent flag, as JvB proposes, the number of possible states for a record is doubled. Only records where this flag is false should be included in most queries, e.g. for Species Maps. I'll leave it to the SQL experts to decide whether this will slow the system significantly.

How easy will it be to determine which record the record being marked is a duplicate of? Will it be possible to mark a dupe without specifying the preferred record?

In the case reported by @MNSmith of double-pressing 'add a record' would it not be more appropriate to ask the recorder or importer to Delete the dupe?

From the BDS point of view, the greatest risk of dupes comes from recorders sending the same data to a LERC and directly to the BDS. In this case I don't think I'd wish to mark my BDS records as dupes but neither would I be permitted to mark those of the LERC as dupes...

johnvanbreda commented 5 years ago

Ok, thanks for the input. I think there is clearly a need for a duplicate_of field, which can either be an ID (if the original record is on the same warehouse), an LSID or URL (if the original record has an online representation that we can cross refer to), a description of the record (e.g. the system name followed by the record ID, if the record can't be linked to online), or just "unknown" or something like that.

What I am less clear about is whether we need to have the capability to store a verification status independently of the duplication status, or whether the status "Duplicate" goes in our existing record_status, so a record could not be both accepted and duplicate for example. The advantages of an independent verification status are:

The disadvantages of an independent duplication status field are:

It's clear that a duplicate record, however we record it, would not appear in the view of Unconfirmed: Not reviewed records that a verifier sees.

Thinking this through, I am coming to the conclusion that a good way forward is to provide both a duplicate_of field and a separate record substatus which flags a record as "not accepted: duplicate". This way a record can be flagged as a duplicate and the verifier can still fine-tune whether the record is included in their dataset. So, the specification becomes (the default behaviour in points 4 and 6 is sketched after the list):

  1. Add a field, occurrences.duplicate_of (varchar). This can be set to an internal record ID, an external link, or "unknown" or "other" if the record is a duplicate but no link information can be provided. Any record that is not null in this field is considered a duplicate.
  2. Add a new record substatus of "not accepted: duplicate", which might have the code R6. This status indicates that the record has been excluded from the scheme dataset or Atlas export because of duplication. Therefore a record could have a link in duplicate_of indicating that it is duplicated, but also have record_status=V indicating that the scheme wants to include it in their dataset anyway.
  3. Ensure that imports can import into occurrences.duplicate_of. Update any import guidance to explain the use of the field.
  4. Provide an option on the verification form to set a record as a duplicate. This shows a popup with a "Duplicate of" input (defaulting to unknown if the verifier enters nothing), a "Record status" drop-down with the various options, defaulting to "Not accepted: duplicate". There is a note explaining that the verifier can set a different record status if they would like to override the default behaviour of excluding the duplicate record from the scheme dataset. There is also a comment field which gets logged as a comment (or the system will just log "Record set as duplicate of ..." if it is not filled in). Therefore if the verifier clicks the "Duplicate record" button and accepts the default state of the popup form, the following changes are made, though the verifier would be able to override this behaviour as required:
    • Record status set to R (not accepted) and sub-status set to 6 (duplicate)
    • occurrences.duplicate_of set to "unknown"
    • An occurrence comment added: "Record set as duplicate".
  5. On the verification form, the review grid and review ticklist options will be updated to allow setting multiple records to not accepted: duplicate. This would have to default the duplicate_of field to unknown.
  6. The verification form will automatically hide record_status=Not accepted: duplicate records unless the verifier overrides the filter.
  7. Duplicate_of will be shown in the record details pane on the verification page, as well as on the record details page for any record.
  8. The verification form will NOT automatically hide records where duplicate_of is not null, because if the verifier marks a record as duplicate_of something, but record_status = unconfirmed: not reviewed, then they are deliberately choosing to keep the record in their verification queue (I guess this is an unlikely scenario but just wanted to be clear).
  9. The Explore reports and other reports will default to exclude all not accepted records as they do now, so will automatically exclude those that are not accepted: duplicate, but will include records that have a different record status even if they are a duplicate of something.
  10. The Quality pane on the standard filter tool allows filtering for record status = "Not accepted: duplicate".
  11. We'll need a new icon on the Explore grid reports for duplicate records.
  12. Add the duplicate_of field to the reporting cache tables (cache_occurrences_functional).
  13. Update the standard filter parameters (available on verification and Explore reports) so that the duplicate_of flag can be searched, or filter to show only duplicates, or exclude duplicates.
  14. Separately the standard filter parameters will allow filtering for not accepted: duplicate as it does any other record status.
  15. Include the duplicate_of field in the standard download format.
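As referenced above, a minimal sketch of the default behaviour in points 4 and 6, assuming the status codes proposed in this specification (record_status 'R' with substatus 6 meaning "not accepted: duplicate") and PostgreSQL syntax; the record ID and the column names other than duplicate_of are illustrative only.

```sql
-- Point 4: the verifier clicks "Duplicate record" and accepts the popup defaults.
UPDATE occurrences
SET    record_status    = 'R',        -- not accepted
       record_substatus = 6,          -- duplicate
       duplicate_of     = 'unknown'   -- no link information provided
WHERE  id = 1234567;                  -- the record being marked (illustrative ID)

-- Point 6: the default verification grid hides only "not accepted: duplicate";
-- records with duplicate_of set but any other status stay visible (point 8).
-- IS DISTINCT FROM keeps rejected records with no substatus visible by default.
SELECT *
FROM   cache_occurrences_functional
WHERE  record_status IS DISTINCT FROM 'R'
   OR  record_substatus IS DISTINCT FROM 6;
```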

@kitenetter does this cover everything?

DavidRoy commented 5 years ago

Review and identify smaller steps that can be implemented in the short/medium term

DavidRoy commented 5 years ago

First step: add a duplicate comment field?

kitenetter commented 4 years ago

Not yet found time to address this further.

kitenetter commented 4 years ago

Revisit use cases to define more clearly:

kitenetter commented 2 months ago

This remains a complex issue without a clear way forward. Closing for now.