AtlasOfLivingAustralia / ala-dataquality

Data Quality analysis code

Possible improvements to duplicate detection process #6

Open AlexisTindall opened 8 years ago

AlexisTindall commented 8 years ago

The faunal collections would appreciate the ALA considering improvements to the duplicate detection process. It is quite common for a number of specimens of the same species to be collected by the same person in the same location on the same day. A good example of this is a group whale stranding, such as this one: http://biocache.ala.org.au/occurrences/23373789-cf57-4aba-befa-8f4353c8411d.

The process that detects inferred associations has correctly identified 15 related specimens there; they are genuinely associated, as they all stranded together and were collected and processed by the museum on the same day. However, the data quality flags mark these records as duplicates and state that each record has 'failed' one of its data quality tests.

Many of our organisations are able to assert with confidence that specimens with differing catalogNumbers are not duplicates. This is the case for TMAG. That rule can't be applied universally across the collections, though. For example, the Australian Museum will deliver data for the skull, skin, and alcohol-preserved parts of the same specimen under different catalogNumbers. (e.g. http://biocache.ala.org.au/occurrences/780d319e-e43d-4748-a04c-33268cee2605; http://biocache.ala.org.au/occurrences/5236557a-3f8a-4881-a636-02b2f1d7aa1d; http://biocache.ala.org.au/occurrences/77cf0068-ef30-4967-9134-ec904aa576ac)

Correspondence with Doug Palmer has suggested that it might be possible to apply rules on a per-dataset basis to contradict some of the existing duplicate detections, e.g. to run all the regular duplicate detection processes but then say 'for TMAG, if the occurrences have different catalogNumbers, they are not duplicates, despite what the detection process has suggested'.
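To illustrate the idea, here is a minimal sketch of a post-processing step that applies per-dataset override rules after the regular duplicate detection has run. This is purely illustrative: the class, field names, and `DATASET_RULES` / `apply_overrides` helpers are hypothetical and not part of any existing ALA code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical representation of a processed occurrence record.
@dataclass
class Occurrence:
    uuid: str
    data_resource: str            # e.g. "TMAG"
    catalog_number: str
    duplicate_of: Optional[str]   # set by the regular duplicate detection pass

# Per-dataset rules that can veto a duplicate assertion.
# Each rule returns True if the pair should NOT be treated as duplicates.
# For TMAG: records with differing catalogNumbers are never duplicates.
DATASET_RULES: Dict[str, Callable[[Occurrence, Occurrence], bool]] = {
    "TMAG": lambda rec, other: rec.catalog_number != other.catalog_number,
}

def apply_overrides(records: Dict[str, Occurrence]) -> None:
    """Clear duplicate flags that a dataset-specific rule contradicts."""
    for rec in records.values():
        veto = DATASET_RULES.get(rec.data_resource)
        if veto is None or rec.duplicate_of is None:
            continue
        other = records.get(rec.duplicate_of)
        if other is not None and veto(rec, other):
            rec.duplicate_of = None  # keep the inferred association, drop the duplicate flag
```

The point of keeping the overrides separate from the main detection pass is that the existing process (and the inferred associations it produces) stays untouched; only the final duplicate flag is adjusted per data provider.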

This is one suggestion and might not be the only solution to the problem; we're happy to help contribute to other ideas for improving duplicate detection.