NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

NHMA - Repeated taxon numbers #407

Closed bhsi-snm closed 11 months ago

bhsi-snm commented 11 months ago

Template for issues/tickets in DigiApp

What is the issue?

There are repeating identifiers in the processed version of the "Revised Checklist of the Lepidoptera of Denmark" (Aarhus_Dk_lepidoptera2013.xlsx).

This matters because a subset of collections, such as NHMA, wish to use serial-number-based taxonomy. It is being implemented so that users logged in under these collections get an extra field to use when digitizing.

Ole Karsholt was interviewed about the checklist and told us that the trailing zero in the row identifier was added by mistake. During checklist processing, the last digit of every record was therefore dropped. This turned out to be fatal, since there are 18 records with a non-zero ending digit. Importing the NHMA taxonomy into a DB and running the following query displays the issue: SELECT * FROM aarhus WHERE sortnr % 10 != 0;
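To illustrate why dropping the last digit causes repeats, here is a minimal sketch using Python's built-in sqlite3 module. Only the table name aarhus and the column sortnr come from the query above; the name column and the sample rows are made up for illustration, and the real import may be structured differently.

```python
import sqlite3

# Hypothetical sample rows: (sortnr, name). These values are illustrative only.
rows = [
    (10, "Taxon A"),
    (20, "Taxon B"),
    (21, "Taxon C"),  # non-zero last digit: collides with 20 once truncated
    (30, "Taxon D"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aarhus (sortnr INTEGER, name TEXT)")
conn.executemany("INSERT INTO aarhus VALUES (?, ?)", rows)

# The check from the issue: records whose identifier does not end in zero.
nonzero = conn.execute(
    "SELECT sortnr, name FROM aarhus WHERE sortnr % 10 != 0 ORDER BY sortnr"
).fetchall()

# Identifiers that become identical once the last digit is dropped
# (integer division by 10 on INTEGER values).
collisions = conn.execute(
    "SELECT sortnr / 10 AS truncated, COUNT(*) AS n "
    "FROM aarhus GROUP BY sortnr / 10 HAVING COUNT(*) > 1"
).fetchall()

print(nonzero)
print(collisions)
```

In this toy data, 20 and 21 both truncate to 2, which is the kind of repeated identifier reported here.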

As Fedor mentioned below:

Reduced the number of re-used taxon numbers to 130, but there are still some cases where a single taxon number actually was assigned to multiple taxa:

[image not preserved]

Why is it needed/relevant?

We need to have the "alternative collection taxonomy" match what is indeed used in the collection in question.

Estimate level of effort required.

Difficult

What is the expected acceptable result?

An app taxonomy that matches the one used at the collection.

We have to approach NHMA and ask them whether the identifiers ("sortnr") in the published taxonomy "Revideret fortegnelse over Danmarks Sommerfugle. Karsholt, Ole; Stadel Nielsen, Per" [1] actually match what they are using in the collection. We need an authoritative answer to this.
We relied on the spreadsheet that was passed around in late May/early June, and it turned out to be a poor foundation for an alternative taxonomy for NHMA. Assuming NHMA rely on the publication, we need to recreate the alternative taxonomy using pages 17 to 64 of "Revideret fortegnelse over Danmarks Sommerfugle" as the basis for a translation whose identifiers can be used in Specify.

We can employ the Python PyPDF2 package to read the taxonomy (pages 17 to 64) and transform that text data into, say, a pandas DataFrame (table).
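A rough sketch of that approach follows. It assumes PyPDF2 2.x or later (PdfReader and extract_text), that printed page n corresponds to PDF page index n - 1, and that each taxon sits on its own line as an integer identifier followed by the name. None of these assumptions has been checked against the actual publication, and the function names, column names, and sample names are hypothetical.

```python
import re

# Assumed line shape: an integer identifier, whitespace, then the taxon name.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")


def extract_page_text(pdf_path, first_page=17, last_page=64):
    """Return the text of each page in the range, one string per page.

    Assumes printed page n is PDF page index n - 1; adjust if the PDF
    has front matter that shifts the numbering.
    """
    from PyPDF2 import PdfReader  # imported lazily; requires PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    return [
        reader.pages[i].extract_text() or ""
        for i in range(first_page - 1, last_page)
    ]


def parse_taxon_lines(text):
    """Collect (sortnr, fullname) records from lines matching LINE_RE."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            records.append(
                {"sortnr": int(match.group(1)), "fullname": match.group(2)}
            )
    return records


def to_dataframe(records):
    """Turn the parsed records into a pandas DataFrame."""
    import pandas as pd  # imported lazily so parsing works without pandas

    return pd.DataFrame(records, columns=["sortnr", "fullname"])


# Made-up text standing in for one extracted page.
sample = "Some heading\n 10 Genus alpha\n20 Genus beta \n"
parsed = parse_taxon_lines(sample)
print(parsed)
```

Text extracted from a PDF rarely comes out this clean, so the regular expression would most likely need tuning against the real pages, and the result should be spot-checked against the printed checklist.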

[1] https://snm.ku.dk/ansatte/ansatte/?pure=en%2Fpublications%2Frevideret-fortegnelse-over-danmarks-sommerfugle(b7e34573-98fb-4618-8be8-25753dfb96cf).html

FedorSteeman commented 11 months ago

In the case of taxon number 16, these appear to be synonyms: https://www.gbif.org/species/9578719

FedorSteeman commented 11 months ago

Synonyms do not in themselves constitute an actual issue, since these would be sorted in Specify or in any case GBIF, and both accepted names and synonyms should be available. However, if the app consistently chooses the synonym to populate the taxon fields instead of the accepted name, that would need to be mitigated somehow.

FedorSteeman commented 11 months ago

After conferring with @bhsi-snm and then @PipBrewer, the easiest solution is to go back to the question of why the taxon numbers are repeated in the first place. This looks like a mismatch during import.

FedorSteeman commented 11 months ago

Reduced the number of re-used taxon numbers to 130, but there are still some cases where a single taxon number actually was assigned to multiple taxa:

[image not preserved]

A solution could be to integrate the secondary (serial?) number into the taxon number and advise digitizers to be mindful of it and use that one. However, it's unclear whether this would work in practice.
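As a sketch of how the remaining cases could be listed, the snippet below groups taxon-number/name pairs and reports numbers attached to more than one distinct taxon. The sample names are made up, and the composite_key helper is only one hypothetical way of folding a secondary number into the taxon number. The actual format of that secondary number is not known here.

```python
from collections import defaultdict


def find_reused_numbers(pairs):
    """Map each taxon number used for more than one distinct taxon to its names."""
    names_by_number = defaultdict(set)
    for number, name in pairs:
        names_by_number[number].add(name)
    return {
        number: sorted(names)
        for number, names in names_by_number.items()
        if len(names) > 1
    }


def composite_key(number, serial):
    """Hypothetical combined identifier, e.g. taxon number 16 + serial 2 -> '16.2'."""
    return f"{number}.{serial}"


# Made-up sample data: number 16 is shared by two taxa, while 17 is merely
# listed twice for the same taxon and is therefore not reported.
pairs = [
    (16, "Genus alpha"),
    (16, "Genus beta"),
    (17, "Genus gamma"),
    (17, "Genus gamma"),
]
reused = find_reused_numbers(pairs)
print(reused)
print(composite_key(16, 2))
```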

FedorSteeman commented 11 months ago

Note: The fix is invalid, because some of the species were assigned to new genera, so the full names don't match. I will need to redo this.

FedorSteeman commented 11 months ago

Note: After close investigation, the previous fix was actually correct in removing the taxon number from certain species (like Acalypta gracilis). That was not because the species were assigned to new genera, but because the file of processed taxon-sortnr pairs I was supplied with had incorrectly assigned these taxon numbers to the species in question.

However, a different problem was discovered: in a limited number of cases the genus name was not included in the full name, which led the SQL script (MassDigitizer/sql/specify/UpdateSetTaxonNumber.sql) to fail to add those taxon numbers to the taxa in question, since it relies on the full name (binomial). That file is now fixed by rebuilding the binomial full name in those cases.

The script MassDigitizer/sql/ExtractTaxonNames.sql could then be run again, resulting in a now-correct MassDigitizer/sql/editions/NHMA/entomology/Species-Batch1.sql
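For illustration, the full-name mismatch described above can be sketched as follows. This is not the logic of UpdateSetTaxonNumber.sql itself. It only shows why matching on the binomial fails when the genus is missing and how rebuilding the full name restores the match. The field layout and all names are hypothetical.

```python
def repair_fullname(genus, fullname):
    """Prefix the genus when the full name does not already start with it."""
    parts = fullname.split()
    if parts and parts[0] == genus:
        return fullname
    return f"{genus} {fullname}".strip()


# Made-up taxon table keyed by binomial full name.
taxa = {"Genus alpha": None, "Genus beta": None}

# Made-up taxon-number rows: (genus, fullname, taxon number).
# The second row lacks the genus in its full name.
rows = [
    ("Genus", "Genus alpha", 10),
    ("Genus", "beta", 20),
]

# Matching on the raw full name silently skips the broken row.
matched_before = [number for _, fullname, number in rows if fullname in taxa]

# After rebuilding the binomial, every row finds its taxon.
for genus, fullname, number in rows:
    key = repair_fullname(genus, fullname)
    if key in taxa:
        taxa[key] = number

print(matched_before)
print(taxa)
```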