Early Digi app exports/imports without author name: backfill

jlegind commented 6 months ago

What is the issue

The app will be updated with a vascular plant taxon spine that includes author names. However, the specimen records already imported did not have an author associated with the taxon name. We have to retroactively add the author name to records where that name is available in the specimen image. There are approximately 15,000 vascular plant records in Specify as of 19-12-2023, covering the period from May 2023 to 19 December 2023. These records will eventually all have an image associated.

Description

The imported specimen records ought to have the author name along with the taxon name if available. The author is often written after the taxon name on the label and might thus be retrieved from the image associated with the specimen. Unless we have a pipeline that can read the author name, this will require a lot of manual work. The images can be identified by the associated barcode, yet that has to be read via a pipeline. When that pipeline goes operational we can go from the specimen record straight to the image and look for the author name so that can be added. Waiting for the pipeline to be ready is the only feasible way forward.

Why is it needed/relevant ?

There is the issue of homonyms and disambiguation which author names can help solve.

Estimate level of effort required.

Hard

How to approach it?

It would be preferable for the pipeline to identify the barcodes for the images and then rename those images with the barcode. This enables us to go specimen by specimen. If a pipeline can read the authors - or at least some of the authors - that would be tremendously helpful. Reading handwritten author names and adding them to the taxonomic name is yet another issue (separate though).

FedorSteeman commented 5 months ago

This will need to be fixed eventually, but not necessarily right now as part of the sprint towards v1.2. Will draft these and other issues with @bhsi-snm

jlegind commented 5 months ago

Suggested Solution:

A solution could be a spreadsheet with a dropdown for each imported name that has more than one authored name. The dropdown list contains the authors applicable to that name. The process would be:

1. Query for the records that were already imported into Vascular plants in Specify.
2. Match those names to the new taxonomy where author names exist.
3. Separate the names that occur more than once with different author names into a file.
4. Create a spreadsheet with a dropdown list for each name containing the alternatives for that name ("Aa filamentosa M.L.Ortiz, 1937", or "Aa filamentosa Mansf.").

The solution will be developed in Python using the xlsxwriter package.
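The steps above could be sketched roughly as follows. This is a minimal illustration and not the actual utility: the function names, the sheet name and the sample rows are made up for the example, and only the xlsxwriter calls (`Workbook`, `add_worksheet`, `write`, `data_validation`, `close`) are the package's own.

```python
from collections import defaultdict


def group_authored_names(rows):
    """Map each bare taxon name to the sorted distinct authored names seen for it.

    `rows` is an iterable of (name, authored_name) pairs (hypothetical input shape).
    Only names with more than one authored alternative are returned.
    """
    options = defaultdict(set)
    for name, authored_name in rows:
        options[name].add(authored_name)
    return {name: sorted(found) for name, found in options.items() if len(found) > 1}


def write_dropdown_sheet(path, ambiguous):
    """Write one row per ambiguous name with a dropdown of its authored names."""
    # Third-party package; imported here so the grouping helper runs without it.
    import xlsxwriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet("authors")  # sheet name is an assumption
    worksheet.write(0, 0, "name")
    worksheet.write(0, 1, "authored name")
    for row, (name, choices) in enumerate(sorted(ambiguous.items()), start=1):
        worksheet.write(row, 0, name)
        # Excel limits an inline list source to 255 characters in total.
        worksheet.data_validation(
            row, 1, row, 1, {"validate": "list", "source": choices}
        )
    workbook.close()


# Illustrative sample rows, not taken from Specify.
rows = [
    ("Aa filamentosa", "Aa filamentosa M.L.Ortiz, 1937"),
    ("Aa filamentosa", "Aa filamentosa Mansf."),
    ("Abies alba", "Abies alba Mill."),
]
ambiguous = group_authored_names(rows)
```

Calling `write_dropdown_sheet("authorDropdown.xlsx", ambiguous)` would then produce a sheet in which only "Aa filamentosa" gets a dropdown, since "Abies alba" has a single authored name here.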

FedorSteeman commented 5 months ago

Who has the ball on this issue?

jlegind commented 5 months ago

There is a utility ready now. It is an Excel sheet with names, storage location and dropdowns for each name. I am submitting this to Bhupjit and Pip for review.

jlegind commented 5 months ago

SUPERSEDED BY https://github.com/NHMDenmark/Mass-Digitizer/blob/main/Author_backfill/read_me.md

Workflow on producing the spreadsheet:

Create a csv download from Specify UI. Mapping is as follows:

[image: author_util_mapping]

Make sure that 'project' is set to DaSSCo.

Import this csv as a table into the SQLite database that the Mass Digitization App made and name it 'binomial'. The code for the utility is specifically written for this scenario.
Create a new table 'binomial_id' with this SQL statement:
CREATE TABLE binomial_id AS SELECT DISTINCT tw.genus, t.name, tw.box, tw.binomial, t.author, tw.taxonid FROM taxonauthor_storage_id tw JOIN taxonname t ON tw.binomial = t.name WHERE length(t.author) >= 1;
Now the basic elements are in place.
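To see what that statement does, here is a small sketch using Python's built-in `sqlite3` module on an in-memory database. The table and column names come from the SQL above; the column types and the sample rows are assumptions made only for this illustration, and the real tables in the app's database will have more columns.

```python
import sqlite3

# In-memory stand-in for the SQLite file made by the Mass Digitization App.
conn = sqlite3.connect(":memory:")

# Toy versions of the two source tables (column types and rows are assumed).
conn.executescript(
    """
    CREATE TABLE taxonauthor_storage_id (
        genus TEXT, box TEXT, binomial TEXT, taxonid INTEGER
    );
    CREATE TABLE taxonname (name TEXT, author TEXT);
    INSERT INTO taxonauthor_storage_id VALUES
        ('Aa', 'Box 1', 'Aa filamentosa', 101),
        ('Abies', 'Box 2', 'Abies alba', 102);
    INSERT INTO taxonname VALUES
        ('Aa filamentosa', 'M.L.Ortiz, 1937'),
        ('Aa filamentosa', 'Mansf.'),
        ('Abies alba', '');
    """
)

# The statement from the workflow: keep only matches that carry an author.
conn.execute(
    """
    CREATE TABLE binomial_id AS
    SELECT DISTINCT tw.genus, t.name, tw.box, tw.binomial, t.author, tw.taxonid
    FROM taxonauthor_storage_id tw
    JOIN taxonname t ON tw.binomial = t.name
    WHERE length(t.author) >= 1
    """
)

result = conn.execute(
    "SELECT name, author, taxonid FROM binomial_id ORDER BY author"
).fetchall()
```

With these toy rows, `binomial_id` ends up with one row per candidate author for "Aa filamentosa", while "Abies alba" is dropped because its author is empty.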

FedorSteeman commented 5 months ago

I discussed with @jlegind that we'd also need to include the taxon table primary key, i.e. the taxon ID, so the results can easily be used to update the taxon records with the assigned author name. In doing so, we need to be wary of cases where the same taxon record is assigned different author names depending on the specimen in question.
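One way to catch such cases before updating anything could look like the sketch below. It assumes the filled-in sheet has been read into (taxon ID, author) pairs; the function name, that input shape and the sample values are all hypothetical.

```python
from collections import defaultdict


def find_author_conflicts(assignments):
    """Return {taxon_id: sorted authors} for taxon IDs given more than one author.

    `assignments` is an iterable of (taxon_id, author) pairs, one per specimen
    row (assumed input shape). Blank authors are ignored.
    """
    authors_by_taxon = defaultdict(set)
    for taxon_id, author in assignments:
        cleaned = author.strip()  # ignore stray whitespace from manual entry
        if cleaned:
            authors_by_taxon[taxon_id].add(cleaned)
    return {
        taxon_id: sorted(authors)
        for taxon_id, authors in authors_by_taxon.items()
        if len(authors) > 1
    }


# Illustrative rows only: taxon 101 was given two different authors.
assignments = [
    (101, "Mansf."),
    (101, "M.L.Ortiz, 1937"),
    (102, "Mill."),
    (102, "Mill. "),
    (103, ""),
]
conflicts = find_author_conflicts(assignments)
```

Any taxon ID that shows up in `conflicts` would need a manual decision, or a split into separate taxon records, before the author is written back.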

jlegind commented 4 months ago

Taxon identifiers have been added to the spreadsheet utility: authorDropdown.xlsx

@PipBrewer @bhsi-snm Feel free to test the spreadsheet.

jlegind commented 4 months ago

How about we find time to present this to the digitizers?

chelseagraham commented 4 months ago

The digitizers have begun filling in this sheet. The corresponding GitHub ticket for the Herbarium Sheet Digitization project is https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/74 :)

RebekkaML commented 2 months ago

The author name sheet has been filled and https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/74 has been closed. There are, however, some remaining issues that need to be fixed before this can be imported into Specify again.

Those issues are:
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/93
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/90
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/92
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/91
