Open jlegind opened 6 months ago
This will need to be fixed eventually, but not necessarily right now as part of the sprint towards v1.2. Will draft these and other issues with @bhsi-snm.
Suggested Solution:
One solution is a spreadsheet with a dropdown for each imported name that has more than one authored name. The dropdown lists the authors applicable to that name. The process would be:
1. Query for the records that were already imported into Vascular plants in Specify.
2. Match those names to the new taxonomy where author names exist.
3. Separate the names that occur more than once with different author names into a file.
4. Create a spreadsheet with a dropdown for each name, containing the alternatives for that name (for example "Aa filamentosa M.L.Ortiz, 1937" or "Aa filamentosa Mansf.").

The solution will be developed in Python using the xlsxwriter package.
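The steps above could be sketched roughly as follows. This is a minimal illustration, not the utility that was eventually built: the function names, sheet names, and sample rows are assumptions. The alternatives are written to a helper sheet and referenced as a range, because an inline validation list is comma-separated and would split an entry such as "M.L.Ortiz, 1937". The xlsxwriter import sits inside the writer function so the grouping step runs without the package installed.

```python
from collections import defaultdict


def names_needing_review(rows):
    """Return {taxon name: sorted author alternatives} for ambiguous names.

    `rows` is an iterable of (name, author) pairs. A name is ambiguous when it
    occurs with more than one distinct, non-empty author string.
    """
    authors_by_name = defaultdict(set)
    for name, author in rows:
        if author and author.strip():
            authors_by_name[name].add(author.strip())
    return {
        name: sorted(authors)
        for name, authors in authors_by_name.items()
        if len(authors) > 1
    }


def write_dropdown_sheet(path, alternatives):
    """Write one row per ambiguous name with a dropdown of its authored names."""
    # Third-party package; imported here so the grouping step has no dependency.
    import xlsxwriter
    from xlsxwriter.utility import xl_range_abs

    workbook = xlsxwriter.Workbook(path)
    review = workbook.add_worksheet("Review")    # hypothetical sheet name
    options = workbook.add_worksheet("Options")  # hypothetical helper sheet
    review.write_row(0, 0, ["name", "chosen authored name"])

    for i, (name, authors) in enumerate(sorted(alternatives.items())):
        row = i + 1  # row 0 of the review sheet holds the header
        review.write(row, 0, name)
        # One helper row per name: each cell is one full authored name.
        options.write_row(i, 0, [f"{name} {author}" for author in authors])
        source = f"=Options!{xl_range_abs(i, 0, i, len(authors) - 1)}"
        review.data_validation(row, 1, row, 1, {"validate": "list", "source": source})

    workbook.close()


# Made-up sample rows; only the first name has two different authors.
sample_rows = [
    ("Aa filamentosa", "M.L.Ortiz, 1937"),
    ("Aa filamentosa", "Mansf."),
    ("Abies alba", "Mill."),
    ("Abies alba", "Mill."),
]
ambiguous = names_needing_review(sample_rows)
print(ambiguous)
```

Calling `write_dropdown_sheet("authorDropdown.xlsx", ambiguous)` would then produce the workbook, assuming xlsxwriter is installed.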
Who has the ball on this issue?
There is a utility ready now. It is an Excel sheet with names, storage locations, and a dropdown for each name. I am submitting this to Bhupjit and Pip for review.
SUPERSEDED BY https://github.com/NHMDenmark/Mass-Digitizer/blob/main/Author_backfill/read_me.md
Workflow on producing the spreadsheet:
Make sure that 'project' is set to DaSSCo.
CREATE TABLE binomial_id AS
SELECT DISTINCT tw.genus, t.name, tw.box, tw.binomial, t.author, tw.taxonid
FROM taxonauthor_storage_id tw
JOIN taxonname t ON tw.binomial = t.name
WHERE length(t.author) >= 1;
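To see what this statement yields, here is a small sketch that runs it against an in-memory SQLite database. The choice of SQLite, the column types, and the sample rows are assumptions made for illustration; only the table and column names come from the query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical minimal schema and made-up rows for the two source tables.
conn.executescript(
    """
    CREATE TABLE taxonauthor_storage_id (genus TEXT, box TEXT, binomial TEXT, taxonid INTEGER);
    CREATE TABLE taxonname (name TEXT, author TEXT);

    INSERT INTO taxonauthor_storage_id VALUES
        ('Aa', 'Box 1', 'Aa filamentosa', 101),
        ('Abies', 'Box 2', 'Abies alba', 202);

    INSERT INTO taxonname VALUES
        ('Aa filamentosa', 'M.L.Ortiz, 1937'),
        ('Aa filamentosa', 'Mansf.'),
        ('Abies alba', '');
    """
)

# The statement from the workflow, unchanged.
conn.execute(
    """
    CREATE TABLE binomial_id AS
    SELECT DISTINCT tw.genus, t.name, tw.box, tw.binomial, t.author, tw.taxonid
    FROM taxonauthor_storage_id tw
    JOIN taxonname t ON tw.binomial = t.name
    WHERE length(t.author) >= 1;
    """
)

# Each imported name appears once per candidate author; names whose only
# author is empty or NULL are dropped by the length() filter.
candidates = conn.execute(
    "SELECT binomial, author, taxonid FROM binomial_id ORDER BY author"
).fetchall()
for row in candidates:
    print(row)
```

In this sample, "Aa filamentosa" produces two rows (one per author) while "Abies alba" produces none, which is the shape the dropdown sheet needs.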
Now the basic elements are in place. I discussed with @jlegind that we'd also need to include the taxon table primary key, i.e. the taxon ID, so the results can easily be used to update the taxon records with the assigned author name. In doing so, we may need to be wary of cases where the same taxon record is assigned different author names depending on the specimen in question.
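A check for that last case could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and the input is taken to be (taxon ID, chosen author) pairs read from the filled-in sheet, one pair per specimen row.

```python
from collections import defaultdict


def split_assignments(assignments):
    """Split (taxon_id, author) pairs into safe updates and conflicts.

    Returns (updates, conflicts):
      updates   {taxon_id: author} where every specimen agreed on one author
      conflicts {taxon_id: sorted authors} where specimens disagreed
    Rows with a blank author are ignored.
    """
    chosen = defaultdict(set)
    for taxon_id, author in assignments:
        if author and author.strip():
            chosen[taxon_id].add(author.strip())

    updates = {tid: next(iter(a)) for tid, a in chosen.items() if len(a) == 1}
    conflicts = {tid: sorted(a) for tid, a in chosen.items() if len(a) > 1}
    return updates, conflicts


# Made-up example: taxon 101 was given two different authors.
filled_rows = [
    (101, "Mansf."),
    (101, "M.L.Ortiz, 1937"),
    (202, "Mill."),
    (202, "Mill."),
    (303, ""),
]
updates, conflicts = split_assignments(filled_rows)
print(updates)
print(conflicts)
```

Only the entries in `updates` could be written back to the taxon table directly; the entries in `conflicts` would need a decision first, for example a separate taxon record per authored name.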
Taxon identifiers have been added to the spreadsheet utility. @PipBrewer @bhsi-snm, feel free to test the spreadsheet: authorDropdown.xlsx
How about we find time to present this to the digitizers?
The digitizers have begun filling in this sheet. The corresponding GitHub ticket for the Herbarium Sheet Digitization project is https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/74 :)
The author name sheet has been filled and https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/74 has been closed. There are, however, some remaining issues that need to be fixed before this can be imported into Specify again.
Those issues are:
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/93
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/90
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/92
https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/91
What is the issue
The app will be updated with a vascular plants taxon spine that includes author names. However, the already imported specimen records did not have an author associated with the taxon name. We have to retroactively add the author name to records where that name is available in the specimen image. There are approximately 15,000 vascular plant records in Specify as of 19-12-2023, imported between May 2023 and 19 December 2023. These records will eventually all have an image associated.
Description
The imported specimen records ought to have the author name along with the taxon name if available. The author is often written after the taxon name on the label and might thus be retrieved from the image associated with the specimen. Unless we have a pipeline that can read the author name, this will require a lot of manual work. The images can be identified by the associated barcode, but that barcode has to be read via a pipeline. When that pipeline becomes operational, we can go from the specimen record straight to the image and look for the author name so that it can be added. Waiting for the pipeline to be ready is the only feasible way forward.
Why is it needed/relevant?
Author names help to disambiguate homonyms, i.e. identical taxon names published by different authors.
Estimate level of effort required.
Hard
How to approach it?
It would be preferable for the pipeline to identify the barcodes in the images and then rename those images with the barcode. This would enable us to go specimen by specimen. If a pipeline could read the authors, or at least some of them, that would be tremendously helpful. Reading handwritten author names and adding them to the taxonomic name is a separate issue.
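The renaming step could be as simple as the sketch below. It assumes the barcode reading has already happened elsewhere and is handed over as a mapping from image path to barcode; the function name, file names, and barcode value are made up for illustration.

```python
import tempfile
from pathlib import Path


def rename_by_barcode(barcode_by_image):
    """Rename each image to '<barcode><original suffix>' in its own folder.

    Returns {old path: new path} for the files that were renamed. Images with
    no barcode, or whose target name already exists, are left untouched.
    """
    renamed = {}
    for image, barcode in barcode_by_image.items():
        image = Path(image)
        if not barcode:
            continue  # barcode could not be read; leave the file for manual review
        target = image.with_name(f"{barcode}{image.suffix}")
        if target.exists():
            continue  # never overwrite another specimen's image
        image.rename(target)
        renamed[image] = target
    return renamed


# Usage with throwaway files; the second image has no readable barcode.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "IMG_0001.tif"
    second = Path(folder) / "IMG_0002.tif"
    first.touch()
    second.touch()
    result = rename_by_barcode({first: "C0001234", second: None})
    renamed_names = sorted(p.name for p in Path(folder).iterdir())

print(renamed_names)
```

Skipping unreadable or clashing files rather than failing keeps a batch run going and leaves a clear remainder for manual handling.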