NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

Synchronization issues in the botany taxon spine #471

Open jlegind opened 8 months ago

jlegind commented 8 months ago

What is the issue?

There is a risk that novel names end up duplicated in the Specify taxonomy after they are imported through the Specify workbench. We also need to check the names that were previously imported into the botany spine.

Estimate level of effort required.

Hard

What is the expected acceptable result.

That no duplicates are added to the Taxonomy. Existing names are differentiated by author where appropriate.

Give a clear approach/potential solution on how to resolve it.

During the post-processing step, new names are identified. They show up in OpenRefine under the 'remarks' column, following this pattern: | Verbatim_taxon: [the taxon name itself].
The post-processed file is then imported.
After the import, the novel names can be explored through the taxon tree, and duplicates can be merged according to this guide: N:\SCI-SNM-DigitalCollections\DaSSCo\Specify taxonomy cleaning\Cleaning taxonomy using the taxon tree in Specify 7.docx
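The flagged names could also be collected programmatically before the import. The following is a minimal sketch, assuming the 'remarks' value is free text that may contain the `| Verbatim_taxon: ...` marker described above; the function name and the sample rows are hypothetical and not taken from the Mass-Digitizer code.

```python
import re

# Assumption: the marker looks like "| Verbatim_taxon: <name>" and the name
# runs until the next "|" or the end of the remarks value.
VERBATIM_TAXON = re.compile(r"\|\s*Verbatim_taxon:\s*([^|]+)")


def extract_novel_name(remarks):
    """Return the taxon name flagged in a remarks value, or None if absent."""
    if not remarks:
        return None
    match = VERBATIM_TAXON.search(remarks)
    return match.group(1).strip() if match else None


# Hypothetical rows standing in for the post-processed export.
rows = [
    {"remarks": "Label damaged | Verbatim_taxon: Aa matthewsii (Rchb.f.) Schltr."},
    {"remarks": "No issues"},
]

# Keep only the rows that actually carry a flagged name.
novel_names = [
    name for name in (extract_novel_name(row["remarks"]) for row in rows) if name
]
```

The resulting list gives the names to look up in the taxon tree after the import.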

Updating authored names post hoc

A solution could be a spreadsheet with a drop-down list for each imported name that has more than one authored name. The drop-down list contains the authors applicable to that name. The process would be:

  1. Query for the records that were already imported into Vascular plants in Specify.
  2. Match those names to the new taxonomy where author names exist.
  3. Separate the names that occur more than once with different author names into a file.
  4. Create a spreadsheet with a drop-down list for each name, containing the alternatives for that name ("Aa filamentosa M.L.Ortiz, 1937" or "Aa filamentosa Mansf.").
  5. Develop the solution in Python using the xlsxwriter package.
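Steps 3 to 5 could look roughly like the sketch below. It assumes the authored names are available as plain strings and that the first two words form the binomial, which will not hold for infraspecific names or hybrids. All function names and the output layout are hypothetical; only `Workbook`, `add_worksheet`, `write_row`, `write` and `data_validation` are xlsxwriter calls.

```python
from collections import defaultdict


def split_name_author(full_name):
    """Naive split: the first two words are the binomial, the rest is the author."""
    parts = full_name.split(" ", 2)
    binomial = " ".join(parts[:2])
    author = parts[2] if len(parts) > 2 else ""
    return binomial, author


def ambiguous_names(full_names):
    """Map each binomial to its authored variants, keeping only binomials
    that have more than one variant."""
    variants = defaultdict(set)
    for full_name in full_names:
        binomial, _ = split_name_author(full_name)
        variants[binomial].add(full_name)
    return {name: sorted(opts) for name, opts in variants.items() if len(opts) > 1}


def write_dropdown_sheet(choices, path):
    """Write one row per ambiguous name with a drop-down list in column B."""
    import xlsxwriter  # third-party package, imported only when a file is written

    workbook = xlsxwriter.Workbook(path)
    sheet = workbook.add_worksheet("authors")
    sheet.write_row(0, 0, ["name", "authored name"])
    for row, (name, options) in enumerate(sorted(choices.items()), start=1):
        sheet.write(row, 0, name)
        # A list validation turns the cell into a drop-down of the alternatives.
        sheet.data_validation(row, 1, row, 1, {"validate": "list", "source": options})
    workbook.close()


# Hypothetical sample standing in for the matched taxonomy names.
taxonomy = [
    "Aa filamentosa M.L.Ortiz, 1937",
    "Aa filamentosa Mansf.",
    "Aa matthewsii (Rchb.f.) Schltr.",
]
choices = ambiguous_names(taxonomy)
```

Calling `write_dropdown_sheet(choices, "author_choices.xlsx")` (the file name is only an example) would then produce the spreadsheet. One caveat to verify: Excel limits the total length of an inline validation list, so names with many long alternatives may need the options written to a cell range instead.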

What could be the challenges?

If novel names are very close to an existing name like:

Abacopteris menisciicarpa (Blume) Holttum

Abacopteris menisciicarpos (Blume) Holttum

or

Aa mathewsii (Rchb.f.) Schltr.

Aa matthewsii (Rchb.f.) Schltr.

What should the process be in these cases?

What tests are required?

A Levenshtein distance test could be created to see if the name is very close to another. Unfortunately SQLite does not support this feature out of the box, but there are extensions that could be employed.
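As an alternative to a compiled extension, Python's `sqlite3` module can register a user-defined SQL function, so the distance could be computed in Python and called from a query. The sketch below is one possible approach, not a tested part of the pipeline; the `taxon` table, the `fullname` column and the threshold of 2 edits are assumptions for illustration.

```python
import sqlite3


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name "levenshtein".
conn.create_function("levenshtein", 2, levenshtein)

# Hypothetical table standing in for the existing taxon spine.
conn.execute("CREATE TABLE taxon (fullname TEXT)")
conn.executemany(
    "INSERT INTO taxon VALUES (?)",
    [
        ("Abacopteris menisciicarpa (Blume) Holttum",),
        ("Aa mathewsii (Rchb.f.) Schltr.",),
        ("Aa filamentosa Mansf.",),
    ],
)

novel = "Abacopteris menisciicarpos (Blume) Holttum"
# Names within 1 or 2 edits of the novel name are likely spelling variants.
near = conn.execute(
    "SELECT fullname, levenshtein(fullname, ?) FROM taxon "
    "WHERE levenshtein(fullname, ?) BETWEEN 1 AND 2",
    (novel, novel),
).fetchall()
```

This compares the novel name against every row, which may be slow on a large spine; restricting the comparison to names in the same genus would be one way to reduce the work.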