NHMDenmark / Herbarium-Sheets-workstation

Workstation and workflows for herbarium sheets for mass digitisation (DaSSCo)

Update taxonomy information for box 1-285 in Specify #117

Open RebekkaML opened 3 months ago

RebekkaML commented 3 months ago

When digitization began at Herbarium C, author names and taxon information below species level (hybrids, subspecies, variety, forma) were not recorded. Recording of these only started with box 286, so the information needs to be added to all entries from boxes 1-285.

For this, one spreadsheet was filled in with the author information for each taxon (#74). Another sheet was filled in with any further taxonomic information (hybrids, subspecies, etc.) (#91).

The information from these two spreadsheets is now collected in a large table located on the N-Drive: "N:\SCI-SNM-DigitalCollections\DaSSCo\Workflows and workstations\Herbarium\Infraspecies spreadsheet\Infraspecies_table_filled_in.xlsx"

Before this information can be uploaded to Specify, some last issues need to be resolved:

Once these Issues are resolved, we can plan how to import the missing information into Specify.

RebekkaML commented 3 months ago

The issue "Resolve Infraspecies spreadsheet notes before import" (#114) was resolved by deciding to leave the taxonomy comments in for now and deal with them after the import into Specify. This means that the "comments" column also needs to be imported, not just the author names and the subspecies, hybrids, etc.

RebekkaML commented 3 months ago

The related issues have been resolved and the updated and cleaned file is this: Infraspecies_table_filled_in.xlsx

It can be found here: "N:\SCI-SNM-DigitalCollections\DaSSCo\Workflows and workstations\Herbarium\Infraspecies spreadsheet\Infraspecies_table_filled_in.xlsx"

The columns "subspecies _old" and "variety_old" refer to information that is already in Specify, in case this is important to distinguish.

The new information that needs to be imported is "Subspecies", "Subspecies_Author", "Variety", "Variety_Author", "Forma", "Forma_Author", "Hybrid_parent_1", "Hybrid_parent_1_Author", "Hybrid_parent_2", "Hybrid_parent_2_Author" and "Comment".

The table also includes the Collection Object ID and current taxon ID for each specimen.
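As a rough illustration of how the import data could be prepared, the sketch below collects only the non-empty new values per specimen. The new-column names are the ones listed above; `CollectionObjectID` and `TaxonID` are hypothetical placeholders for the actual ID column headers in the sheet, and the rows are assumed to arrive as plain dictionaries (for example from `csv.DictReader` or a spreadsheet reader).

```python
# Columns holding new information, as listed in this comment.
NEW_COLUMNS = [
    "Subspecies", "Subspecies_Author",
    "Variety", "Variety_Author",
    "Forma", "Forma_Author",
    "Hybrid_parent_1", "Hybrid_parent_1_Author",
    "Hybrid_parent_2", "Hybrid_parent_2_Author",
    "Comment",
]
ID_COLUMN = "CollectionObjectID"  # hypothetical header name
TAXON_COLUMN = "TaxonID"          # hypothetical header name


def extract_updates(rows):
    """Map each collection object ID to its taxon ID and non-empty new fields.

    Rows without any new information are skipped entirely.
    """
    updates = {}
    for row in rows:
        fields = {}
        for column in NEW_COLUMNS:
            value = row.get(column)
            if value is None:
                continue
            text = str(value).strip()
            # Spreadsheet readers often render empty cells as "nan".
            if text and text.lower() != "nan":
                fields[column] = text
        if fields:
            updates[row[ID_COLUMN]] = {
                "taxon_id": row[TAXON_COLUMN],
                "fields": fields,
            }
    return updates
```

Skipping empty cells up front keeps the later import step from overwriting existing values in Specify with blanks.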

beckerah commented 1 month ago

I had a chat about this with Fedor, and he confirmed that there's no way to update records via workbench, which means we have two options:

  1. Manually update each record (clearly we're not actually going to do this)
  2. Update the records via the API

In order to use the API, we'll need to put together a script. This is going to require quite a bit of legwork, as I'll need to test the API calls and figure out all the primary & foreign keys, what to do about validation, etc. Bhupjit has already sent me a list of resources for playing around with this, which I'm tracking here: NHMDenmark/Projects/DaSSCo digitisation data/Research Specify API.
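To make the shape of such a script a bit more concrete, here is a minimal sketch that only builds an update request and does not send it. Everything about the endpoint is an assumption to be checked against the Specify API during testing: the `/api/specify/<table>/<id>/` URL pattern, the use of `PUT`, and the `version` field in the body. The base URL and the field name in the usage example are hypothetical, and authentication is left out completely.

```python
import json
import urllib.request


def build_update_request(base_url, table, record_id, fields, version):
    """Build (but do not send) a JSON update request for one record.

    ASSUMPTION: the URL pattern, the PUT method and the "version" key
    are unverified guesses about the Specify API, not confirmed facts.
    """
    url = f"{base_url.rstrip('/')}/api/specify/{table}/{record_id}/"
    body = json.dumps({**fields, "version": version}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Usage with hypothetical values; nothing is sent over the network here.
request = build_update_request(
    "https://specify.example.org", "taxon", 42, {"author": "L."}, version=3
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes a dry run possible: the script can log every planned update for review before any record is changed.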

beckerah commented 1 month ago

Since Joaquim is already working on a script to update records in Specify via the API (for the transcription app), I can piggyback off of his efforts. I talked to him briefly about it on Slack and asked if he knew when that part would be ready. Here was his response:

I have not started working on it yet, but that's the plan. I should start working on it in a couple of weeks, depending on how many changes are needed on the transcription platform. The first approach is to prepare a script that can ingest "formatted" data and push it into Specify using the API. This should be achieved in a few weeks. Then it could be extended to have an interface, and maybe allow users to map fields and decide behaviours for conflicts, such as overwrite and ignore. Depending on the level of complexity it could take a bit more time, but desirably before the end of the year.
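The conflict behaviours mentioned in that response ("overwrite" and "ignore") could look roughly like the sketch below. This is only an illustration of the idea, not Joaquim's script; the function name and the rule that empty values never count as a conflict are assumptions.

```python
def resolve_conflict(existing, incoming, policy="ignore"):
    """Pick the value to keep when both Specify and the sheet have one.

    "overwrite" keeps the incoming value, "ignore" keeps the existing one.
    Empty values (None or "") never conflict: the other side wins.
    """
    if policy not in ("overwrite", "ignore"):
        raise ValueError(f"unknown conflict policy: {policy!r}")
    if incoming in (None, ""):
        return existing
    if existing in (None, ""):
        return incoming
    return incoming if policy == "overwrite" else existing
```

For example, with the default "ignore" policy an author name already present in Specify is kept, while a missing one is filled in from the spreadsheet.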

Pip says this can wait, as it's lower priority than keeping digitization going and developing the new data pipeline for AU.