NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

Accommodate homonyms in DigiApp #434

Closed bhsi-snm closed 5 months ago

bhsi-snm commented 9 months ago

Template for issues/tickets in DigiApp

What is the issue ?

# Detailed description of the issue. 
Homonyms are supposed to be identified during digitisation, and they can only be told apart
by author name and year, but DigiApp currently doesn't show that information
when selecting the taxon name.
We need to add it so that digitisers can select the right taxon name and distinguish
between homonyms.

Why is it needed/relevant ?

# Explain the need, relevance.
We are digitising the vascular plants collection, which falls under botany. Botany has
homonyms, which means that when we are digitising, we have sometimes not correctly
entered the taxon name, or we need to be able to differentiate between
homonymous taxon names, which Specify does via either the Specify ID (taxonspid??) or
author name and year. But as our current GUI has no fields to
populate with that information, digitisers have no way to distinguish between the
homonyms. So, for homonyms to be incorporated into DigiApp, there are 2 solutions.

- EITHER DigiApp needs to be extended with author name and year

- OR we use the Specify API so that we get Specify-ID-based identification of homonyms (I am
a bit unclear on how that would work, so maybe @FedorSteeman or @PipBrewer
could explain that).

Estimate level of effort required.

If we go with author name and year, it is # difficult,
but if we go with integration through the API, it is substantial work,
worth a month (just a guess; could be more).

What is the expected acceptable result.

#  How to approach it?
#  Give a clear approach/potential solution on how to resolve it.
#  What steps would be required to do this ?
#  It might also be an idea to put some pseudocode if relevant.

What could be the challenges ?

Is there a potential risk to this. 
Could it affect another part of the project. 

What tests are required ?

New tests/Could include reference to the existing test

What documentation is required?

Could refer to existing documentation and changes in relevant doc files.

Remarks

PipBrewer commented 9 months ago

Information provided by Zsuzsanna Papp is that in Botany, homonyms are common within a family. The original idea was to have a partial taxonomic hierarchy visible to the digitiser (similar to the storage field); however, if homonyms are common within families, this would not resolve the problem. The solution is to create column(s) for author and year (remember that commas, brackets, etc. are important here) in the taxon table. These should be visible when selecting a taxon in the UI. I'm not sure how much of the Botany spine has author and year in Specify. We may need to get that using the GBIF API.
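A minimal sketch of why an author column helps at selection time, assuming the taxon table can be read as (name, author) pairs. The function names and the placeholder taxa (`Aus bus` etc.) are hypothetical and not taken from DigiApp. The author string is kept verbatim so that commas and brackets survive.

```python
from collections import defaultdict


def find_homonyms(taxa):
    """Return the names that occur with more than one distinct author string."""
    authors_by_name = defaultdict(set)
    for name, author in taxa:
        authors_by_name[name].add(author)
    return {
        name: sorted(authors)
        for name, authors in authors_by_name.items()
        if len(authors) > 1
    }


def selection_label(name, author):
    """Label shown to the digitiser: the name plus the verbatim author string, if any."""
    return f"{name} {author}" if author else name


# Placeholder data for illustration only, not real taxa.
taxa = [
    ("Aus bus", "(Smith, 1900)"),
    ("Aus bus", "Jones, 1950"),
    ("Aus cus", "Smith, 1900"),
]
homonyms = find_homonyms(taxa)
labels = [selection_label(name, author) for name, author in taxa]
```

With labels built this way, the two `Aus bus` entries no longer look identical in the selection list, which a partial hierarchy could not achieve if both sit in the same family.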

FedorSteeman commented 9 months ago

This may take a day or two to implement.

FedorSteeman commented 8 months ago

The major issue here is that, when it comes to botany, author information was not added to Specify from the taxonomy source, so it will have to be added after the fact.

FedorSteeman commented 8 months ago

It is important that this is solved, because it appears that the lack of authorship in the app taxon spine is creating duplicates in Specify upon import in cases where those same taxa do have authorship set. This means that Workbench will not match those taxa and will instead create a new one without authorship. This is bad, because these duplicates will then feed back into the taxon spine of the app db.

One problem with adding authorship to the fullname is that this will screw up the algorithm for guessing taxon rank. Work in progress.

FedorSteeman commented 8 months ago

Fortunately, the taxon trees for the two Entomology collections did already possess authorship, and it was easy to transfer those to the app. For NHMD Vascular Plants, the authorship will still need to be added from a source data set.

Latest taxonomy can be fetched here: https://www.checklistbank.org/dataset/53147/download?taxonID=7707728

Taking a look at the source for the taxon spine for vascular plants I can see two issues:

FedorSteeman commented 8 months ago

Just trying and testing in the test db; Updating taxa with author goes fine, but I found my first homonym that evades this effort:

So I need to find a way to add homonyms after the fact, but first I need to drop fossil taxa from the taxonomy...

FedorSteeman commented 8 months ago

Unfortunately, GBIF does not mark taxa as extant or extinct in their taxon spine export products.

However, I have found a way to wrangle OpenRefine to fetch data for each row from paleobiodb. It's really slow, but it is nevertheless progressing.

When paleobiodb data is fetched for each taxon, I can parse the resulting JSON to mark the different rows as extant or not. This way we can leave out fossil taxa.

value.parseJson().records[0].get('ext')
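For anyone wanting to run the same check outside OpenRefine, a rough Python equivalent of the GREL expression above might look like this. It assumes the response body has the shape the expression implies (a top-level `records` list whose entries may carry an `ext` key). The function name and the sample payload are made up for illustration, and how the `ext` value should be interpreted is not stated in this thread.

```python
import json


def first_record_ext(payload):
    """Return the 'ext' value of the first record, or None if it is missing."""
    data = json.loads(payload)
    records = data.get("records") or []
    if not records:
        return None
    return records[0].get("ext")


# Hypothetical response bodies, shaped like the GREL expression expects.
sample_with_flag = '{"records": [{"nam": "Delphinium", "ext": 1}]}'
sample_without_records = '{"records": []}'

flag = first_record_ext(sample_with_flag)
missing = first_record_ext(sample_without_records)
```

Unlike the one-line GREL version, this returns `None` for an empty result instead of failing, so rows without a paleobiodb match can be reviewed separately.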

FedorSteeman commented 5 months ago

Homonyms are now accommodated by the app, but we need to test whether these actually get through post-processing and into Specify via WorkBench. @jlegind is tasked with testing this.

jlegind commented 5 months ago

A test dataset was created with taxon names plus author name:

Delphinium bucharicum Popov
Delphinium carela Buch.-Ham. ex D.Don
Legouixia Van Heurck & Müll.Arg.
Legousia snogerupii Biel & Kit Tan

The author name was not processed in the GREL script part of post-processing, which means that only the binomial was transferred to test Specify. Example: https://specify-test.science.ku.dk/specify/view/collectionobject/4369407/

A solution would be to add an 'Author' column to the Specimen table. This would enable mapping to author name in Workbench.
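As a sketch of what that could look like in the exported file, here are the test names above written with the author in a separate column. The column names `taxonfullname` and `taxonauthor` and the function name are assumptions for illustration, not the actual Specimen table or Workbench field names.

```python
import csv
import io

# The four test names from this comment, split into name and author.
TEST_ROWS = [
    {"taxonfullname": "Delphinium bucharicum", "taxonauthor": "Popov"},
    {"taxonfullname": "Delphinium carela", "taxonauthor": "Buch.-Ham. ex D.Don"},
    {"taxonfullname": "Legouixia", "taxonauthor": "Van Heurck & Müll.Arg."},
    {"taxonfullname": "Legousia snogerupii", "taxonauthor": "Biel & Kit Tan"},
]


def rows_to_csv(rows):
    """Serialise rows to CSV text with the author as its own mappable column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["taxonfullname", "taxonauthor"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(TEST_ROWS)
```

Keeping the author out of the name column means the name itself stays untouched for whatever consumes it, while the author can be mapped to its own field in Workbench.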

FedorSteeman commented 5 months ago

@jlegind Can I see the post-processed file of this dataset?

And the pre-processed too, so I can attempt to replicate?

FedorSteeman commented 5 months ago

Although this is not in itself related to the GREL script, it does make sense that we need the author field in the specimen table so we can map that value in Workbench.

FedorSteeman commented 5 months ago

I did not notice that @jlegind created a new ticket #476 for the specific issue that I just fixed within the scope of this ticket.

I will try to tie these tickets together somehow.

Tickets will be closed and can be reopened if necessary depending on testing results.

FedorSteeman commented 5 months ago

To be tested with pre-release: https://github.com/NHMDenmark/Mass-Digitizer/releases/tag/v1.1.26

NOTE: We have not yet considered the author name of any new taxa...

jlegind commented 5 months ago

Comments on "Author names not carried over in the post processed file (GREL)" https://github.com/NHMDenmark/Mass-Digitizer/issues/476#issuecomment-1931523421

FedorSteeman commented 5 months ago

Superseded by #476