SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org

DwC-A: cannot import occurrence data for existing species #2581

Open sergeitarasov opened 2 years ago

sergeitarasov commented 2 years ago

As an experiment, I tried to import (many times with different modifications) one occurrence record using DwC-A (attached) with ‘restrict record to existing nmcl’ to match the species that is already in TW. But it does not work:

Protonym Parachorius not found with that name and/or classification. Importing new names is disabled by import settings.

I wonder how I can fix it? :)

Attachment: DwC_Parach.xlsx

LocoDelAssembly commented 2 years ago

Problem seems to be that the importer is trying to locate Parachorius thomsoni Harold, 1873 directly under root since the dataset is not providing higher classification.

I could probably change the way existing names are located when restricting is enabled, @mjy? This would mean ignoring higher classification if the name cannot be found in the provided parent path (it would also fix cases where the dataset does not agree with the existing classification in the database). This requires a fairly significant algorithm change.
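To illustrate the difference between the two lookups discussed above, here is a minimal sketch. It is not the TaxonWorks importer: `Node`, `find_by_path`, and `find_ignoring_parents` are hypothetical names, and the tree is a simplified stand-in for the classification mentioned in this thread.

```python
class Node:
    """Hypothetical stand-in for a taxon name in a classification tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def find_by_path(root, path):
    """Strict lookup: each name must be a direct child of the previous match."""
    node = root
    for name in path:
        node = next((c for c in node.children if c.name == name), None)
        if node is None:
            return None
    return node


def find_ignoring_parents(root, name):
    """Relaxed lookup: every node with this name, wherever it sits in the tree."""
    matches, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is not root and node.name == name:
            matches.append(node)
        stack.extend(node.children)
    return matches


# Classification as described in this thread (simplified).
root = Node("Root")
subfamily = Node("Scarabaeinae", root)
tribe = Node("Parachoriini", subfamily)
genus = Node("Parachorius", tribe)
species = Node("thomsoni", genus)

# The dataset supplies no higher classification, so the path starts at the genus.
dataset_path = ["Parachorius", "thomsoni"]
strict = find_by_path(root, dataset_path)          # None: genus is not directly under Root
relaxed = find_ignoring_parents(root, "thomsoni")  # one candidate, found by name alone
```

The relaxed lookup returns a list on purpose: with homonyms it can contain more than one candidate, and the caller has to decide what to do in that case.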

mjy commented 2 years ago

I think that he doesn't want to restrict, he wants them to be created, correct @sergeitarasov? He is just missing an option somewhere?

proceps commented 2 years ago

I can envision many cases where the classification does not match. I would say that the classification should be ignored and only the name string should match. There could be some issues, though. For example, we may have both a Protonym and a Combination with the same ScientificName. We may also have homonyms. In some cases manual resolution would still be required.

proceps commented 2 years ago

Following up on @mjy: my understanding is that @sergeitarasov specifically restricted creation of new names; he wants to link specimen records to the existing classification. That would be the requirement in most cases when we import data into the 3i Auchenorrhyncha project as well.

mjy commented 2 years ago

We should definitely maintain the mode where import only succeeds when the hierarchy fully matches as an option, and the default. Having an alternate mode where name matches OTU#name or Otu.taxon_name.cached only (and there is only one match), as you imply, is also useful.
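As a rough sketch of the alternate mode described here (accept a name-only match when exactly one record matches), assuming a flat list of records with a cached full name. The record layout, the status labels, and the placeholder names `Aus bus` / `Aus cus` are all hypothetical and are not taken from TaxonWorks.

```python
from collections import defaultdict

# Hypothetical records: (id, cached full name).
RECORDS = [
    (1, "Aus bus"),
    (2, "Aus bus"),  # e.g. a homonym or a duplicate created by an earlier import
    (3, "Aus cus"),
]


def index_by_cached_name(records):
    """Group record ids by their cached name string."""
    index = defaultdict(list)
    for record_id, cached in records:
        index[cached].append(record_id)
    return index


def match_name_only(index, scientific_name):
    """Accept a match only when it is unique; otherwise report why it failed."""
    ids = index.get(scientific_name, [])
    if len(ids) == 1:
        return ("matched", ids[0])
    if not ids:
        return ("not_found", None)
    return ("ambiguous", ids)  # needs manual resolution


index = index_by_cached_name(RECORDS)
```

Returning a status instead of raising lets an importer report ambiguous rows separately from rows whose names are simply missing.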

sergeitarasov commented 2 years ago

> Following up on @mjy: my understanding is that @sergeitarasov specifically restricted creation of new names; he wants to link specimen records to the existing classification. That would be the requirement in most cases when we import data into the 3i Auchenorrhyncha project as well.

Yep, that's correct @proceps, I would like to link the records to the existing species. This is going to be my most frequent import task with DwC-A. Do you have any idea how I can fix it now? By adding a higher-level taxon to the CSV?

mjy commented 2 years ago

Yes, adding the higher levels will make it work. The classification must match all the way up.

proceps commented 2 years ago

It is hard to envision managing classification in DwC. We have a table of 250 holotypes (species names only); updating the higher classification all the way up would be a job comparable to creating the collection objects manually with the Comprehensive task.

sergeitarasov commented 2 years ago

Current classification in TW: 'Parachorius thomsoni -> Parachorius -> Parachoriini -> Scarabaeinae -> Root'. I added 'tribe' and 'subfamily', but the import still returns the same error. Does that mean that I need to change the current classification to include the entire taxonomic path (all the way to Animalia)?

mjy commented 2 years ago

Not too hard. A pre-step is to build something like the geographic name matcher service we have. You paste in one column of names, you get the matching higher names back, and you paste those into your columns.
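A minimal sketch of that pre-step, assuming the existing classification is available as a simple child-to-parent mapping. The mapping, the function name `higher_classification`, and the `|` separator (as used later in this thread) are illustrative assumptions; no such service is described in detail here.

```python
# Hypothetical child -> parent mapping, using the classification from this thread.
PARENT = {
    "Parachorius thomsoni": "Parachorius",
    "Parachorius": "Parachoriini",
    "Parachoriini": "Scarabaeinae",
    "Scarabaeinae": "Root",
}


def higher_classification(name, parent=PARENT, root="Root", sep="|"):
    """Return the ancestors of `name`, highest first, or None if the name is unknown."""
    if name not in parent:
        return None
    path = []
    current = parent[name]
    while current is not None and current != root:
        path.append(current)
        current = parent.get(current)
    return sep.join(reversed(path))


# One pasted column of names in, one column of higher classifications out.
names = ["Parachorius thomsoni", "Unknown name"]
column = [higher_classification(n) for n in names]
```

Names that cannot be matched come back as `None`, so they can be reviewed by hand before the column is pasted into the spreadsheet.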

Again, both modes are warranted, I'm not debating that, but people importing data from diverse datasets are going to want the strict mode as well.

mjy commented 2 years ago

This is literally the challenge everyone wants to solve "trivially", which is anything but trivial when you want to make many decisions about your data.

For your data, @proceps, you have already pre-validated it all. This is different from others bringing in data that they haven't looked at.

sergeitarasov commented 2 years ago

I added 'higherClassification': Scarabaeinae|Parachoriini|Parachorius. Now the error is: 'Protonym thomsoni not found with that name and/or classification. Importing new names is disabled by import settings.'

LocoDelAssembly commented 2 years ago

Which project is this? Is it in production?

sergeitarasov commented 2 years ago

> Which project is this? Is it in production?

Yep, in production 'Dung_Beetles'

LocoDelAssembly commented 2 years ago

Getting the production database onto my development machine for testing. It will take me around 15 minutes to set up.

LocoDelAssembly commented 2 years ago

@sergeitarasov Sorry for the late reply; I had some problems with the database and a long meeting afterwards.

I tried with the attached spreadsheet and it had no trouble handling the name, but complained that the repository referenced in institutionCode with acronym AN does not exist (I changed it to something else to test).

Spreadsheet used: DwC_Parach.xlsx (higherClassification added on the far right)

sergeitarasov commented 2 years ago

Thanks @LocoDelAssembly! I tried your file and fixed the institution acronym. It still does not work for me, though, if I enable 'Restrict import to existing nomenclature only'. The error is the same: it cannot find the protonym thomsoni. However, if I do not restrict, then it works but creates another Parachorius thomsoni within the existing Parachorius.

LocoDelAssembly commented 2 years ago

@sergeitarasov right, sorry. On the first try I forgot to restrict to existing nomenclature, so it created the duplicate, but on the second try I enabled the restriction and it still succeeded. Investigating why this is happening...


sergeitarasov commented 2 years ago

The second try works for me too (with the restriction). TW adds the records to the P. thomsoni that was previously created by the DwC-A import, but not to the original P. thomsoni.

LocoDelAssembly commented 2 years ago

Found the problem. The importer was expecting authorship and year as data in the taxon name itself (as that is the way it creates them), but was failing in this case because authorship is derived from the original citation. Fixed by matching on rank and name only; it now matches the existing name.
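The failure mode described above can be sketched as follows. This is only an illustration under assumed field names (`verbatim_author`, `year`); it is not the importer's actual code or the TaxonWorks data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredName:
    """Hypothetical stored taxon name; authorship fields may be empty."""
    name: str
    rank: str
    verbatim_author: Optional[str] = None
    year: Optional[int] = None


def matches_with_authorship(stored, name, rank, author, year):
    """Old behaviour (as described): author and year must be stored on the name."""
    return (stored.name == name and stored.rank == rank
            and stored.verbatim_author == author and stored.year == year)


def matches_rank_and_name(stored, name, rank):
    """Behaviour after the fix (as described): compare rank and name only."""
    return stored.name == name and stored.rank == rank


# Existing name whose authorship comes from its original citation,
# so nothing is stored in the author and year fields.
existing = StoredName(name="thomsoni", rank="species")

old_result = matches_with_authorship(existing, "thomsoni", "species", "Harold", 1873)
new_result = matches_rank_and_name(existing, "thomsoni", "species")
```

Under these assumptions the old comparison fails because the empty stored fields never equal `Harold` and `1873`, while the rank-and-name comparison succeeds.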

LocoDelAssembly commented 2 years ago

@sergeitarasov I forgot to mention: this fix won't be available until we release 0.20.1. Please be sure to delete the duplicate name created by the importer to avoid confusion.

sergeitarasov commented 2 years ago

Works for me now! Thanks for the help @LocoDelAssembly :) Two quick Qs:

  1. Does that mean that the import currently behaves differently for taxon names linked to an author citation vs. those with verbatim authorship?
  2. Approximately when will TW 0.20.1 be released?

mjy commented 1 year ago

Can we close this?