SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org

IdentificationQualifier is not mapped as DwC import field #2430

Open LordFlashmeow opened 3 years ago

LordFlashmeow commented 3 years ago

This issue was discussed here: https://github.com/SpeciesFileGroup/antweb-staging/issues/8 but there was no resolution.

What are the correct fields for a scientific name like Camponotus mgo1 (where mgo1 is the IdentificationQualifier)? The current scientificname parser breaks for names like this.

mjy commented 3 years ago

When the name breaks, the OTU#name must be populated. In this example, Camponotus mgo1 would be the OTU name. If this is not the case in the current import, then I think it's a top candidate for 0.20.1 rather than a blocker.

mjy commented 2 years ago

@LocoDelAssembly @LordFlashmeow can this be closed?

LocoDelAssembly commented 2 years ago

This is the current mapping we have for it:

    ident_qualifier = get_field_value(:identificationQualifier)
    if ident_qualifier =~ /^cf[\.\s]/
      otu_names << ident_qualifier
    else
      otu_names << "#{get_field_value(:scientificName)} #{ident_qualifier}"
    end unless ident_qualifier.nil?
    names.last&.merge!({otu_attributes: {name: otu_names.join(' ')}}) unless otu_names.empty?
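
For readers outside the importer, here is a minimal standalone sketch of what that mapping does to a single row. The method name `otu_name_from` and the plain `row` hash are hypothetical stand-ins for the importer's `get_field_value` lookups, and the sketch ignores any entries already collected in `otu_names`.

```ruby
# Hypothetical standalone version of the mapping quoted above.
# `row` is a plain Hash standing in for the importer's field lookups.
def otu_name_from(row)
  qualifier = row[:identificationQualifier]
  # No qualifier: this sketch produces no OTU name at all.
  return nil if qualifier.nil?

  # A qualifier starting with "cf." or "cf " is used on its own.
  return qualifier if qualifier =~ /^cf[\.\s]/

  # Any other qualifier is appended to scientificName.
  "#{row[:scientificName]} #{qualifier}"
end

puts otu_name_from({ scientificName: 'Camponotus', identificationQualifier: 'mgo1' })
# prints "Camponotus mgo1"
puts otu_name_from({ scientificName: 'Camponotus', identificationQualifier: 'cf. mgo1' })
# prints "cf. mgo1"
```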

LordFlashmeow commented 2 years ago

Probably. I'll reopen if I encounter the issue on the next big import.

LocoDelAssembly commented 2 months ago

The way this was implemented conflicts with the "Restrict to existing nomenclature" feature. For instance, if you have an invalid scientific name like Jivarus ali3nus, it is matched with the Jivarus protonym, and Otu.name is set to Jivarus ali3nus. The desired result would be to FAIL, and even with the restriction disabled it is still questionable that the importer accepts the scientificName. We discovered this problem while importing datasets with some bogus scientific names into a private copy of OSF (fortunately, it was solvable by deleting the OTUs and associated data with a script in the Rails console).

    Biodiversity::Parser.parse("Jivarus ali3nus")
    =>
    {:parsed=>true,
     :quality=>4,
     :qualityWarnings=>[{:quality=>4, :warning=>"Unparsed tail"}],
     :verbatim=>"Jivarus ali3nus",
     :normalized=>"Jivarus",
     :canonical=>{:stemmed=>"Jivarus", :simple=>"Jivarus", :full=>"Jivarus"},
     :cardinality=>1,
     :tail=>" ali3nus",
     :details=>{:uninomial=>{:uninomial=>"Jivarus"}},
     :words=>[{:verbatim=>"Jivarus", :normalized=>"Jivarus", :wordType=>"UNINOMIAL", :start=>0, :end=>7}],
     :id=>"5b4c5fe6-8c4c-5f5d-9238-ad9c242d5560",
     :parserVersion=>"GNparser v1.9.1"}

Do your DwC datasets use identificationQualifier? If so, we could restrict which "unparsed tail" values from the name parser are considered valid.

cc @LordFlashmeow @bpescador @AntWeb-org @mjy @mabecabrera

Problematic line of code: https://github.com/SpeciesFileGroup/taxonworks/commit/d922b69d9a571790fd362aeb182361998d5f8c57#diff-49f1423594fe8c44666b568f77142aaead2f6a2796b0e4458895bd8e62e3755eR793 (line 793 if the anchor fails)

bpescador commented 2 months ago

In AntWeb data, any name with non-alpha characters is a morphotaxon (an OTU in TaxonWorks speak). Non-alpha characters are restricted to numbers and "-".
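
That rule is simple enough to sketch as a check. The helper name `morphotaxon_name?` is hypothetical and nothing like it is claimed to exist in TaxonWorks or AntWeb; it only tests for the two kinds of non-alpha characters mentioned above.

```ruby
# Hypothetical helper: a name counts as a morphotaxon when it contains
# a digit or a "-", the only non-alpha characters described above.
MORPHOTAXON_CHARS = /[0-9-]/

def morphotaxon_name?(name)
  name.match?(MORPHOTAXON_CHARS)
end

puts morphotaxon_name?('Camponotus mgo1')           # prints "true"
puts morphotaxon_name?('Camponotus pennsylvanicus') # prints "false"
puts morphotaxon_name?('Jivarus ali3nus')           # prints "true"
```

Note that the last call also returns true, so a character rule like this cannot by itself tell an intentional morphotaxon from a bogus name.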

LocoDelAssembly commented 2 months ago

That's OK, but do you also put the non-alpha characters in identificationQualifier when using the importer? I see that in the ant_formicidae dataset you do, and in fact you don't place the non-alpha words in scientificName. Are you always doing it like this? If so, I could revert to stricter quality checking of parsed names in scientificName (which would easily solve the conflict problem), and use identificationQualifier to compose the Otu.name.
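
A rough sketch of what such a stricter check could look like, working on a parse result shaped like the one quoted earlier in this thread. The method name `strict_scientific_name?` is hypothetical, the first hash is an abridged copy of that output, and the second hash is an assumed shape for a clean parse rather than captured parser output.

```ruby
# Abridged copy of the parse result for "Jivarus ali3nus" quoted in this thread.
UNPARSED_TAIL_RESULT = {
  parsed: true,
  quality: 4,
  verbatim: 'Jivarus ali3nus',
  canonical: { simple: 'Jivarus' },
  tail: ' ali3nus'
}.freeze

# ASSUMPTION: illustrative shape for a name that parses with nothing left over.
CLEAN_RESULT = {
  parsed: true,
  verbatim: 'Camponotus pennsylvanicus',
  canonical: { simple: 'Camponotus pennsylvanicus' },
  tail: ''
}.freeze

# Hypothetical strict check: accept a scientificName only when the parser
# consumed the whole string, i.e. there is no unparsed tail.
def strict_scientific_name?(result)
  result[:parsed] == true && result[:tail].to_s.strip.empty?
end

puts strict_scientific_name?(UNPARSED_TAIL_RESULT) # prints "false"
puts strict_scientific_name?(CLEAN_RESULT)         # prints "true"
```

Under a check like this, Jivarus ali3nus would fail instead of silently matching the Jivarus protonym, and the morphospecies part would have to arrive through identificationQualifier.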

bpescador commented 2 months ago

I think we tried to follow the GBIF DwC guidelines - let me know if you think we did it the wrong way. I find the DwC approach to OTU names unnecessarily confusing.

LocoDelAssembly commented 2 months ago

Found this discussion: https://github.com/tdwg/dwc-qa/issues/162

I believe you've done it right, and if you continue using identificationQualifier for the morphospecies part of the scientific name, I can just make the scientificName parser strict again. Also, we may consider https://dwc.tdwg.org/terms/#dwc:verbatimIdentification (currently not mapped in the importer), which was referenced in the issue above and discussed at https://github.com/tdwg/dwc/issues/181

Maybe when verbatimIdentification is present, use it for Otu.name instead of scientificName + identificationQualifier? (Still leaving Otu.name blank when only scientificName is provided)
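
A minimal sketch of that precedence, only to make the proposal concrete. The method name `proposed_otu_name` and the plain `row` hash are hypothetical, and the existing special case for qualifiers starting with "cf." is left out for brevity.

```ruby
# Hypothetical sketch of the proposed precedence for Otu.name:
#   1. verbatimIdentification, when present
#   2. scientificName + identificationQualifier
#   3. nothing, when only scientificName is provided
def proposed_otu_name(row)
  verbatim = row[:verbatimIdentification].to_s.strip
  return verbatim unless verbatim.empty?

  qualifier = row[:identificationQualifier].to_s.strip
  # Only scientificName given: leave Otu.name blank.
  return nil if qualifier.empty?

  "#{row[:scientificName]} #{qualifier}".strip
end

p proposed_otu_name({ scientificName: 'Camponotus', verbatimIdentification: 'Camponotus mgo1' })
# prints "Camponotus mgo1"
p proposed_otu_name({ scientificName: 'Camponotus', identificationQualifier: 'mgo1' })
# prints "Camponotus mgo1"
p proposed_otu_name({ scientificName: 'Camponotus' })
# prints nil
```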