CatalogueOfLife / data

Repository for COL content

Inconsistencies in World Plants data and representation of hybrid taxa #398

Open Australis86 opened 2 years ago

Australis86 commented 2 years ago

Describe the problem: The Darwin Core Archive exports for the Amaranthus genus (and likewise the COL webpages) do not indicate that some of the taxa are hybrid taxa except by the hybrid symbol in the scientificName. The taxonRank is given as species, rather than nothospecies, and the taxonRemarks field (which usually includes a note to say "Hybrid taxon" and the parentage, if available) is empty. Looking at the upstream datasource, the parentage is given for at least one of the hybrid taxa (Amaranthus ozanonii).

The inclusion of the hybrid symbol makes searching for these entries via the API difficult, as most other genera (e.g. Cymbidium from KEW WCSP) do not include this. I'd much prefer if the representation of natural hybrids was kept consistent by excluding the use of the hybrid symbol, correctly setting the rank and including the parentage where available in the taxon remarks. This makes it much easier to work with the dataset and the API.

Link to affected CoL webpages: https://www.catalogueoflife.org/data/taxon/84JPQ

mdoering commented 2 years ago

Hybrid taxa are represented in 2 different ways in ChecklistBank. If they are hybrid formulas we do not parse them and instead keep the entire formula in the scientificName and indicate this by setting Name.type=HYBRID.

In the case of named hybrids we do parse them, but we do not want to have specific ranks for notho taxa. Instead we use an additional field, notho, that specifies which part of the name holds the hybrid marker, i.e. at which rank the hybridisation occurred. In your example this is notho=specific: https://api.catalogueoflife.org/dataset/3LR/taxon/84JPQ

"scientificName":"Amaranthus × ozanonii",
"authorship":"Priszter",
"rank":"species",
"genus":"Amaranthus",
"specificEpithet":" ozanonii",
"notho":"specific",
"combinationAuthorship":{
  "authors":["Priszter"]
},
"code":"botanical",
"origin":"source",
"type":"scientific",
"remarks":"Hybrid taxon. Binomial generated by Species 2000.",
"parsed":true

Can you elaborate about the search problem? As we have properly parsed names the hybrid marker is not really in the way: https://www.checklistbank.org/dataset/1141/names?q=Amaranthus%20ozanonii
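The two representations described in this comment (unparsed hybrid formulas flagged through the name type, and parsed named hybrids flagged through notho) can be told apart with a few lines of Python. This is only a minimal sketch: the function name `hybrid_kind` is hypothetical, and it assumes the type value contains the word "hybrid" in some letter case, which is not confirmed by the JSON shown above.

```python
def hybrid_kind(name: dict) -> str:
    """Classify a name record as 'formula', 'named' or 'none'.

    Assumption: hybrid formulas carry a type value containing "hybrid";
    named hybrids carry a non-empty `notho` value such as "specific".
    """
    # Hybrid formulas are kept unparsed and flagged through the name type.
    if "hybrid" in str(name.get("type", "")).lower():
        return "formula"
    # Named hybrids keep their ordinary rank; `notho` marks the name part.
    if name.get("notho"):
        return "named"
    return "none"


record = {
    "scientificName": "Amaranthus \u00d7 ozanonii",
    "rank": "species",
    "notho": "specific",
    "type": "scientific",
}
print(hybrid_kind(record))  # named
```

Checking the type first keeps formulas from being misread as named hybrids if a record ever carried both values.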

As for the remarks I don't know how these are generated. We simply copy all remarks from the source to the COL Checklist. But the remark is already like this in the source dataset in ChecklistBank. @gdower I assume you generate those comments in your ColDP script?

Australis86 commented 2 years ago

Thanks for the quick response. That is not quite the structure of the data I am seeing in the DwCA exports (e.g. from https://api.checklistbank.org/dataset/[key]/export). For example, this is the row for ozanonii from the Taxon.tsv file produced for Amaranthus:

dwc:taxonID 84JPQ
dwc:parentNameUsageID   T8M
dwc:acceptedNameUsageID 
dwc:originalNameUsageID 
dwc:scientificNameID    1e7e47fd-4350-4a8a-8cf6-4438037218a7
dwc:datasetID   1141
dwc:taxonomicStatus accepted
dwc:taxonRank   species
dwc:scientificName  Amaranthus ×  ozanonii Priszter
dwc:scientificNameAuthorship    Priszter
dwc:genericName Amaranthus
dwc:infragenericEpithet 
dwc:specificEpithet ozanonii
dwc:infraspecificEpithet    
dwc:cultivarEpithet 
dwc:nameAccordingTo 
dwc:namePublishedIn 
dwc:nomenclaturalCode   ICN
dwc:nomenclaturalStatus 
dwc:taxonRemarks    
dcterms:references  http://www.worldplants.de/?deeplink=Amaranthus-ozanonii

So the only indication in the DwCA dataset that this is a hybrid taxon is in the scientificName field (where it has two spaces after the symbol). This is inconsistent with other data sources such as the KEW WCSP (which provides the Orchid family amongst others), which doesn't have the hybrid symbol in the name but does indicate hybrid taxa by the taxonRemarks column.
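Until an explicit hybrid field is available in the export, the marker in dwc:scientificName is the only signal, so a consumer has to detect and strip it. A rough sketch of that workaround follows; the function name is hypothetical, and it only handles the × character (U+00D7), not sources that write a plain letter "x". Hybrid formulas would also be flagged, since they contain the same symbol.

```python
import re

HYBRID_MARKER = "\u00d7"  # multiplication sign used as the hybrid symbol


def split_hybrid_marker(scientific_name: str) -> tuple[str, bool]:
    """Return (name without the marker, single-spaced) and a hybrid flag."""
    is_hybrid = HYBRID_MARKER in scientific_name
    # Replace the marker with a space, then collapse runs of whitespace,
    # which also repairs the double space seen in the export row above.
    cleaned = scientific_name.replace(HYBRID_MARKER, " ")
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, is_hybrid


print(split_hybrid_marker("Amaranthus \u00d7  ozanonii Priszter"))
# ('Amaranthus ozanonii Priszter', True)
```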

If I then use the nameusage API (https://api.checklistbank.org/dataset/3LR/nameusage/search), the only way to get this nothospecies is to either set type to "WHOLE WORDS" (which then also returns at least one of its subspecies) or include the hybrid symbol in the search term and set type to 'EXACT'. Normally I send the following to the API:

params = {'q':search_term, 'content': 'SCIENTIFIC_NAME', 'maxRank':'SPECIES', 'type': 'EXACT', 'offset':0, 'limit':10}
r = self._session.get("https://api.checklistbank.org/dataset/3LR/nameusage/search", params=params, headers={'accept': 'application/json'})

On the other hand, if I were to do this with any of the orchid nothospecies, the above would work flawlessly as the hybrid symbol is not included (e.g. Cymbidium baoshanense).
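One client-side workaround is to try the EXACT search twice, once without and once with the hybrid symbol. The sketch below only builds the parameter sets, reusing the parameters from the request above; the helper name `search_variants` is hypothetical, and the exact spacing around the symbol that the API expects is an assumption.

```python
def search_variants(binomial: str) -> list[dict]:
    """Build EXACT-search parameter sets with and without the hybrid marker.

    Assumption: the input is a plain two-word binomial; anything else
    raises ValueError when unpacking.
    """
    genus, epithet = binomial.split()
    queries = [f"{genus} {epithet}", f"{genus} \u00d7 {epithet}"]
    return [
        {"q": q, "content": "SCIENTIFIC_NAME", "maxRank": "SPECIES",
         "type": "EXACT", "offset": 0, "limit": 10}
        for q in queries
    ]


for params in search_variants("Amaranthus ozanonii"):
    print(params["q"])
```

Each dictionary can be passed as `params` to the same GET request, stopping at the first variant that returns a result.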

Essentially I'd just like there to be a consistent way to identify nothospecies in the DwCA exports, rather than need to accommodate different formats depending on the upstream datasource.

Let me know if you need more information.

mdoering commented 2 years ago

DwC is not an ideal format for taxonomic data. There has been an issue for dealing with notho taxa that has been closed because of lack of demand: https://github.com/tdwg/dwc/issues/43

This would be solved when using the ColDP archive which aligns better with our API. I guess we can add an unofficial col:notho term to the dwc export until there is an official dwc term. Would that help?

The Kew species (C. baoshanense) only mentions the hybrid nature in the remarks. That is completely opaque to us, so we treat it just as a regular species. This data is rather old and will be replaced by a newer version from Kew soon (@yroskov ?). In the newer WCVP and WCSP datasets the hybrid is correctly marked up:

mdoering commented 2 years ago

See also https://github.com/CatalogueOfLife/general/blob/master/docs/NAMES.md#named-hybrid

I just realized there is also no notho term in ColDP. We anticipated that the hybrid symbol would prefix the respective part of the name. I would be in favor of adding a new col:notho term like we have in the API.

Australis86 commented 2 years ago

Thanks Markus. I have no experience with the ColDP format, but quickly looking at it suggests it may be far more useful to me going forward (alas, I will need to overhaul a chunk of code, but it looks like it will be worth it in the long run). All my scripts were built around the DwC format years ago and I wasn't aware of the development of the ColDP, so I appreciate you bringing it to my attention.

I am looking forward to the KEW WCSP dataset being updated and I appreciate the heads up as to which way this is going to go (the inclusion of the hybrid symbol), as I will need to update my scripts to accommodate this. I am glad that the outcome will be more consistent.

The inclusion of a notho field in the DwC (as an interim measure until I can rewrite to use ColDP) and ColDP formats would be greatly appreciated (and preferable over including the symbol in the corresponding epithet field), as I often use the individual fields with the database I work with rather than parsing the scientificName (the system I work with does not accommodate the author, only the botanical Latin name).
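With a separate notho value, the Latin name can be assembled from the individual fields, with the hybrid symbol optional, and without parsing scientificName or stripping the author. A minimal sketch, assuming the exported value mirrors the API ("specific" as in the JSON earlier in the thread); "generic" is an assumed value for nothogenera and `latin_name` is a hypothetical helper.

```python
from typing import Optional

HYBRID_MARKER = "\u00d7"


def latin_name(genus: str, specific_epithet: str,
               notho: Optional[str] = None, show_marker: bool = True) -> str:
    """Build an author-free binomial from atomised name fields."""
    # Strip stray whitespace such as the leading space seen in some epithets.
    genus, epithet = genus.strip(), specific_epithet.strip()
    if show_marker and notho == "generic":  # assumed value, not confirmed
        genus = f"{HYBRID_MARKER} {genus}"
    if show_marker and notho == "specific":
        epithet = f"{HYBRID_MARKER} {epithet}"
    return f"{genus} {epithet}"


print(latin_name("Amaranthus", " ozanonii", "specific"))
print(latin_name("Amaranthus", " ozanonii", "specific", show_marker=False))
```

Passing `show_marker=False` yields the marker-free form that sources like the older Kew data use, so both conventions can be produced from the same fields.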

If it is straightforward to make the namesearch API insensitive to the presence of the hybrid symbol, that would be much appreciated.

mdoering commented 2 years ago

For the namesearch I don't think I can make that happen easily. But I will be working on a new matching method in the API that would be the recommended way to look up a single matching name in any dataset. It will be based on the NamesIndex and should allow for small variations like the hybrid marker and epithet gender.

gdower commented 2 years ago

> @gdower I assume you generate those comments in your ColDP script?

Yes, we add that comment.

mdoering commented 2 years ago

> > @gdower I assume you generate those comments in your ColDP script?
>
> Yes, we add that comment.

Does that mean the binomial does not exist in World Plants? It appears just like that here: https://www.worldplants.de/world-plants-complete-list/complete-plant-list/?name=Amaranthus-ozanonii

> Hybrid taxon. Binomial generated by Species 2000.

Maybe we should replace Species 2000 with Catalogue of Life?

Australis86 commented 2 years ago

> For the namesearch I don't think I can make that happen easily. But I will be working on a new matching method in the API that would be the recommended way to look up a single matching name in any dataset. It will be based on the NamesIndex and should allow for small variations like the hybrid marker and epithet gender.

Thanks Markus. I look forward to that becoming available.

gdower commented 2 years ago

> Maybe we should replace Species 2000 with Catalogue of Life?

I'll remove the binomial comment. It made sense in the old web interface that couldn't render the × symbol for hybrids, although maybe I should keep the Hybrid taxon remark.

mdoering commented 2 years ago

@Australis86 I have deployed the change to our dev environment and downloaded the Poaceae family from VASCAN with many hybrids and the new notho field in both DwC-A and ColDP downloads. Will be on prod soon.

Australis86 commented 2 years ago

@mdoering Thanks Markus. I'll use those samples to do some testing.