Closed bart-v closed 1 year ago
The link above uses UAT (user acceptance testing). The result might be the same, but the test and production environments are likely to differ much of the time.
I suggest that you use the production site https://www.gbif.org instead
(UAT and the production environment are usually very similar for record interpretation.)
@mdoering can help explain what's happening here. I can see we have the name from the WoRMS checklist: https://www.gbif.org/species/105760798 but it isn't linked in to our taxonomic backbone. This one from a different checklist is, but it loses the author.
We don't yet match using the scientificNameId, we should look to support this for the ingestion pipeline rewrite (in progress this year).
Adding a kingdom will make this match to the correct name.
Yes, scientificNameId is pretty much ignored in both occurrence and checklist processing. With a scientificNameId value from occurrences pointing to another checklist, or even to something outside of GBIF-indexed data, it will not be a simple exercise. All occurrences must link to a Backbone species, not some other checklist. For that to happen the backbone would a) have to have such a species (with that authorship) and b) know about the global scientificNameIds used in other lists. Maybe something for when we have switched the backbone to use CoL+.
Most of the links on this issue are deprecated. Plus, as far as I understand, this is not something that we can fix. Could we close this issue?
This is not about dead links, but about the field dwc:scientificNameID being ignored by GBIF. I think it's a very important issue.
Agree this is an important issue, especially for OBIS node contributions. This really is a missed opportunity for GBIF as OBIS nodes take great care in assigning an appropriate scientificNameID to each occurrence. Would hate to see any records from the OBIS-USA node end up as terrestrial species when we've taken the time to provide the marine representation.
In that case, should it be an issue for the CoL+? https://github.com/Sp2000/colplus
I am wondering about a few things here:
why does the name have a classification? dwc:scientificNameID should point to nomenclatural information. Taxon concepts and classifications would be dwc:taxonConceptID or even just dwc:taxonID
We never dealt with DwC archives pointing to external information. All archive IDs can be resolved locally within the archive. This is not true for dwc:scientificNameID
For resolving external IDs there is no standard format, protocol or anything alike. It's quite a burden to know all variations in advance and issue HTTP calls to resolve each ID.
Is there really extra information in the linked name data that would help us to better interpret the name & its classification? Isn't all that information already given in the DwC occurrence record?
Looking at one of the Oligochaeta Koch examples I see the taxonomic dwc occurrence information is very sparse: https://www.gbif.org/occurrence/1324564024 It is just the name, not even a rank, kingdom or anything else. The ID would have made a difference here. But would it be difficult to enrich the occurrence data?
http://lsid.info/urn:lsid:marinespecies.org:taxname:2036
<?xml version="1.0"?><rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
>
<rdf:Description rdf:about="urn:lsid:marinespecies.org:taxname:2036">
<dc:type>ScientificName</dc:type>
<dc:date>2019-10-03</dc:date>
<dc:subject><![CDATA[Oligochaeta Grube, 1850]]></dc:subject>
<dc:title><![CDATA[Oligochaeta]]></dc:title>
<dc:relation><![CDATA[http://www.marinespecies.org/aphia.php?p=taxdetails&id=2036]]></dc:relation><dc:creator><![CDATA[Timm, Tarmo]]></dc:creator><dc:creator><![CDATA[van Haaren, Ton]]></dc:creator><dc:identifier>urn:lsid:marinespecies.org:taxname:2036</dc:identifier>
<dc:publisher>World Register of Marine Species (WoRMS)</dc:publisher>
<dc:license>http://creativecommons.org/licenses/by/4.0/</dc:license>
<dc:language>en</dc:language>
<dcterms:bibliographicCitation><![CDATA[WoRMS (2019). Oligochaeta. Accessed at: http://www.marinespecies.org/aphia.php?p=taxdetails&id=2036 on 2019-10-03]]></dcterms:bibliographicCitation><dcterms:created>2004-12-21T16:54:05+01:00</dcterms:created>
<dcterms:modified>2017-06-01T14:33:21+01:00</dcterms:modified>
<dcterms:rightsHolder>WoRMS Editorial Board</dcterms:rightsHolder>
<dwc:kingdom>Animalia</dwc:kingdom>
<dwc:phylum>Annelida</dwc:phylum>
<dwc:class>Clitellata</dwc:class>
<dwc:order></dwc:order>
<dwc:family></dwc:family>
<dwc:genus></dwc:genus>
<dwc:subgenus></dwc:subgenus>
<dwc:specificEpithet></dwc:specificEpithet>
<dwc:infraspecificEpithet></dwc:infraspecificEpithet>
<dwc:taxonRank>subclass</dwc:taxonRank>
<dwc:ScientificName><![CDATA[Oligochaeta Grube, 1850]]></dwc:ScientificName>
<dwc:scientificNameAuthorship><![CDATA[Grube, 1850]]></dwc:scientificNameAuthorship>
<dwc:taxonomicStatus><![CDATA[accepted]]></dwc:taxonomicStatus>
<dwc:namePublishedIn><![CDATA[Grube, Adolf Eduard. (1850). Die Familien der Anneliden. <em>Archiv für Naturgeschichte, Berlin.</em> 16(1): 249-364.]]></dwc:namePublishedIn>
<dwc:namePublishedInYear>1850</dwc:namePublishedInYear><dwc:scientificNameID rdf:resource="urn:lsid:marinespecies.org:taxname:2036" />
<dwc:parentNameUsageID rdf:resource="urn:lsid:marinespecies.org:taxname:14165" /> </rdf:Description>
</rdf:RDF>
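As a rough illustration of what such a resolved record could add to a sparse occurrence, here is a minimal sketch that pulls the non-empty Darwin Core terms out of an RDF response like the one above. The function name `dwc_terms` and the trimmed `SAMPLE` string are assumptions made for the example; it uses only the Python standard library and does not perform any HTTP resolution.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Trimmed copy of the response shown above (only a few terms kept).
SAMPLE = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<rdf:Description rdf:about="urn:lsid:marinespecies.org:taxname:2036">
<dwc:kingdom>Animalia</dwc:kingdom>
<dwc:phylum>Annelida</dwc:phylum>
<dwc:class>Clitellata</dwc:class>
<dwc:order></dwc:order>
<dwc:taxonRank>subclass</dwc:taxonRank>
<dwc:scientificNameAuthorship><![CDATA[Grube, 1850]]></dwc:scientificNameAuthorship>
</rdf:Description>
</rdf:RDF>"""


def dwc_terms(rdf_xml: str) -> dict:
    """Return the non-empty dwc:* child elements of the first rdf:Description."""
    root = ET.fromstring(rdf_xml)
    description = root.find(f"{{{RDF}}}Description")
    terms = {}
    if description is None:
        return terms
    for element in description:
        # Keep only Darwin Core elements that carry text; empty ranks are skipped.
        if element.tag.startswith(f"{{{DWC}}}") and element.text and element.text.strip():
            local_name = element.tag.split("}", 1)[1]
            terms[local_name] = element.text.strip()
    return terms


if __name__ == "__main__":
    print(dwc_terms(SAMPLE))
```

For the Oligochaeta example this yields the kingdom, phylum, class, rank and authorship that the occurrence record itself lacks.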
The point of (dwc) archives is that it is NOT linked data. But if we had a (WoRMS) checklist that defined those IDs we could cross reference them so the taxonomic information would not have to be repeated in the occurrences.
In that case, should it be an issue for the CoL+? https://github.com/Sp2000/colplus
To some degree yes, but it is primarily an Occurrence interpretation issue
To answer your questions @mdoering
You have a WoRMS checklist that defines those: https://www.gbif.org/dataset/2d59e5db-57ad-41ff-97d6-11f5fb264527
I think referring to a known checklist like WoRMS and reusing their taxonIDs makes a lot of sense and GBIF should support that in the long run. @timrobertson100 maybe the pipelines project can be a good way to include such a taxonID lookup.
Still there are many detail questions, I have a few popping up immediately:
Thanks @bart-v @albenson-usgs
Currently dwc:scientificNameID just passes through ignored - but that just reflects the state of play when that codebase was written and the term was not well used. That is not the case today, and I agree GBIF should make use of it for cases when it clearly identifies e.g. WoRMS, IPNI, Index Fungorum records - especially as it is the OBIS recommendation to publishers.
I will move this issue into the gbif pipelines project, where we'll implement it working through the issues @mdoering raises. All effort right now is on making the new ingestion pipeline live.
For current links, ~all~ almost all Danish Mycological Society, fungal records database records contain scientificNameID pointing to Index Fungorum such as this example.
Edited to add: There are a few obscure records where this doesn't hold true, but they are rare.
@mdoering about finding out what checklist (version) has been used, everything is solved by using a proper and persistent GUID (like LSID): it tells you what authority has been used, on a per record basis.
I don't understand this question
if we rely on globally unique ids (...) how do we know which checklist is the authority in case several checklists use these ids?
If it's a GUID, there is only one single checklist who has assigned/generated this GUID, so there is nothing to choose from?
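To make the "the LSID tells you the authority" point concrete, here is a minimal sketch that splits an LSID into its parts. The `Lsid` type and `parse_lsid` function are hypothetical names for this example, and the parsing only assumes the `urn:lsid:<authority>:<namespace>:<object>` shape seen in the identifiers quoted in this thread.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str


def parse_lsid(value: str) -> Optional[Lsid]:
    """Split 'urn:lsid:<authority>:<namespace>:<object>' into its parts.

    Returns None when the value is not shaped like an LSID.
    """
    parts = value.strip().split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        return None
    authority, namespace, object_id = parts[2], parts[3], parts[4]
    if not (authority and namespace and object_id):
        return None
    # The authority is a domain-like label, so compare it case-insensitively.
    return Lsid(authority.lower(), namespace, object_id)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:marinespecies.org:taxname:2036"))
```

A URL-style identifier such as https://www.marinespecies.org/aphia.php?p=taxdetails&id=1 returns None here, so it would need its own handling.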
Thanks @timrobertson100
@bart-v a properly versioned LSID would tell you what it was when resolving it. But I doubt a DwC WoRMS archive contains all historical versions of a name or deleted names.
My point about a non unique GUID is that there might be various datasets, e.g. molluscabase, WoRMS, Catalogue Of Life that all use the same GUID. Knowing which is the authoritative one seems trivial by looking at the domain, but I would expect we better have some metadata about that on the dataset level. I am sure GUIDs will not appear once only.
WoRMS could do versions but that is usually overkill. We hardly ever change names, but create new ones and point them to each other. We do keep track of deletions.
I agree that some metadata on dataset level is needed, indeed.
My point about a non unique GUID is that there might be various datasets, e.g. molluscabase, WoRMS, Catalogue Of Life that all use the same GUID. Knowing which is the authoritative one seems trivial by looking at the domain, but I would expect we better have some metadata about that on the dataset level. I am sure GUIDs will not appear once only.
There can't be a "non unique GUID". It's in the name: "Globally unique..." As Bart says at the top, his taxa are urn:lsid:marinespecies.org:... and those in Tim's example are urn:lsid:indexfungorum:...
I don't think it matters which name list is authoritative! Only that the user can see which was used. As they can, when the urn:lsid: format is used. [Note: To be fair, our scientificNameID are in the form https://www.marinespecies.org/aphia.php?p=taxdetails&id=1, as required by EurOBIS, which we have argued is wrong, particularly when they use urn: for other vocabs!]
The most distressing thing about this issue is that I can see the simple solution to my #934 is to remove scientificName from my datasets! It will make the data less useful to GBIF but at least it won't be wrong! And OBIS will be happy.
In any case, it's wrong for GBIF to make assumptions about my data.
Hi folks
To try and address some of the challenges I think we could make a good step forward with a fairly simple solution. What do people think about the following, please?
Taking this record as an example, it comes with:
scientificName: Megaptera novaeangliae
scientificNameID: urn:lsid:marinespecies.org:taxname:137092
In the processing we could do the following:
scientificNameID contains an identifier we've enabled in configuration based on the prefix of urn:lsid:marinespecies.org
nubKey (the backbone key) which we'd then use to populate the names and necessary backbone identifiers for the record
This approach would use the identifier mapping to find things in the GBIF backbone which is a more robust mapping than the names-based lookup service.
There will always be some inconsistency due to the publishing cycle (e.g. occurrence records with names not in the latest WoRMS dataset) but it would at least 1) improve the homonym cases, and 2) improve the cases where only IDs are provided.
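The proposed processing can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipelines implementation: `ENABLED_PREFIXES`, `EXAMPLE_INDEX`, `checklist_for` and `resolve_nub_key` are hypothetical names, the in-memory dictionary stands in for whatever index of the checklist would really be used, and the nubKey value is made up. The WoRMS dataset key is the one linked earlier in this thread.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumption: ID prefixes enabled in configuration, mapped to the key of the
# checklist dataset that defines those identifiers.
ENABLED_PREFIXES: Dict[str, str] = {
    "urn:lsid:marinespecies.org:": "2d59e5db-57ad-41ff-97d6-11f5fb264527",
}

# Hypothetical stand-in for an index built from the checklist:
# (checklist dataset key, identifier) -> nubKey. The nubKey here is made up.
ChecklistIndex = Dict[Tuple[str, str], int]
EXAMPLE_INDEX: ChecklistIndex = {
    ("2d59e5db-57ad-41ff-97d6-11f5fb264527",
     "urn:lsid:marinespecies.org:taxname:137092"): 1000001,
}


def checklist_for(scientific_name_id: str) -> Optional[str]:
    """Return the checklist dataset key for an enabled prefix, else None."""
    for prefix, dataset_key in ENABLED_PREFIXES.items():
        if scientific_name_id.startswith(prefix):
            return dataset_key
    return None


def resolve_nub_key(
    record: Dict[str, str],
    index: ChecklistIndex,
    match_by_name: Callable[[Dict[str, str]], Optional[int]],
) -> Optional[int]:
    """Prefer the identifier mapping; fall back to the name-based lookup."""
    name_id = record.get("scientificNameID", "").strip()
    dataset_key = checklist_for(name_id) if name_id else None
    if dataset_key is not None:
        nub_key = index.get((dataset_key, name_id))
        if nub_key is not None:
            return nub_key
    # Unknown prefix, or an ID the checklist does not (yet) contain.
    return match_by_name(record)
```

With the Megaptera novaeangliae record above, the identifier is found in the index and the name-based matcher is never called; a record without an enabled prefix goes straight to the name-based lookup, as today.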
To get a sense of which prefixes would be suitable to map against a checklist please see this:
SELECT substring(scientificNameID, 1, 15) as prefix, count(*) AS records
FROM prod_h.occurrence
GROUP BY substring(scientificNameID, 1, 15)
HAVING count(*)>250000
ORDER BY records DESC
(removing some noise) yields:
urn:lsid:marine 66526389
urn:lsid:itis.g 15637610
urn:lsid:dyntax 2303347
urn:lsid:biosci 1454750
urn:lsid:indexf 1065547
urn:lsid:ipni.o 448138
http://www.mari 296647
What do you think? Thanks
That looks good to me. That last row returned by your query is probably all from datasets submitted to EurOBIS!
There will always be some inconsistency due to the publishing cycle (e.g. occurrence records with names not in the latest WoRMS dataset)
I had to find the ID in WoRMS before I published the dataset. The only time that could happen is if the record was deleted, but that would be exceedingly rare (generally, invalid taxa are flagged as invalid but ironically an invalid ID is just as valid for the purpose!). I would expect other authorities to do the same.
Perfect @timrobertson100 !
This is great @timrobertson100 !! Thank you so much for this! I appreciate it!
Hi everyone,
there seems to be a misunderstanding regarding the EurOBIS format requirement of the scientificNameID field that I would just like to clarify.
In EurOBIS the only accepted format for scientificNameID is "urn:lsid:marinespecies.org:taxname:1" and NOT "https://www.marinespecies.org/aphia.php?p=taxdetails&id=1"
This misunderstanding may come from a different conversation that I had with Derek Broughton ( @auspex ) in March 2020 where at EurOBIS we do require a specific format for the fields measurementTypeID, measurementValueID and measurementUnitID. That format being the "URI" (as in "http://vocab.nerc.ac.uk/collection/P01/current/OCOUNT01/") instead the "Identifier" (as in "SDN:P01::OCOUNT01"). Which of course is arguable, but a completely different conversation.
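For readers unfamiliar with the two forms, here is a small sketch of converting between them. The mapping is inferred only from the pair of examples quoted above (it assumes the collection and concept codes carry over unchanged), and the function names are hypothetical.

```python
import re
from typing import Optional

# Patterns inferred from the two example values in this comment.
_SDN = re.compile(r"^SDN:([A-Za-z0-9]+)::([A-Za-z0-9_\-]+)$")
_URI = re.compile(
    r"^https?://vocab\.nerc\.ac\.uk/collection/([A-Za-z0-9]+)/current/([A-Za-z0-9_\-]+)/?$"
)


def sdn_to_uri(identifier: str) -> Optional[str]:
    """Turn 'SDN:P01::OCOUNT01' into the URI form, or None if it does not match."""
    match = _SDN.match(identifier.strip())
    if match is None:
        return None
    collection, concept = match.groups()
    return f"http://vocab.nerc.ac.uk/collection/{collection}/current/{concept}/"


def uri_to_sdn(uri: str) -> Optional[str]:
    """Turn the URI form back into 'SDN:<collection>::<concept>', or None."""
    match = _URI.match(uri.strip())
    if match is None:
        return None
    collection, concept = match.groups()
    return f"SDN:{collection}::{concept}"
```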
I hope this helps,
Thanks
@rubenpp7 - thanks for that. I can't look in detail now but there are 57 datasets using http://www.mari... in scientificNameID. I've listed the dataset keys and record count in this CSV in case someone wishes to contact the originator.
@rubenpp7 You're right, I got that backwards. In which case, I can't help reduce the inconsistency: ours are urn:lsid:marinespecies.org:taxname:
@timrobertson100 thanks a lot for the list, I have randomly checked 12 of those datasets and all of them are Published by the Australian Antarctic Data Centre. I'll point them at this github issue 😄
Thanks. To save effort - two publishers are involved. The other is this one with 3 records.
I'll point them at this github issue
Thank you
Great, I contacted them as well, hopefully we will get some answers from them here soon.
Cheers
I've put together an implementation and have started running some tests on data. Please can I have some guidance on how people would expect this to work for this kind of situation?
Given a record with the following:
scientificNameID: urn:lsid:marinespecies.org:taxname:1328690
kingdom: Animalia
phylum: Mollusca
class: Gastropoda Cuvier, 1797
order: Stylommatophora Schmidt, 1856
family: Helicidae Rafinesque, 1815
genus: Helix Linnaeus, 1758
scientificName: Helix paromphala Lowe,
In production today, we ignore scientificNameID and run this lookup which detects it to be Helix poromphala R.T.Lowe, 1852, a synonym of Discula polymorpha poromphala (R.T.Lowe, 1852). Assuming paromphala is a misspelling(???) of poromphala the name-based approach seems to agree with WoRMS.
However, if we use the scientificNameID, which is the WoRMS concept of Euhadra peliomphala (L. Pfeiffer, 1850), the lookup would lead us to the Euhadra peliomphala (L.Pfeiffer, 1850) concept in GBIF (note it has the basionym of Helix peliomphala L.Pfeiffer, 1850 which might suggest some fuzzy name match or perhaps mistyping when the person was looking up the ID to put on the record?).
This is one of several examples in the first records I've explored that seem to have contradictions between the name and what the ID resolves to. Please do point out anything I am doing obviously wrong too and bear in mind all these names are unknown to me.
Thanks
I'd say, there's likely something wrong with that, in any case, as the first thing that happens when I go to the WoRMS link is that it tells me "Oops! This taxon is out of scope! The taxon you have searched for is non-marine." Which suggests it's either not what it was identified as, or WoRMS is not the definitive source for identification (in this case, poromphala is also terrestrial, so perhaps the urn should have been MolluscaBase rather than WoRMS).
Personally, I don't think GBIF should even attempt to fill in the hierarchy if the scientificName and scientificNameId don't match. In this case, the scientificName doesn't even directly match anything in either WoRMS or GBIF. If GBIF does add the hierarchy, or finds any instance of fields not matching what you'd expect, I'd appreciate an email with a summary of the fields added or issues found.
I feel your approach is right--in this case it's simply identifying a case where there needs to be more QA of the dataset.
I'd say, there's likely something wrong with that,...
Oops. That's me again... I keep forgetting to use the correct GitHub account.
Just a quick point of clarification although I am by no means a WoRMS rep or anything. Just an OBIS node manager. WoRMS does have terrestrial names in it. You just have to click the little radio button in the upper right "marine only" from on to off. This does work and gives the full taxonomic hierarchy and all the usual WoRMS info https://www.marinespecies.org/aphia.php?p=taxdetails&id=1328690.
Also yes this is likely a fuzzy match issue. When I put "Helix paromphala" into the quick search the first result it gives me is the one the scientificNameID is identifying.
Yes, obviously WoRMS does have terrestrial species, but my point is that if the contributor is providing terrestrial data, WoRMS was likely not the right source for looking up the taxonomy. When I populate a dataset, that's one of the first QA checks: "is it marine?"
For the example of Tim:
Notice: I'm assuming that scientificNameID is coming from an occurrence record. We don't know if the name is coming from an OBIS dataset or not, so we cannot judge on marine vs. non-marine. And yes, WoRMS (Aphia) contains non-marine taxa too.
The process in the example is just fine:
Euhadra peliomphala (L. Pfeiffer, 1850), since somebody put effort into standardizing it.
This is actually a bad example (or user error), because the correct LSID is here
https://www.marinespecies.org/aphia.php?p=taxdetails&id=1504094 urn:lsid:marinespecies.org:taxname:1504094
In both approaches (using ID vs name lookup) the outcome is currently not correct.
So maybe another example @timrobertson100 ? :)
Thanks @bart-v, all
In both approaches (using ID vs name lookup) the outcome is currently not correct.
I'm not sure about this. I might overlook something, but the name-based approach seemingly does make that match (from above):
... run this lookup which detects it to be Helix poromphala R.T.Lowe, 1852, a synonym of Discula polymorpha poromphala (R.T.Lowe, 1852).
Regardless of that, @Markus and I agree with what you outline. I suspect I will have questions about how to apply an issue - e.g. I anticipate we'll see many scientificName containing abbreviated authorship when compared to the WoRMS concept name. Do we flag that? If not, we might find we get false positives/negatives in the flagging routine.
So maybe another example @timrobertson100 ? :)
I'll generate a report of everything that would change (without flagging, but that will come). It might serve as a useful report to approach the relevant publishers which would improve DQ in both GBIF and OBIS.
Sorry Tim, you are correct the name lookup is OK. Still that is only one example...
Using this utility I looked up the 293,862 unique classifications having WoRMS LSID. This file lists the 19,042 that would change the resulting taxon the records would link to - I haven't reviewed this yet myself beyond a cursory scan and check of a few records.
The file contains:
We're going to have to determine if - on balance - this seems like a good set of changes bearing in mind there will always be outliers and misuse.
This does not yet include any flagging.
@timrobertson100, many thanks for pulling those out and demonstrating the need for a disagreements protocol (https://discourse.gbif.org/t/millipedes-in-the-ocean/3991/7). Some of those scientificName entries have 15+ different scientificNameIDs from WoRMS.
The 19042 records are also very messy in the higher taxon fields and have numerous disagreements there, so I hope GBIF isn't planning to use those higher-taxon fields for resolving disagreements.
FYI, in data auditing DwC datasets for Pensoft data papers, nearly every time there's a scientificNameID field there are disagreements - either multiple sNIDs for sNs, or simply an incorrect sNID choice.
Thanks for taking the time to explore @Mesibov - those are helpful insights from your auditing experience.
To add - one disagreement I've spotted seems to be where the scientificNameID points to an accepted concept, while the names use a synonym. Another common one seems to be similar names (misspellings perhaps?) that lead to a bad ID lookup, as we suspect from the Helix paromphala example above.
Can someone shed light on how WoRMS identifiers are assigned to OBIS records? Is this a purely manual process or are scripts, fuzzy lookups etc involved?
Thank you so much for looking into this! I don't want to say anything that I am unsure of. @pieterprovoost should be the best person to answer this when he is back next week.
WoRMS identifiers in OBIS can be assigned in different ways, ordered from most to least common:
Fuzzy lookups are supported in (1) & (2)
I think we should interpret using the scientificNameId, as it's then in the publisher's power to change their data unambiguously — and generally, if we receive identifiers we should expect them to have been assigned carefully.
An issue (TAXON_MISMATCH like COUNTRY_MISMATCH? Or TAXON_IDENTIFIER_CONFLICT?) where the lookup based on scientificNameId conflicts with that made on the name string parts is useful.
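A minimal sketch of that idea, assuming both lookups have already been run and each returns a backbone usage key (or nothing). The issue name `TAXON_MISMATCH` is only one of the candidates floated in this comment, and `Interpretation` and `interpret_taxon` are hypothetical names for the example.

```python
from typing import List, NamedTuple, Optional


class Interpretation(NamedTuple):
    usage_key: Optional[int]
    issues: List[str]


def interpret_taxon(id_match: Optional[int], name_match: Optional[int]) -> Interpretation:
    """Interpret using the identifier when available and flag any conflict.

    id_match is the usage key found via scientificNameID, name_match the one
    found from the name string parts; either may be None.
    """
    issues: List[str] = []
    if id_match is not None and name_match is not None and id_match != name_match:
        # Candidate issue name from this thread, not an existing GBIF flag.
        issues.append("TAXON_MISMATCH")
    usage_key = id_match if id_match is not None else name_match
    return Interpretation(usage_key, issues)
```

A plain key comparison like this would also flag the case where the ID points to an accepted concept and the name is one of its synonyms, so a real check would probably need to compare after resolving synonyms.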
100% agree with @MattBlissett and it should not just be the scientificNameID field but also dwc:taxonID and dwc:taxonConceptID if we understand those identifiers.
@MattBlissett: "An issue (TAXON_MISMATCH like COUNTRY_MISMATCH? Or TAXON_IDENTIFIER_CONFLICT?) where the lookup based on scientificNameId conflicts with that made on the name string parts is useful."
As demonstrated earlier, this is necessary, not just useful, because although you might "expect them [IDs] to have been assigned carefully", @timrobertson100 has provided data to show that in practice the coupling of name to ID can be sloppy.
For this flagging of an issue, are you also proposing that GBIF track all the scientificNameID sources and compare the one selected for a record with the name, or just WoRMS APHIA IDs?
Also, what will happen to the original scientificName and scientificNameID entries in a record where GBIF detects a conflict? Will both continue to appear in the interpreted record?
Further: @timrobertson100 found records with defective WoRMS IDs, like "urn:lsid:marinespecies.org:taxname:0000000000000000614620". Would that be another, separate issue to flag?
Further: @timrobertson100 found records with defective WoRMS IDs, like "urn:lsid:marinespecies.org:taxname:0000000000000000614620". Would that be another, separate issue to flag?
Yes!
We currently do not natively resolve ids, but instead look them up in checklists that are published to GBIF, e.g. WoRMS, ITIS, IPNI, ZooBank. That at least frees us from (temporarily) broken or slow infrastructure. But potentially there might be very new ids in use in occurrences which we have not yet seen in the published checklists (which are updated at different frequencies, often monthly). Or deprecated ones. Still I would think it is useful to know that there was an ID given which GBIF was not able to resolve in one way or another. I guess that would also be true for scientificNameID identifiers that are generally unknown to us or not globally unique, e.g. 1234? Maybe TAXON/NAME_ID_IGNORED flags would be useful?
I agree that we should have a "scientific name and identifier mismatch". In case of conflict, it would probably make sense to privilege the scientific name over the ID because it would make it more transparent to users. A scientific name is human readable and doesn't require checking an external source.
--
Not sure about the example above ("urn:lsid:marinespecies.org:taxname:0000000000000000614620") where it looks like we have a WoRMS prefix but the identifier doesn't exist. Is it something we could check and flag?
Either we could have a general warning for "scientific name ID not matched". But in that case, any scientific name ID that isn't in the sources we check would be flagged. Or we could have a flag only for the scientific name IDs where we have a prefix that we recognise, which might be a bit difficult to implement?
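The second option might look roughly like this: a sketch that separates "recognised prefix but ID not found" from "prefix we do not handle at all". Both issue names and the `name_id_issue` function are hypothetical, and the set of known IDs stands in for a lookup against the published checklist.

```python
from typing import Optional, Set

# Assumption: prefixes for which a checklist has been indexed.
KNOWN_PREFIXES = ("urn:lsid:marinespecies.org:taxname:",)


def name_id_issue(scientific_name_id: Optional[str], known_ids: Set[str]) -> Optional[str]:
    """Return a hypothetical issue name for a scientificNameID, or None if fine."""
    if not scientific_name_id or not scientific_name_id.strip():
        return None  # nothing supplied, nothing to flag
    value = scientific_name_id.strip()
    if not value.startswith(KNOWN_PREFIXES):
        # Not a source we check, e.g. a bare local number such as 1234.
        return "NAME_ID_IGNORED"
    if value not in known_ids:
        # Recognised prefix, but the checklist has no such identifier.
        return "NAME_ID_NOT_FOUND"
    return None
```

With a checklist containing urn:lsid:marinespecies.org:taxname:614620, the zero-padded variant quoted above would be reported as not found, while 1234 would be reported as ignored.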
This issue was moved from portal-feedback to pipelines
Example https://www.gbif-uat.org/occurrence/search?dataset_key=740cf4e0-37ca-4389-ba8f-4e1bc5177893&taxon_key=5401803
Lists the records as Oligochaeta and appends the authority "K.Koch" just like that. That makes these marine occurrences terrestrial plants...
While a scientificNameID urn:lsid:marinespecies.org:taxname:2036 is provided, that can be resolved to the animal class Oligochaeta.
This is a missed chance to fix homonyms in an easy way...