gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

DwC field scientificNameID is not used at all #217

Closed bart-v closed 1 year ago

bart-v commented 6 years ago

This issue was moved from portal-feedback to pipelines

Example https://www.gbif-uat.org/occurrence/search?dataset_key=740cf4e0-37ca-4389-ba8f-4e1bc5177893&taxon_key=5401803

Lists the records as Oligochaeta and appends the authority "K.Koch" just like that. That makes these marine occurrences terrestrial plants...

While a scientificNameID urn:lsid:marinespecies.org:taxname:2036 is provided, that can be resolved to the animal class Oligochaeta.

This is a missed chance to fix homonyms in an easy way...

MortenHofft commented 6 years ago

The link above uses UAT (user acceptance testing). The result might be the same, but the test and production environments are likely to differ much of the time.

I suggest that you use the production site https://www.gbif.org instead

MattBlissett commented 6 years ago

(UAT and the production environment are usually very similar for record interpretation.)

@mdoering can help explain what's happening here. I can see we have the name from the WoRMS checklist: https://www.gbif.org/species/105760798 but it isn't linked in to our taxonomic backbone. This one from a different checklist is, but it loses the author.

We don't yet match using the scientificNameId; we should look to support this for the ingestion pipeline rewrite (in progress this year).

Adding a kingdom will make this match to the correct name.

mdoering commented 6 years ago

Yes, scientificNameId is pretty much ignored in both occurrence and checklist processing. With a scientificNameId value from occurrences pointing to another checklist, or even to something outside of GBIF-indexed data, it will not be a simple exercise. All occurrences must link to a Backbone species, not some other checklist. For that to happen the backbone would a) have to have such a species (with that authorship) and b) know about the global scientificNameIds used in other lists. Maybe something for when we have switched the backbone to use CoL+.

ManonGros commented 5 years ago

Most of the links on this issue are deprecated. Plus, as far as I understand, this is not something that we can fix. Could we close this issue?

bart-v commented 5 years ago

This is not about dead links, but about the field dwc:scientificNameID being ignored by GBIF. I think it's a very important issue.

albenson-usgs commented 5 years ago

Agree this is an important issue, especially for OBIS node contributions. This really is a missed opportunity for GBIF as OBIS nodes take great care in assigning an appropriate scientificNameID to each occurrence. Would hate to see any records from the OBIS-USA node end up as terrestrial species when we've taken the time to provide the marine representation.

ManonGros commented 5 years ago

In that case, should it be an issue for the CoL+? https://github.com/Sp2000/colplus

mdoering commented 5 years ago

I am wondering about a few things here:

  1. why does the name have a classification? dwc:scientificNameID should point to nomenclatural information. Taxon concepts and classifications would be dwc:taxonConceptID or even just dwc:taxonID

  2. We never dealt with DwC archives pointing to external information. All archive IDs can be resolved locally within the archive. This is not true for dwc:scientificNameID

  3. For resolving external IDs there is no standard format, protocol or anything alike. It's quite a burden to know all variations in advance and issue HTTP calls to resolve each ID.

  4. Is there really extra information in the linked name data that would help us to better interpret the name & its classification? Isn't all that information already given in the DwC occurrence record?

Looking at one of the Oligochaeta Koch examples, I see the taxonomic DwC occurrence information is very sparse (https://www.gbif.org/occurrence/1324564024). It is just the name, not even a rank, kingdom or anything else. The ID would have made a difference here. But would it be difficult to enrich the occurrence data?

http://lsid.info/urn:lsid:marinespecies.org:taxname:2036

<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
    <rdf:Description rdf:about="urn:lsid:marinespecies.org:taxname:2036">
        <dc:type>ScientificName</dc:type>
        <dc:date>2019-10-03</dc:date>
        <dc:subject><![CDATA[Oligochaeta Grube, 1850]]></dc:subject>
        <dc:title><![CDATA[Oligochaeta]]></dc:title>
        <dc:relation><![CDATA[http://www.marinespecies.org/aphia.php?p=taxdetails&amp;id=2036]]></dc:relation>
        <dc:creator><![CDATA[Timm, Tarmo]]></dc:creator>
        <dc:creator><![CDATA[van Haaren, Ton]]></dc:creator>
        <dc:identifier>urn:lsid:marinespecies.org:taxname:2036</dc:identifier>
        <dc:publisher>World Register of Marine Species (WoRMS)</dc:publisher>
        <dc:license>http://creativecommons.org/licenses/by/4.0/</dc:license>
        <dc:language>en</dc:language>
        <dcterms:bibliographicCitation><![CDATA[WoRMS (2019). Oligochaeta. Accessed at: http://www.marinespecies.org/aphia.php?p=taxdetails&id=2036 on 2019-10-03]]></dcterms:bibliographicCitation>
        <dcterms:created>2004-12-21T16:54:05+01:00</dcterms:created>
        <dcterms:modified>2017-06-01T14:33:21+01:00</dcterms:modified>
        <dcterms:rightsHolder>WoRMS Editorial Board</dcterms:rightsHolder>
        <dwc:kingdom>Animalia</dwc:kingdom>
        <dwc:phylum>Annelida</dwc:phylum>
        <dwc:class>Clitellata</dwc:class>
        <dwc:order></dwc:order>
        <dwc:family></dwc:family>
        <dwc:genus></dwc:genus>
        <dwc:subgenus></dwc:subgenus>
        <dwc:specificEpithet></dwc:specificEpithet>
        <dwc:infraspecificEpithet></dwc:infraspecificEpithet>
        <dwc:taxonRank>subclass</dwc:taxonRank>
        <dwc:ScientificName><![CDATA[Oligochaeta Grube, 1850]]></dwc:ScientificName>
        <dwc:scientificNameAuthorship><![CDATA[Grube, 1850]]></dwc:scientificNameAuthorship>
        <dwc:taxonomicStatus><![CDATA[accepted]]></dwc:taxonomicStatus>
        <dwc:namePublishedIn><![CDATA[Grube, Adolf Eduard. (1850). Die Familien der Anneliden. <em>Archiv für Naturgeschichte, Berlin.</em> 16(1): 249-364.]]></dwc:namePublishedIn>
        <dwc:namePublishedInYear>1850</dwc:namePublishedInYear>
        <dwc:scientificNameID rdf:resource="urn:lsid:marinespecies.org:taxname:2036" />
        <dwc:parentNameUsageID rdf:resource="urn:lsid:marinespecies.org:taxname:14165" />
    </rdf:Description>
</rdf:RDF>

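
To make concrete what the resolved record adds to a name-only occurrence, here is a minimal Python sketch that reads the classification out of an RDF response like the one above. It assumes the document has already been fetched; the `SAMPLE_RDF` string is a trimmed copy of that response, and `extract_taxon_fields` and `FIELDS` are names invented for this illustration, not part of any GBIF or WoRMS tooling.

```python
import xml.etree.ElementTree as ET

# Namespaces declared in the RDF response.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}

# Trimmed copy of the response above, keeping only the fields read below.
SAMPLE_RDF = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <rdf:Description rdf:about="urn:lsid:marinespecies.org:taxname:2036">
    <dwc:kingdom>Animalia</dwc:kingdom>
    <dwc:phylum>Annelida</dwc:phylum>
    <dwc:class>Clitellata</dwc:class>
    <dwc:order></dwc:order>
    <dwc:taxonRank>subclass</dwc:taxonRank>
    <dwc:scientificNameAuthorship><![CDATA[Grube, 1850]]></dwc:scientificNameAuthorship>
  </rdf:Description>
</rdf:RDF>"""

# Hypothetical selection of Darwin Core terms worth copying onto a sparse record.
FIELDS = ("kingdom", "phylum", "class", "order", "taxonRank", "scientificNameAuthorship")


def extract_taxon_fields(rdf_text):
    """Return the non-empty Darwin Core fields of the first rdf:Description."""
    root = ET.fromstring(rdf_text)
    description = root.find("rdf:Description", NS)
    result = {}
    if description is None:
        return result
    for name in FIELDS:
        element = description.find(f"dwc:{name}", NS)
        # Empty elements such as <dwc:order></dwc:order> carry no information.
        if element is not None and element.text and element.text.strip():
            result[name] = element.text.strip()
    return result


fields = extract_taxon_fields(SAMPLE_RDF)
print(fields)
```

For this record the kingdom (Animalia) alone would be enough to separate the animal name from the plant homonym.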
mdoering commented 5 years ago

The point of (dwc) archives is that it is NOT linked data. But if we had a (WoRMS) checklist that defined those IDs we could cross reference them so the taxonomic information would not have to be repeated in the occurrences.

mdoering commented 5 years ago

In that case, should it be an issue for the CoL+? https://github.com/Sp2000/colplus

To some degree yes, but it is primarily an Occurrence interpretation issue

bart-v commented 5 years ago

To answer your questions @mdoering

  1. Point taken. WoRMS does not make a good distinction between names and concepts. This is work in progress.
  2. We can't make all providers of Occurrence data responsible for the names, and ask them to generate a Darwin Core Taxon extension
  3. True, so we need some custom code, no big deal right?
  4. Yes, there is. OBIS even advises to leave out any Taxon related field and focus on the scientificNameID because it's impossible to keep track of all the names & synonyms in the long run

You have a WoRMS checklist that defines those: https://www.gbif.org/dataset/2d59e5db-57ad-41ff-97d6-11f5fb264527

mdoering commented 5 years ago

I think referring to a known checklist like WoRMS and reusing their taxonIDs makes a lot of sense and GBIF should support that in the long run. @timrobertson100 maybe the pipelines project can be a good way to include such a taxonID lookup.

Still there are many detail questions, I have a few popping up immediately:

timrobertson100 commented 5 years ago

Thanks @bart-v @albenson-usgs

Currently dwc:scientificNameID just passes through ignored - but that just reflects the state of play when that codebase was written and the term was not well used. That is not the case today, and I agree GBIF should make use of it for cases when it clearly identifies e.g. WoRMS, IPNI, Index Fungorum records - especially as it is the OBIS recommendation to publishers.

I will move this issue into the gbif pipelines project, where we'll implement it working through the issues @mdoering raises. All effort right now is on making the new ingestion pipeline live.

timrobertson100 commented 5 years ago

For current links, almost all records in the Danish Mycological Society fungal records database contain a scientificNameID pointing to Index Fungorum, such as this example.

Edited to add: There are a few obscure records where this doesn't hold true, but they are rare.

bart-v commented 5 years ago

@mdoering about finding out what checklist (version) has been used, everything is solved by using a proper and persistent GUID (like LSID): it tells you what authority has been used, on a per record basis.

I don't understand this question

if we rely on globally unique ids (...) how do we know which checklist is the authority in case several checklists use these ids?

If it's a GUID, there is only one checklist that has assigned/generated this GUID, so there is nothing to choose from?

bart-v commented 5 years ago

Thanks @timrobertson100

mdoering commented 5 years ago

@bart-v a properly versioned LSID would tell you what it was when resolving it. But I doubt a DwC WoRMS archive contains all historical versions of a name or deleted names.

My point about a non-unique GUID is that there might be various datasets, e.g. MolluscaBase, WoRMS, Catalogue of Life, that all use the same GUID. Knowing which is the authoritative one seems trivial by looking at the domain, but I would expect we had better have some metadata about that on the dataset level. I am sure GUIDs will not appear once only.

bart-v commented 5 years ago

WoRMS could do versions but that is usually overkill. We hardly ever change names, but create new ones and point them to each other. We do keep track of deletions.

I agree that some metadata on dataset level is needed, indeed.

auspex commented 1 year ago

My point about a non-unique GUID is that there might be various datasets, e.g. MolluscaBase, WoRMS, Catalogue of Life, that all use the same GUID. Knowing which is the authoritative one seems trivial by looking at the domain, but I would expect we had better have some metadata about that on the dataset level. I am sure GUIDs will not appear once only.

There can't be a "non unique GUID". It's in the name: "Globally unique..." As Bart says at the top, his taxa are urn:lsid:marinespecies.org:... and those in Tim's example are urn:lsid:indexfungorum:...

I don't think it matters which name list is authoritative! Only that the user can see which was used. As they can, when the urn:lsid: format is used. [Note: To be fair, our scientificNameID are in the form https://www.marinespecies.org/aphia.php?p=taxdetails&id=1, as required by EurOBIS, which we have argued is wrong, particularly when they use urn: for other vocabs!]

The most distressing thing about this issue is that I can see the simple solution to my #934 is to remove scientificName from my datasets! It will make the data less useful to GBIF but at least it won't be wrong! And OBIS will be happy.

In any case, it's wrong for GBIF to make assumptions about my data.

timrobertson100 commented 1 year ago

Hi folks

To try and address some of the challenges I think we could make a good step forward with a fairly simple solution. What do people think about the following, please?

Taking this record as an example, it comes with:

scientificName: Megaptera novaeangliae
scientificNameID: urn:lsid:marinespecies.org:taxname:137092

In the processing we could do the following:

  1. Detect that scientificNameID contains an identifier we've enabled in configuration based on the prefix of urn:lsid:marinespecies.org
  2. We'd look that up against the reference checklist (we'd configure that prefix to point to the WoRMS checklist) using this API call
  3. The response has the nubKey (the backbone key) which we'd then use to populate the names and necessary backbone identifiers for the record

This approach would use the identifier mapping to find things in the GBIF backbone which is a more robust mapping than the names-based lookup service.
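
The three steps can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `PREFIX_TO_CHECKLIST`, `find_checklist`, `resolve_nub_key` and `stub_lookup` are hypothetical names, the lookup is injected as a function so that no particular API endpoint is assumed, and the `nubKey` value in the stub is a placeholder, not a real backbone key. The dataset key is the WoRMS checklist linked earlier in this thread.

```python
from typing import Callable, Dict, Optional

# Hypothetical configuration: identifier prefix -> dataset key of the reference
# checklist. The key below is the WoRMS checklist linked earlier in the thread.
PREFIX_TO_CHECKLIST: Dict[str, str] = {
    "urn:lsid:marinespecies.org": "2d59e5db-57ad-41ff-97d6-11f5fb264527",
}

Lookup = Callable[[str, str], Optional[dict]]


def find_checklist(scientific_name_id: str) -> Optional[str]:
    """Step 1: detect an enabled prefix and return its checklist dataset key."""
    for prefix, dataset_key in PREFIX_TO_CHECKLIST.items():
        if scientific_name_id.startswith(prefix + ":"):
            return dataset_key
    return None


def resolve_nub_key(scientific_name_id: str, lookup: Lookup) -> Optional[int]:
    """Steps 2 and 3: look the ID up in the checklist and return its nubKey."""
    dataset_key = find_checklist(scientific_name_id)
    if dataset_key is None:
        return None  # prefix not enabled: fall back to name-based matching
    usage = lookup(dataset_key, scientific_name_id)
    if usage is None:
        return None  # ID not present in the indexed checklist
    return usage.get("nubKey")


def stub_lookup(dataset_key: str, source_id: str) -> Optional[dict]:
    """Stand-in for the real checklist lookup; the nubKey is a placeholder."""
    known = {"urn:lsid:marinespecies.org:taxname:137092": {"nubKey": 12345}}
    return known.get(source_id)


print(resolve_nub_key("urn:lsid:marinespecies.org:taxname:137092", stub_lookup))
print(resolve_nub_key("urn:lsid:example.org:taxname:1", stub_lookup))
```

Returning `None` in both failure cases keeps the existing name-based lookup as the fallback, which matches the intent that only configured prefixes change behaviour.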

There will always be some inconsistency due to the publishing cycle (e.g. occurrence records with names not in the latest WoRMS dataset) but it would at least 1) improve the homonym cases, and 2) improve the cases where only IDs are provided.

To get a sense of which prefixes would be suitable to map against a checklist please see this:

SELECT substring(scientificNameID, 1, 15) as prefix, count(*) AS records 
FROM prod_h.occurrence 
GROUP BY substring(scientificNameID, 1, 15) 
HAVING count(*)>250000 
ORDER BY records DESC

(removing some noise) yields:

urn:lsid:marine 66526389
urn:lsid:itis.g 15637610
urn:lsid:dyntax 2303347
urn:lsid:biosci 1454750
urn:lsid:indexf 1065547
urn:lsid:ipni.o 448138
http://www.mari 296647
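
For anyone wanting to run the same tally on a local extract instead of the Hive table, a rough Python equivalent of the query is sketched below. `count_prefixes` is a hypothetical helper, the sample IDs are taken from examples in this thread, and empty values are simply skipped, which differs slightly from how SQL would group NULLs.

```python
from collections import Counter


def count_prefixes(ids, length=15, threshold=0):
    """Count IDs by their first `length` characters, most frequent first.

    Mirrors substring(scientificNameID, 1, 15) with HAVING count(*) > threshold.
    """
    counts = Counter(value[:length] for value in ids if value)
    return [(prefix, n) for prefix, n in counts.most_common() if n > threshold]


sample_ids = [
    "urn:lsid:marinespecies.org:taxname:137092",
    "urn:lsid:marinespecies.org:taxname:2036",
    "http://www.marinespecies.org/aphia.php?p=taxdetails&id=2036",
    None,
]

for prefix, records in count_prefixes(sample_ids):
    print(prefix, records)
```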

What do you think? Thanks

derek-mba commented 1 year ago

That looks good to me. That last row returned by your query is probably all from datasets submitted to EurOBIS!

There will always be some inconsistency due to the publishing cycle (e.g. occurrence records with names not in the latest WoRMS dataset)

I had to find the ID in WoRMS before I published the dataset. The only time that could happen is if the record was deleted, but that would be exceedingly rare (generally, invalid taxa are flagged as invalid but ironically an invalid ID is just as valid for the purpose!). I would expect other authorities to do the same.

bart-v commented 1 year ago

Perfect @timrobertson100 !

ymgan commented 1 year ago

This is great @timrobertson100 !! Thank you so much for this! I appreciate it!

rubenpp7 commented 1 year ago

Hi everyone,

there seems to be a misunderstanding regarding the EurOBIS format requirement of the scientificNameID field that I would just like to clarify.

In EurOBIS the only accepted format for scientificNameID is "urn:lsid:marinespecies.org:taxname:1" and NOT "https://www.marinespecies.org/aphia.php?p=taxdetails&id=1"

This misunderstanding may come from a different conversation that I had with Derek Broughton ( @auspex ) in March 2020, where at EurOBIS we do require a specific format for the fields measurementTypeID, measurementValueID and measurementUnitID. That format is the "URI" (as in "http://vocab.nerc.ac.uk/collection/P01/current/OCOUNT01/") instead of the "Identifier" (as in "SDN:P01::OCOUNT01"). Which of course is arguable, but a completely different conversation.

I hope this helps,

Thanks

timrobertson100 commented 1 year ago

@rubenpp7 - thanks for that. I can't look in detail now but there are 57 datasets using http://www.mari... in scientificNameID. I've listed the dataset keys and record count in this CSV in case someone wishes to contact the originator.
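
For publishers of those datasets, converting the URL form into the LSID form is mechanical. The Python sketch below is only an illustration, assuming the URL pattern quoted earlier in this thread (`aphia.php?p=taxdetails&id=...`); `aphia_url_to_lsid` is a hypothetical helper, not an existing WoRMS or GBIF function, and anything that does not match the pattern is left unconverted.

```python
import re
from urllib.parse import parse_qs, urlparse


def aphia_url_to_lsid(value):
    """Convert a WoRMS taxdetails URL to the urn:lsid form, or return None."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.hostname not in ("www.marinespecies.org", "marinespecies.org"):
        return None
    if parsed.path != "/aphia.php":
        return None
    query = parse_qs(parsed.query)
    if query.get("p") != ["taxdetails"]:
        return None
    ids = query.get("id", [])
    # Require exactly one plain positive integer without leading zeros.
    if len(ids) != 1 or not re.fullmatch(r"[1-9][0-9]*", ids[0]):
        return None
    return "urn:lsid:marinespecies.org:taxname:" + ids[0]


print(aphia_url_to_lsid("https://www.marinespecies.org/aphia.php?p=taxdetails&id=1"))
```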

derek-mba commented 1 year ago

@rubenpp7 You're right, I got that backwards. In which case, I can't help reduce the inconsistency: ours are urn:lsid:marinespecies.org:taxname:

rubenpp7 commented 1 year ago

@timrobertson100 thanks a lot for the list, I have randomly checked 12 of those datasets and all of them are Published by the Australian Antarctic Data Centre. I'll point them at this github issue 😄

timrobertson100 commented 1 year ago

Thanks. To save effort - two publishers are involved. The other is this one with 3 records.

I'll point them at this github issue

Thank you

rubenpp7 commented 1 year ago

Great, I contacted them as well; hopefully we will get some answers from them here soon.

Cheers

timrobertson100 commented 1 year ago

I've put together an implementation and have started running some tests on data. Please can I have some guidance on how people would expect this to work for this kind of situation?

Given a record with the following:

scientificNameID: urn:lsid:marinespecies.org:taxname:1328690
kingdom: Animalia
phylum: Mollusca
class: Gastropoda Cuvier, 1797
order: Stylommatophora Schmidt, 1856
family: Helicidae Rafinesque, 1815
genus: Helix Linnaeus, 1758
scientificName: Helix paromphala Lowe,

In production today, we ignore scientificNameID and run this lookup which detects it to be Helix poromphala R.T.Lowe, 1852, a synonym of Discula polymorpha poromphala (R.T.Lowe, 1852). Assuming paromphala is a misspelling(???) of poromphala the name-based approach seems to agree with WoRMS.

However, if we use the scientificNameID which is the WoRMS concept of Euhadra peliomphala (L. Pfeiffer, 1850) the lookup would lead us to the Euhadra peliomphala (L.Pfeiffer, 1850) concept in GBIF (note it has the basionym of Helix peliomphala L.Pfeiffer, 1850 which might suggest some fuzzy name match or perhaps mistyping when the person was looking up the ID to put on the record?).

This is one of several examples in the first records I've explored that seem to have contradictions between the name and what the ID resolves to. Please do point out anything I am doing obviously wrong too and bear in mind all these names are unknown to me.

Thanks

auspex commented 1 year ago

I'd say there's likely something wrong with that, in any case, as the first thing that happens when I go to the WoRMS link is that it tells me "Oops! This taxon is out of scope! The taxon you have searched for is non-marine." Which suggests it's either not what it was identified as, or WoRMS is not the definitive source for identification (in this case, poromphala is also terrestrial, so perhaps the urn should have been MolluscaBase rather than WoRMS).

Personally, I don't think GBIF should even attempt to fill in the hierarchy if the scientificName and scientificNameId don't match. In this case, the scientificName doesn't even directly match anything in either WoRMS or GBIF. If GBIF does add the hierarchy, or finds any instance of fields not matching what you'd expect, I'd appreciate an email with a summary of the fields added or issues found.

I feel your approach is right--in this case it's simply identifying a case where there needs to be more QA of the dataset.

derek-mba commented 1 year ago

I'd say, there's likely something wrong with that,...

Oops. That's me again... I keep forgetting to use the correct GitHub account.

albenson-usgs commented 1 year ago

Just a quick point of clarification although I am by no means a WoRMS rep or anything. Just an OBIS node manager. WoRMS does have terrestrial names in it. You just have to click the little radio button in the upper right "marine only" from on to off. This does work and gives the full taxonomic hierarchy and all the usual WoRMS info https://www.marinespecies.org/aphia.php?p=taxdetails&id=1328690.

albenson-usgs commented 1 year ago

Also yes this is likely a fuzzy match issue. When I put "Helix paromphala" into the quick search the first result it gives me is the one the scientificNameID is identifying.

derek-mba commented 1 year ago

Yes, obviously WoRMS does have terrestrial species, but my point is that if the contributor is providing terrestrial data, WoRMS was likely not the right source for looking up the taxonomy. When I populate a dataset, that's one of the first QA checks: "is it marine?"

bart-v commented 1 year ago

For the example of Tim:

Notice: I'm assuming that scientificNameID is coming from an occurrence record. We don't know if the name is coming from an OBIS dataset or not, so we cannot judge on marine vs. non-marine. And yes, WoRMS (Aphia) contains non-marine taxa too.

The process in the example is just fine:

  1. give priority to the scientificNameID, and map it to urn:lsid:marinespecies.org:taxname:1328690 Euhadra peliomphala (L. Pfeiffer, 1850), since somebody put effort to standardize it.
  2. mark the record as "an issue"
  3. later updates might update or correct the LSID

This is actually a bad example (or user error), because the correct LSID is here https://www.marinespecies.org/aphia.php?p=taxdetails&id=1504094 urn:lsid:marinespecies.org:taxname:1504094

In both approaches (using ID vs name lookup) the outcome is currently not correct.

So maybe another example @timrobertson100 ? :)

timrobertson100 commented 1 year ago

Thanks @bart-v, all

In both approaches (using ID vs name lookup) the outcome is currently not correct.

I'm not sure about this. I might overlook something, but the name-based approach seemingly does make that match (from above):

... run this lookup which detects it to be Helix poromphala R.T.Lowe, 1852, a synonym of Discula polymorpha poromphala (R.T.Lowe, 1852).

Regardless of that, @Markus and I agree with what you outline. I suspect I will have questions about how to apply an issue - e.g. I anticipate we'll see many scientificName containing abbreviated authorship when compared to the WoRMS concept name. Do we flag that? If not, we might find we get false positives/negatives in the flagging routine.

So maybe another example @timrobertson100 ? :)

I'll generate a report of everything that would change (without flagging, but that will come). It might serve as a useful report to approach the relevant publishers which would improve DQ in both GBIF and OBIS.

bart-v commented 1 year ago

Sorry Tim, you are correct the name lookup is OK. Still that is only one example...

timrobertson100 commented 1 year ago

Using this utility I looked up the 293,862 unique classifications having WoRMS LSID. This file lists the 19,042 that would change the resulting taxon the records would link to - I haven't reviewed this yet myself beyond a cursory scan and check of a few records.

The file contains:

We're going to have to determine if - on balance - this seems like a good set of changes bearing in mind there will always be outliers and misuse.

This does not yet include any flagging.

Mesibov commented 1 year ago

@timrobertson100, many thanks for pulling those out and demonstrating the need for a disagreements protocol (https://discourse.gbif.org/t/millipedes-in-the-ocean/3991/7). Some of those scientificName entries have 15+ different scientificNameIDs from WoRMS.

The 19042 records are also very messy in the higher taxon fields and have numerous disagreements there, so I hope GBIF isn't planning to use those higher-taxon fields for resolving disagreements.

FYI, in data auditing DwC datasets for Pensoft data papers, nearly every time there's a scientificNameID field there are disagreements - either multiple sNIDs for sNs, or simply an incorrect sNID choice.
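
The "multiple sNIDs for one sN" check is easy to run before publishing. Below is a minimal Python sketch, assuming records are available as dictionaries keyed by Darwin Core term; `names_with_multiple_ids` is a hypothetical helper and the sample rows merely combine names and IDs mentioned in this thread for illustration.

```python
from collections import defaultdict


def names_with_multiple_ids(records):
    """Return {scientificName: sorted IDs} for names paired with more than one ID."""
    ids_by_name = defaultdict(set)
    for record in records:
        name = (record.get("scientificName") or "").strip()
        name_id = (record.get("scientificNameID") or "").strip()
        if name and name_id:
            ids_by_name[name].add(name_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }


# Illustrative rows only; the pairing of names and IDs is made up for the demo.
sample_records = [
    {"scientificName": "Helix paromphala Lowe,",
     "scientificNameID": "urn:lsid:marinespecies.org:taxname:1328690"},
    {"scientificName": "Helix paromphala Lowe,",
     "scientificNameID": "urn:lsid:marinespecies.org:taxname:1504094"},
    {"scientificName": "Megaptera novaeangliae",
     "scientificNameID": "urn:lsid:marinespecies.org:taxname:137092"},
]

conflicts = names_with_multiple_ids(sample_records)
print(conflicts)
```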

timrobertson100 commented 1 year ago

Thanks for taking the time to explore @Mesibov - those are helpful insights from your auditing experience.

To add - one disagreement I've spotted seems to be where the scientificNameID points to an accepted concept, whereas the name given is a synonym. Another common one seems to be similar names (misspellings perhaps?) that lead to a bad ID lookup, as we suspect from the Helix paromphala example above.

mdoering commented 1 year ago

Can someone shed light on how WoRMS identifiers are assigned to OBIS records? Is this a purely manual process or are scripts, fuzzy lookups etc involved?

ymgan commented 1 year ago

Thank you so much for looking into this! I don't want to say anything that I am unsure of. @pieterprovoost should be the best person to answer this when he is back next week.

bart-v commented 1 year ago

WoRMS identifiers in OBIS can be assigned in different ways, ordered from most to least common:

  1. taxon match https://www.marinespecies.org/aphia.php?p=match
  2. using the API directly https://www.marinespecies.org/rest/, or via client packages like https://docs.ropensci.org/worrms/ that consume the API
  3. manual

Fuzzy lookups are supported in (1) & (2)

MattBlissett commented 1 year ago

I think we should interpret using the scientificNameId, as it's then in the publisher's power to change their data unambiguously — and generally, if we receive identifiers we should expect them to have been assigned carefully.

An issue (TAXON_MISMATCH like COUNTRY_MISMATCH? Or TAXON_IDENTIFIER_CONFLICT?) where the lookup based on scientificNameId conflicts with that made on the name string parts is useful.
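
A rough Python sketch of that rule follows: prefer the identifier-based match and record an issue when it disagrees with the name-based match. `Interpretation` and `interpret` are hypothetical names, the taxon keys are placeholders, and `TAXON_MISMATCH` is just one of the flag names floated above, not a settled choice.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interpretation:
    taxon_key: Optional[int]
    issues: List[str] = field(default_factory=list)


def interpret(id_key: Optional[int], name_key: Optional[int]) -> Interpretation:
    """Prefer the ID-based backbone key; flag a conflict with the name-based key."""
    if id_key is None:
        # No usable identifier: keep today's behaviour and use the name match.
        return Interpretation(name_key)
    issues = []
    if name_key is not None and name_key != id_key:
        issues.append("TAXON_MISMATCH")  # proposed flag name, not final
    return Interpretation(id_key, issues)


# Placeholder keys, not real backbone identifiers.
print(interpret(id_key=111, name_key=222))
print(interpret(id_key=None, name_key=222))
```

A record whose name could not be matched at all (`name_key` is `None`) is not flagged here; whether that case deserves its own issue is a separate decision.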

mdoering commented 1 year ago

100% agree with @MattBlissett and it should not just be the scientificNameID field but also dwc:taxonID and dwc:taxonConceptID if we understand those identifiers.

Mesibov commented 1 year ago

@MattBlissett: "An issue (TAXON_MISMATCH like COUNTRY_MISMATCH? Or TAXON_IDENTIFIER_CONFLICT?) where the lookup based on scientificNameId conflicts with that made on the name string parts is useful."

As demonstrated earlier, this is necessary, not just useful, because although you might "expect them [IDs] to have been assigned carefully", @timrobertson100 has provided data to show that in practice the coupling of name to ID can be sloppy.

For this flagging of an issue, are you also proposing that GBIF track all the scientificNameID sources and compare the one selected for a record with the name, or just WoRMS APHIA IDs?

Also, what will happen to the original scientificName and scientificNameID entries in a record where GBIF detects a conflict? Will both continue to appear in the interpreted record?

Mesibov commented 1 year ago

Further: @timrobertson100 found records with defective WoRMS IDs, like "urn:lsid:marinespecies.org:taxname:0000000000000000614620". Would that be another, separate issue to flag?

mdoering commented 1 year ago

Further: @timrobertson100 found records with defective WoRMS IDs, like "urn:lsid:marinespecies.org:taxname:0000000000000000614620". Would that be another, separate issue to flag?

Yes! We currently do not natively resolve ids, but instead look them up in checklists that are published to GBIF, e.g. WoRMS, ITIS, IPNI, ZooBank. That at least frees us from (temporarily) broken or slow infrastructure. But potentially there might be very new ids in use in occurrences which we have not yet seen in the published checklists (which are updated at different frequencies, often monthly). Or deprecated ones. Still I would think it is useful to know that there was an ID given which GBIF was not able to resolve in one way or another. I guess that would also be true for scientificNameID identifiers that are generally unknown to us or not globally unique, e.g. 1234? Maybe TAXON/NAME_ID_IGNORED flags would be useful?
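
The different cases mentioned here (unknown or non-global identifiers, malformed identifiers, and well-formed identifiers missing from the published checklist) can be told apart with a small classifier. The Python sketch below is illustrative only: the status labels, `KNOWN_PREFIXES`, `WORMS_LSID` and `classify_id` are hypothetical, and treating leading zeros as malformed is an assumption.

```python
import re

# Hypothetical list of prefixes for which a checklist has been configured.
KNOWN_PREFIXES = ("urn:lsid:marinespecies.org:",)

# Assumed shape of a well-formed WoRMS LSID: a positive integer, no leading zeros.
WORMS_LSID = re.compile(r"urn:lsid:marinespecies\.org:taxname:[1-9][0-9]*")


def classify_id(scientific_name_id, known_ids):
    """Return a status label for an identifier checked against an indexed checklist."""
    if not scientific_name_id.startswith(KNOWN_PREFIXES):
        return "UNKNOWN_SOURCE"      # e.g. "1234": not globally unique, ignored
    if not WORMS_LSID.fullmatch(scientific_name_id):
        return "MALFORMED"           # recognised prefix, defective identifier
    if scientific_name_id not in known_ids:
        return "NOT_IN_CHECKLIST"    # too new, deprecated, or deleted
    return "OK"


checklist_ids = {"urn:lsid:marinespecies.org:taxname:137092"}

for candidate in (
    "urn:lsid:marinespecies.org:taxname:137092",
    "urn:lsid:marinespecies.org:taxname:0000000000000000614620",
    "urn:lsid:marinespecies.org:taxname:2036",
    "1234",
):
    print(candidate, classify_id(candidate, checklist_ids))
```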

ManonGros commented 1 year ago

I agree that we should have a "scientific name and identifier mismatch". In case of conflict, it would probably make sense to privilege the scientific name over the ID because it would make it more transparent to users. A scientific name is human readable and doesn't require checking an external source.

--

Not sure about the example above ("urn:lsid:marinespecies.org:taxname:0000000000000000614620") where it looks like we have a WoRMS prefix but the identifier doesn't exist. Is it something we could check and flag?

Either we could have a general warning for "scientific name ID not matched", in which case any scientific name ID that isn't in the sources we check would be flagged, or we could have a flag only for the scientific name IDs where we have a prefix that we recognise, which might be a bit difficult to implement?