TreatmentBank trait data adjustment

KatjaSchulz commented 3 weeks ago

For discussion.

Trait data issues

Exclude term matches in references, citations, taxon names

Many invalid trait data records seem to come from term matches in references or citations or names of collectors. We should try to exclude all references & citations from trait parsing. @eliagbayani If I remember correctly, we excluded references for the NLP keyword parsing experiments, so there should be some code we could re-use from that project? I don't remember if we also excluded citations (e.g. Ribeiro 1964: 5; Crosnier and Forest 1965) but those and the names of collectors (usually preceded by legit or coll.) we should also be able to catch based on simple rules. Examples:

Alpheus bouvieri (and many other marine crustaceans): forest - from a prolific author named Forest
Insecta: heath - from an author named Heath
Pseudochitinopoma beneliahuae: marsh - from a collector named Marsh
Neoperla leopoldina: iceberg - from the title of a cited article
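
A minimal sketch of the "simple rules" idea, assuming the treatment text is available as a plain string before term matching. The regular expressions and the function name `mask_non_target_text` are illustrative assumptions, not the filters currently used in the connectors. The citation pattern is deliberately greedy and will also mask strings like "In 1965", which should be harmless for habitat matching.

```python
import re

# Author-year citations such as "Ribeiro 1964: 5" or "Crosnier and Forest 1965".
# Illustrative pattern only; not the rule set used in the EOL connectors.
CITATION = re.compile(
    r"[A-Z][\w'-]+"                        # first author surname
    r"(?:\s+(?:and|&)\s+[A-Z][\w'-]+)?"    # optional second author
    r"(?:\s+et\s+al\.?)?"                  # optional "et al."
    r",?\s+\(?\d{4}[a-z]?\)?"              # year, optionally in parentheses
    r"(?::\s*\d+)?"                        # optional page number
)

# Collector statements such as "leg. Marsh", "legit Marsh" or "coll. J. Marsh".
COLLECTOR = re.compile(
    r"\b(?:leg(?:it)?|coll)\.?\s+(?:[A-Z]\.\s*)*[A-Z][\w'-]+"
)


def mask_non_target_text(text: str) -> str:
    """Blank out citations and collector names before trait term matching."""
    text = CITATION.sub(" ", text)
    text = COLLECTOR.sub(" ", text)
    # Collapse the gaps left behind by the masking.
    return re.sub(r"\s+", " ", text).strip()
```

Full reference sections would still need to be dropped separately, since this only targets inline citations and collector statements.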

Some invalid trait data records come from term matches in taxon names. We should try to exclude both scientific & vernacular names from trait parsing. Examples:

Apterichtus dunalailai (known from Vanuatu and Fiji): Malabar - from string: "It differs significantly from all but A. malabar ..."
Callitriche aucklandica (endemic to islands off Northern New Zealand): Antarctica - from several strings mentioning the closely related C. antarctica
Centrorhynchus aluconis: marsh - from string: "found in Ukraine in western marsh harrier"
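
One way this could look, as a sketch under stated assumptions: `known_names` is a hypothetical lookup of scientific and vernacular names (for example drawn from the taxon file or a vernacular name resource), and abbreviated binomials like "A. malabar" are caught by a simple pattern. Neither is confirmed to exist in the current pipeline.

```python
import re

# Abbreviated binomials such as "A. malabar" or "C. antarctica".
ABBREVIATED_BINOMIAL = re.compile(r"\b[A-Z]\.\s+[a-z][a-z-]+\b")


def mask_taxon_names(text: str, known_names: list[str]) -> str:
    """Remove taxon name mentions so they cannot match trait terms.

    known_names is a hypothetical list of scientific and vernacular names.
    """
    # Mask longer names first so "western marsh harrier" is removed whole
    # before any shorter name it contains.
    for name in sorted(known_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    text = ABBREVIATED_BINOMIAL.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

The vernacular case depends entirely on the quality of the name list; the abbreviated-binomial pattern needs no list but only helps where the genus is abbreviated.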

Habitat values from place names

I found quite a few invalid habitat mappings due to place names including a string matched to a habitat term. Sometimes such matches can provide good mappings, often they don't. We may be able to rescue some of these by applying a marine/terrestrial filter, but even then we probably leave many spurious data records in place. I don't see an easy way to filter out place names altogether. Some of them seem to be marked up in the treatments, and we should definitely exclude those from parsing for habitat data, but many are not recognized at the source. Examples:

source text: "reef" -> marine reef
- Many erroneous mappings for non-marine species. We could try to filter out insects, arachnids, amphibians, mammals & squamates, but I suspect that there are also many false positives among the marine taxa. Can't we get pretty good data on reef taxa from WoRMS or reef-specific checklists @jhammock ?
source text: "bay" -> bay
- Most of these trait records seem to come from matches in place names. Quite a few of them are for terrestrial taxa that don't actually occur in a bay or immediately next to a bay, e.g. Asphalidesmus golovatchi, Paracondeellum paradisum, Gossia vieillardii
source text: "hot spring" or "hotspring" -> hot spring
- Most mappings for this term are for organisms that cannot possibly live in hot springs (spiders, beetles, birds). There may be checklists for this habitat, too.
source text: "iceberg" -> marine iceberg
- None of the taxa mapped to this term occur on marine icebergs. Most mismappings are due to place name matches or use of the "tip of the iceberg" metaphor.
source text: "canal" -> canal
- Lots of spiders, insects and other terrestrial taxa are mapped due to string matches in place names.
- There are also quite a few invalid matches due to descriptions of alimentary canals, e.g., Metaphire taiwanensis and many other earthworms & millipedes.
source text: "mountain" -> mountain
- There are quite a few invalid mappings of marine taxa due to place name matches, e.g., Galapagomystides verenae, Sericosura dentatus, Allocareproctus unangas
source text: "orchard" -> orchard
- Some invalid mappings of marine taxa due to place name matches, e.g., Chone aurantiaca, Zelentia nepunicea
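
The marine/terrestrial filter mentioned above could be sketched roughly as follows. The `TERM_ENVIRONMENT` table and the per-taxon environment set are assumptions for illustration; where the environment data would come from (WoRMS, higher classification, or something else) is left open.

```python
# Hypothetical lookup: the environment each habitat term implies.
# Only terms with a clear marine or terrestrial meaning are listed.
TERM_ENVIRONMENT = {
    "marine reef": "marine",
    "marine iceberg": "marine",
    "mountain": "terrestrial",
    "orchard": "terrestrial",
}


def is_plausible(habitat_term: str, taxon_environments: set[str]) -> bool:
    """Return False when a habitat term contradicts the taxon's environments.

    taxon_environments is a hypothetical per-taxon set such as {"marine"}.
    Records are kept when either side is unknown, so the filter only
    removes clear conflicts.
    """
    required = TERM_ENVIRONMENT.get(habitat_term)
    if required is None or not taxon_environments:
        return True
    return required in taxon_environments
```

For example, `is_plausible("marine reef", {"terrestrial"})` is False, while a term that is not in the table, such as "canal", is always kept. As noted above, this would still leave spurious records among taxa whose environment matches the term.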

Problematic or ambiguous geographic terms

We should probably start a list somewhere of dangerous geo terms that need some kind of a confirmation/disambiguation effort.

source text: "mon" -> Mon State, Thailand
- Many mismatches due to apparent line breaks (e.g., Mon- golia, Mon- tana), Mon. abbreviations in references, etc. -- I suggest we either remove all records for this value or match only the explicit source text "Mon State".
source text: "malabar" -> Malabar, child of India
- While the Malabar (India) mapping is correct for most taxa, there are several that should be mapped to:
- Malabar (Australia), e.g., Atisne derelictus, Apterichtus malabar, Ampelisca dimboola, Arctides regalis
- Malabar (Florida), e.g., Lissohypnus fullertoni, Andricus fitzpatricki
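
A possible shape for the "dangerous geo terms" list, as a sketch only. The two tables and the function name `resolve_geo_term` are hypothetical; the idea is that some terms are mapped only when an explicit phrase is present, and others are routed to manual review instead of being mapped automatically.

```python
import re

# Terms that are only mapped when an explicit phrase occurs in the source text.
REQUIRED_CONTEXT = {
    "mon": re.compile(r"\bMon State\b"),
}

# Terms with several candidate places; these are flagged rather than mapped.
NEEDS_DISAMBIGUATION = {"malabar"}


def resolve_geo_term(term: str, source_text: str) -> str:
    """Return 'accept', 'reject', or 'review' for a matched geographic term."""
    key = term.lower()
    if key in NEEDS_DISAMBIGUATION:
        return "review"
    pattern = REQUIRED_CONTEXT.get(key)
    if pattern is None:
        return "accept"
    return "accept" if pattern.search(source_text) else "reject"
```

With this, a "mon" match in "Mon- golia" is rejected, while every "malabar" match is held for review until it can be assigned to India, Australia, or Florida.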

Suboptimal trait label matching

If there are multiple overlapping term matches, we should use only the longest one. This seems to work in a lot of cases already, but there are exceptions. Examples:

Scopalina kuyamu (a marine sponge): forest - Ideally, this would have been matched to "kelp forest" not just forest, because kelp forests aren't really forests.
Oxydromus humesi (a marine polychaete): marsh - Ideally, this would have been matched to "salt marsh" not just marsh, because both mentions of marsh in the treatment actually refer to salt marsh.
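
Longest-match preference can be sketched with a single alternation sorted by term length. This is an illustration of the technique, not a description of how the Pensoft annotator or the connectors resolve overlaps.

```python
import re


def longest_term_matches(text: str, terms: list[str]) -> list[str]:
    """Return matched terms, preferring the longest term at each position."""
    if not terms:
        return []
    # Sorting by length puts "kelp forest" before "forest" in the alternation,
    # so the regex engine tries the longer label first.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # finditer yields non-overlapping matches, so "forest" inside an
    # already matched "kelp forest" is not reported a second time.
    return [m.group(0).lower() for m in pattern.finditer(text)]
```

Note that this only helps when the longer label ("kelp forest", "salt marsh") is actually in the term list.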

Taxonomic issues

Missing taxa

There are 7393 taxa with trait data that are not represented in the taxon file. See missingTaxonIDs.txt attached. If we cannot get the taxonomic data for these records, we should remove them.

Higher taxa

Please remove all records for taxa that are NOT of rank species|variety|subspecies|form. There are over 90,000 of these records. Most of them are mismapped, i.e., the trait record is attached to a genus or family or worse, but the matched value is actually providing information for a species that is not picked up by the parser. Examples:

Misparsed or malformed names

Please remove all records for taxa that have one of the following strings in their scientific name values (case insensitive):

undefined|undetermined|incertae sedis

Please remove all records for:

taxa of rank species where the canonical name (simple) does not match [A-Z][a-z-]+ [a-z-]+
taxa of rank variety|subspecies|form where the canonical name (simple) does not match [A-Z][a-z-]+ [a-z-]+ [a-z-]+.

This should accommodate cases where the scientificName value includes things like a subgenus name, a var./ssp./f. abbreviation or a hybrid character. Examples:

scientificName: Cercyon (Acycreon) apiciflavus Hebauer 2002 -> canonical simple: Cercyon apiciflavus
scientificName: Galesus (G.) foersteri var. nigricornis Kieffer 1911 -> canonical simple: Galesus foersteri nigricornis
scientificName: Spartina ×townsendii H. Groves & J. Groves, Bot. Exch. Club Rep. 1880. 37. 1881. -> canonical simple: Spartina townsendii
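
The three removal rules above (rank, flagged strings, canonical name pattern) could be combined along these lines. This is a sketch that assumes the canonical name (simple) has already been produced upstream by a name parser; the function name `keep_taxon` and its arguments are hypothetical.

```python
import re

ALLOWED_RANKS = {"species", "variety", "subspecies", "form"}
BAD_NAME = re.compile(r"undefined|undetermined|incertae sedis", re.IGNORECASE)
BINOMIAL = re.compile(r"[A-Z][a-z-]+ [a-z-]+")
TRINOMIAL = re.compile(r"[A-Z][a-z-]+ [a-z-]+ [a-z-]+")


def keep_taxon(rank: str, scientific_name: str, canonical_simple: str) -> bool:
    """Apply the rank, flagged-string, and canonical-name rules to one taxon."""
    if rank not in ALLOWED_RANKS:
        return False
    if BAD_NAME.search(scientific_name):
        return False
    # Species need two name parts; infraspecific ranks need three.
    pattern = BINOMIAL if rank == "species" else TRINOMIAL
    # fullmatch requires the whole canonical name to fit the pattern.
    return pattern.fullmatch(canonical_simple) is not None
```

Using `fullmatch` rather than `search` matters here: a bare epithet or an abbreviated genus such as "C. italicus" fails the pattern instead of matching on a fragment.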

This should automatically take care of malformed names that cannot be rescued like:

Names with abbreviated genus. These can never be matched properly to an EOL page. Examples:
- R. crataegifolius Bunge Mém. Acad. Imp. Sci. St. - Pétersbourg Divers Savans 2: 98. 1835.
- C. italicus (Linnaeus, 1758)
Surrogates with weird characters. Examples:
- "Thiara" aspera (Lesson 1831) Lesson 1831
- Bombus (Thoracobombus)?pomorum (Panzer, 1805)

There are some misparsed/malformed names that we should try to rescue:

Species names without genus. There are a bunch of reasons why the genus name may not get parsed and the species name ends up being just the epithet. If it proves to be too challenging to fix these names, we should remove them. Examples:
- neglectus Van Loon, Boomsma & Andrasfalvy 1990 - Source has special character before genus name (#)
- atavus Cockerell 1920 - Source has special character before genus name (†)
- albolucens Prout 1916 - Name looks well-formed at source, but it has the subgenus in parentheses
- griseifrons Becker 1910 - Name malformed in page header.
Some scientificName values have uppercase epithets although the epithets at the source have the appropriate lower case. It's important to fix this because upper case epithets get improperly parsed as authority data, so the names cannot be properly matched. Examples:
Name strings with colons in the taxon name or between the name and the author. It's ok to have a colon in the author string if it's used to provide a page number. Names with these issues are often fine at the source, i.e., there's no colon, but sometimes the colon originates at the source. Based on the current sample of names, it would be safe to replace all occurrences of : in scientificName values with a space EXCEPT where the : is immediately followed by a number or a space and a number. Examples:
- Heteragrion azulum : Dunkle 1989 - no colon at the source
- Herpetopoma : Pilsbry 1890 - colon at the source
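
The colon rule can be written as a single negative lookahead. A minimal sketch; collapsing the leftover double spaces is an extra step assumed here, not part of the rule as stated.

```python
import re

# A colon that is NOT immediately followed by a digit, or by a space and a digit.
COLON_NOT_BEFORE_NUMBER = re.compile(r":(?! ?\d)")


def clean_colons(scientific_name: str) -> str:
    """Replace stray colons with a space, keeping page-number colons."""
    cleaned = COLON_NOT_BEFORE_NUMBER.sub(" ", scientific_name)
    # Tidy up the runs of spaces left where " : " used to be.
    return re.sub(r" {2,}", " ", cleaned).strip()
```

"Heteragrion azulum : Dunkle 1989" becomes "Heteragrion azulum Dunkle 1989", while a page reference in the style of "1964: 5" is left untouched.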

Incorrect taxonomies

Over a thousand taxa have incorrect higher classification data leading to page mismappings. I found some misplaced taxa where the higher classification makes sense, just not for the taxa in question. But then there are also completely mixed-up higher classifications with crossovers between kingdoms or other major groups. These are not controversial or out-dated taxonomies. They are just plain wrong. We should try to fix all of these or remove the higher classification data if we cannot confidently resolve them. @eliagbayani where do you get the higher classification data for these taxa?

Misplaced taxa - these are hard to spot. There may be many more. Examples:

Milnesium tardigradum: Animalia|Annelida|Polychaeta|Phyllodocida|Aphroditidae - This is actually a tardigrade: Animalia|Tardigrada.
Megalomma inflata: Animalia|Arthropoda|Insecta|Coleoptera|Carabidae - This is actually a polychaete Animalia|Annelida
Astylus tucumanensis: Animalia|Cnidaria|Hydrozoa|Anthoathecata|Stylasteridae - This is actually a beetle: Animalia|Insecta|Coleoptera|Melyridae
Martinezia excavaticollis: Animalia|Amoebozoa|Lobosa|Amoebida|Entamoebidae - This is actually a beetle: Animalia|Insecta|Coleoptera|Scarabaeidae
Iconella meruloides: Chromista|Ochrophyta|Bacillariophyceae - This is actually a wasp: Animalia|Arthropoda|Insecta|Hymenoptera|Braconidae

Hierarchy mix-ups - I found a bunch of them just by scanning higher classification data in the taxon file. I am attaching a file (invalidHierarchies.txt) with the mixed-up hierarchies I found. There are probably many more that I didn't notice. Examples:

Odontonia bagginsi: Animalia|Tracheophyta|Liliopsida|Decapoda|Palaemonidae - This is indeed a palaemonid decapod, but Tracheophyta|Liliopsida are plants.
Cycloporus variegatus, Cycloporus reticulatus: Fungi|Platyhelminthes||Polycladida|Stylostomidae - Platyhelminthes are not fungi.
Bothrocophias tulitoi: Animalia|Chordata|Aves|Squamata|Viperidae - a snake, Squamata is not a child of Aves (birds) in any classification
Malanea evenosa: Animalia|Tracheophyta|Magnoliopsida|Gentianales|Rubiaceae - hierarchy is correct, except for the kingdom
Biasticus griseocapillus: Animalia|Annelida|Clitellata|Hemiptera|Reduviidae - Hemiptera are not in Annelida in any hierarchy
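
Mix-ups like these can be detected mechanically if we keep a list of ancestor pairs that should never occur together. A sketch, with the pairs below taken only from the examples above; a usable list would need to be much longer, and the pipe-delimited `ancestry` string is an assumption about the input format.

```python
from itertools import combinations

# Pairs of higher taxa that cannot both be ancestors of the same taxon.
# Seeded from the examples in this issue; purely illustrative.
INCOMPATIBLE = {
    frozenset({"Animalia", "Tracheophyta"}),
    frozenset({"Fungi", "Platyhelminthes"}),
    frozenset({"Aves", "Squamata"}),
    frozenset({"Annelida", "Hemiptera"}),
}


def has_incompatible_ancestors(ancestry: str) -> bool:
    """Check a pipe-delimited ancestry string for a known incompatible pair."""
    # Skip empty ranks such as the missing class in "Fungi|Platyhelminthes||...".
    names = [n for n in ancestry.split("|") if n]
    return any(
        frozenset(pair) in INCOMPATIBLE for pair in combinations(names, 2)
    )
```

Because each record has only a handful of ancestors and each pair is a set lookup, the cost per record stays small no matter how long the pair list grows. This will not catch the misplaced taxa listed earlier, where the hierarchy is internally consistent but belongs to a different organism.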

Incorrect treatment parsing at source

I stumbled upon one instance where the treatments were not properly parsed at the source, so the treatment for a given species actually contains multiple treatments for several species. When we parsed for traits, the traits then got mapped to the wrong species. I'm not sure if this is just an isolated case or what we can do about it if it is more widespread. It's difficult to detect because there are no obvious trait data conflicts, unlike the case of terrestrial taxa getting marine trait records. Examples:

Enoplometopus debelius: Madagascar - Actually, Enoplometopus debelius occurs in New Caledonia, Indonesia, and Hawaii. "Madagascar" gets picked up from the Enoplometopus occidentalis treatment which is apparently not recognized as a separate treatment.

KatjaSchulz commented 3 weeks ago

missingTaxonIDs.txt invalidHierarchies.txt

jhammock commented 3 weeks ago

Oh bother- sorry, Katja; I should have kept you better informed. Most of our large, non-branch-painted resources are currently riddled with ghost records. Jeremy was working on the delete-removed-records part of the harvest code recently but I don't think there's been any progress since the latest failed test.

Many, but not all, of your trait terminology issues above are now filtered out in most of those resource files, but they still manifest on the website. The "reef" issue remains, I think, though I want to say semi-overlapping tactics may have eliminated some of them.

You've also brought up a larger issue of strategy when we rely on textmining, and particularly, areas to skip if we already have good coverage from another source -> I've never done this in an organized way, but it's a good idea. Actually, I never did find a good source of reef checklists. It's been regionally piecemeal, even among the fishes. Nevertheless, there are probably terms we should consider eliminating from the textmined data simply because we have them from another source.

And something I've tried to give Eli time for, but we all know how that goes :) -> organizing the textmining filter code, for easier re-use among resources. It's probably time all three of us compared notes about what is in use where...

jhammock commented 3 weeks ago

I'm starting to miss jira :/

Eli, it is TreatmentBank, isn't it, for which we had to remove some records with ancestry that mixed... plants and insects? As far as I can tell, @KatjaSchulz , some records combine data for multiple species (eg: the described species and its hosts) in that field. We filtered out one fairly narrow case, which I think was plants w/insects, but it sounds like it's worth creating a generalized filter for Incompatible Ancestors, using longer lists of names. I'm not sure how the computational weight of that check will scale with the number of names to compare...

I'm ok with just ditching records that fall into this category, unless one of you is keen on their salvage and thinks it'll be easier than I think...

JRice commented 3 weeks ago

Oh bother- sorry, Katja; I should have kept you better informed. Most of our large, non-branch-painted resources are currently riddled with ghost records. Jeremy was working on the delete-removed-records part of the harvest code recently but I don't think there's been any progress since the latest failed test.

Truth! I do think I've found a viable solution to the problem, but have not had the means to fully test it yet, and now I'm bogged down with admin problems. Mmmmmmaybe I'll make some headway tomorrow, but that's seeming less likely as the day continues. :S


KatjaSchulz commented 3 weeks ago

Just to be clear. All of the issues above manifest in the resource file. None of them are due to ghost records. I started out investigating things in the graph, but then switched to the resource file when I realized that what I was seeing in the graph didn't add up. I only relied on the graph when checking page mappings.

jhammock commented 3 weeks ago

Oh, thanks for that. I'm surprised, but it could be that I'm thinking of filters implemented in other resources. We certainly sequester non-target source text (including references) somewhere, and the place name / habitat issue has come up before also. We definitely need shared access to a record of methods used in each connector. Eli, I don't want to make that more work than necessary, especially for you. Do you have any ideas how such a record could be maintained? I've never tried to navigate your connector repository, but possibly github could be useful somehow? If they don't provide a clever automated assist for this task, I'd settle for a shared notes document listing each textmined resource and a sketchy description of the filters in place.

jhammock commented 3 weeks ago

FTR, @eliagbayani is out today. Eli, no rush. :)

KatjaSchulz commented 3 weeks ago

Yes, there's no hurry. We can start implementing things gradually. I'll have another look and condense the above into a list of easy things to do right now.

eliagbayani commented 3 weeks ago

Oh, thanks for that. I'm surprised, but it could be that I'm thinking of filters implemented in other resources. We certainly sequester non-target source text (including references) somewhere, and the place name / habitat issue has come up before also. We definitely need shared access to a record of methods used in each connector. Eli, I don't want to make that more work than necessary, especially for you. Do you have any ideas how such a record could be maintained? I've never tried to navigate your connector repository, but possibly github could be useful somehow? If they don't provide a clever automated assist for this task, I'd settle for a shared notes document listing each textmined resource and a sketchy description of the filters in place.

@jhammock I have a general library (in one place, most of it at least) for such filters but it is not in an English readable format nor in a shared easy-edit manner. I can start a shared notes (Google spreadsheet) and I will add to it everything I have in our connectors. And I will eventually also base all filter rules for our connectors in that spreadsheet. This answers the need for a common, accessible, editable place for all our filters. What do you think? Thanks.

jhammock commented 3 weeks ago

That sounds good to me!

eliagbayani commented 2 weeks ago

@KatjaSchulz TreatmentBank: This is the EOL resource: https://content.eol.org/resources/562 This is the OpenData record: https://opendata.eol.org/dataset/treatmentbank/resource/8eaf28c4-67eb-4eb0-931e-7ac8f89a87cf And the DwCA: https://editors.eol.org/eol_php_code/applications/content_server/resources/TreatmentBank_final.tar.gz DwCA was last generated: Jan 19, 2024

Just an initial comment on [Incorrect taxonomies] -> [Misplaced taxa] -> [Milnesium tardigradum] The taxon.tab has five (5) taxon entries for "Milnesium tardigradum". Each was given by TreatmentBank a taxonID and its own page. Two are Tardigrada and three are Annelida. All have MoF entries. For this resource (TreatmentBank), I did not compute the higher classification using let say a parentNameUsageID. I just copied the higher classification as it is. By the way, the taxon.tab doesn't have a higherClassification field but just the ancestry fields (kingdom phylum class order family genus).

https://treatment.plazi.org/id/039E6F5BB849D754FC036EB0D7DE55A0 Milnesium tardigradum Doyere 1840 Animalia Tardigrada Eutardigrada Apochela Milnesiidae Milnesium species Doyere 1840
https://treatment.plazi.org/id/03C0D659FFC6CB3AFE71E48D28509675 Milnesium tardigradum Doyere 1840 Animalia Tardigrada Eutardigrada Apochela Milnesiidae Milnesium species Doyere 1840
https://treatment.plazi.org/id/03ED740D32047E50FF6BFA556FC4FDC6 Milnesium tardigradum Kaczmarek 2017 Animalia Annelida Polychaeta Phyllodocida Aphroditidae Milnesium species Kaczmarek 2017 spec. nov.
https://treatment.plazi.org/id/03A987E9733DFFC81AB77177FBDCE18D Milnesium tardigradum Animalia Annelida Polychaeta Phyllodocida Aphroditidae Milnesium species spec. nov.
https://treatment.plazi.org/id/03B6C2705205FFFBFE8BFEEEFE5EFC77 Milnesium tardigradum Doyere Animalia Annelida Polychaeta Phyllodocida Aphroditidae Milnesium species Doyere

Seems we need to fix this during the creation of the DwCA. Choose only the names we want? Thanks.
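
To find other names in the same situation, the taxon entries could be grouped by name and checked for disagreeing ancestry. A rough sketch, assuming each row is a dict with a hypothetical `canonical` key (name without authorship) alongside the `phylum` ancestry field from taxon.tab.

```python
from collections import defaultdict


def conflicting_names(rows: list[dict]) -> dict[str, list[str]]:
    """Map each name to its phyla when entries disagree on the phylum."""
    phyla = defaultdict(set)
    for row in rows:
        phyla[row["canonical"]].add(row["phylum"])
    # Keep only names whose entries fall under more than one phylum.
    return {name: sorted(p) for name, p in phyla.items() if len(p) > 1}
```

For the five Milnesium tardigradum entries above this would report both Annelida and Tardigrada. The output is best treated as a review list rather than an automatic fix, since the same name can legitimately occur in different groups.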

KatjaSchulz commented 2 weeks ago

@KatjaSchulz TreatmentBank: This is the EOL resource: https://content.eol.org/resources/562 This is the OpenData record: https://opendata.eol.org/dataset/treatmentbank/resource/8eaf28c4-67eb-4eb0-931e-7ac8f89a87cf And the DwCA: https://editors.eol.org/eol_php_code/applications/content_server/resources/TreatmentBank_final.tar.gz DwCA was last generated: Jan 19, 2024

Yes, these are the files I checked.

Just an initial comment on [Incorrect taxonomies] -> [Misplaced taxa] -> [Milnesium tardigradum] The taxon.tab has five (5) taxon entries for "Milnesium tardigradum". Each was given by TreatmentBank a taxonID and its own page. Two are Tardigrada and three are Annelida. All have MoF entries. For this resource (TreatmentBank), I did not compute the higher classification using let say a parentNameUsageID. I just copied the higher classification as it is. By the way, the taxon.tab doesn't have a higherClassification field but just the ancestry fields (kingdom phylum class order family genus).

Yes, I created the higherClassification from those fields. It's kind of the same thing. It looks like the EOL name matching algorithm uses information from these fields, resulting in failure to match when the higher classification in the fields is wrong. So it looks like TreatmentBank is the source of all these incorrect taxonomies. Before we try to mitigate this on our end, let me report the problem to them. Maybe they can fix it.

KatjaSchulz commented 2 weeks ago

https://github.com/plazi/community/issues/291

eliagbayani commented 2 weeks ago

@jhammock @KatjaSchulz Here is the list of files we used in textmining using the Pensoft annotator. I placed them all in GitHub. I stopped using Google sheets early on as it is simpler to just maintain these files in GitHub. Our connectors now also use these TSV files for the respective lists.

KatjaSchulz commented 2 weeks ago

Thanks Eli, great stuff!

jhammock commented 1 week ago

@eliagbayani as plazi is taking action on Katja's error report, I think we might hold off on tackling the taxonomic issues until it's clear what will remain.

KatjaSchulz commented 1 week ago

They are only taking care of one of the taxonomic issues: the problematic higher classification data. The other taxonomic issues (missing taxa, higher taxa, misparsed & malformed names) need to be addressed on our end. I should probably make another report for the incorrect treatment parsing at source issue. I just have the one example, but maybe they have a way to look for more.

KatjaSchulz commented 1 week ago

https://github.com/plazi/community/issues/305

jhammock commented 1 week ago

Ah, Roger that. I'll let you two determine where Eli might as well be starting

eliagbayani commented 1 week ago

@KatjaSchulz Do I need to rescue names like this: 'Chloroclystis' (Rhinoprora) rufitincta (Warren 1898) link to its canonical simple: Chloroclystis rufitincta

another: 'Asthena' (Asthena) argyrorrhytes Prout, 1916 link

KatjaSchulz commented 1 week ago

Sigh, I wish people would stop mutilating names like this. If it's easy to do, it would be great if we could rescue them. But if it makes the code too complicated, we can also skip them.

EOL / ContentImport

TreatmentBank trait data adjustment #13
