EOL / ContentImport

A placeholder for DATA tickets every time Jira is unavailable.

TreatmentBank trait data adjustment #13

Open KatjaSchulz opened 3 weeks ago

KatjaSchulz commented 3 weeks ago

For discussion.

Trait data issues

Exclude term matches in references, citations, taxon names

Many invalid trait data records seem to come from term matches in references or citations or names of collectors. We should try to exclude all references & citations from trait parsing. @eliagbayani If I remember correctly, we excluded references for the NLP keyword parsing experiments, so there should be some code we could re-use from that project? I don't remember if we also excluded citations (e.g. Ribeiro 1964: 5; Crosnier and Forest 1965) but those and the names of collectors (usually preceded by legit or coll.) we should also be able to catch based on simple rules. Examples:
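The simple rules mentioned above could look something like the sketch below. It is only an illustration: the two regular expressions and the function name `mask_non_trait_text` are assumptions, not code from the existing connectors. Patterns this loose will over-match (a month followed by a year, or "hind leg. The"), so they would need checking against real treatments before use.

```python
import re

# Assumed citation shapes: "Ribeiro 1964: 5", "Crosnier and Forest 1965",
# "Smith et al. 1999". Parenthesised years such as "Smith (1964)" are not handled.
CITATION = re.compile(
    r"\b[A-Z][\w'-]+"                                   # first author surname
    r"(?:\s+(?:and|&)\s+[A-Z][\w'-]+|\s+et\s+al\.)?"    # optional second author or "et al."
    r",?\s+\d{4}[a-z]?"                                 # year, optionally with a letter suffix
    r"(?::\s*\d+(?:[-–]\d+)?)?"                         # optional page or page range
)

# Assumed collector shapes: "leg. J. Smith", "coll. Smith", "legit A. B. Jones".
COLLECTOR = re.compile(
    r"\b(?:legit|leg\.|coll\.)\s+"                      # collector marker
    r"(?:[A-Z]\.\s*)*[A-Z][\w'-]+"                      # optional initials, then surname
)


def mask_non_trait_text(text: str) -> str:
    """Blank out citations and collector names before trait parsing."""
    text = COLLECTOR.sub(" ", text)
    text = CITATION.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s{2,}", " ", text).strip()


sample = "Found on coral reef (Ribeiro 1964: 5; Crosnier and Forest 1965), leg. J. Reef."
print(mask_non_trait_text(sample))
```

In the sample, the habitat term in "coral reef" survives while the collector surname "Reef" is masked, which is the kind of spurious match described above.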

Some invalid trait data records come from term matches in taxon names. We should try to exclude both scientific & vernacular names from trait parsing. Examples:

Habitat values from place names

I found quite a few invalid habitat mappings due to place names including a string matched to a habitat term. Sometimes such matches can provide good mappings, often they don't. We may be able to rescue some of these by applying a marine/terrestrial filter, but even then we probably leave many spurious data records in place. I don't see an easy way to filter out place names altogether. Some of them seem to be marked up in the treatments, and we should definitely exclude those from parsing for habitat data, but many are not recognized at the source. Examples:

Problematic or ambiguous geographic terms

We should probably start a list somewhere of dangerous geo terms that need some kind of a confirmation/disambiguation effort.

Suboptimal trait label matching

If there are multiple overlapping term matches, it would be good if we only used the longest possible substring match. This seems to work in a lot of cases already, but there are exceptions. Examples:
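One way to express the longest-match rule is sketched below. This is an assumption about how it could be done, not a description of what the Pensoft annotator or the connectors do; the function names and the example terms are made up for illustration.

```python
def find_matches(text, terms):
    """Return (start, end, term) for every case-insensitive occurrence of every term."""
    lowered = text.lower()
    hits = []
    for term in terms:
        needle = term.lower()
        start = lowered.find(needle)
        while start != -1:
            hits.append((start, start + len(needle), term))
            start = lowered.find(needle, start + 1)
    return hits


def keep_longest(hits):
    """Drop any match that overlaps a longer match that was already kept."""
    kept = []
    # Longest matches first; ties are broken by position so the result is deterministic.
    for hit in sorted(hits, key=lambda h: (-(h[1] - h[0]), h[0])):
        if all(hit[1] <= k[0] or hit[0] >= k[1] for k in kept):
            kept.append(hit)
    return sorted(kept)


# Hypothetical habitat labels, for illustration only.
terms = ["forest", "rain forest", "tropical rain forest"]
text = "lives in tropical rain forest and forest edge"
print([term for _, _, term in keep_longest(find_matches(text, terms))])
```

Matches that do not overlap anything longer are kept, so the standalone "forest" in the example still produces a record while the "forest" inside "tropical rain forest" does not.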

Taxonomic issues

Missing taxa

There are 7393 taxa with trait data that are not represented in the taxon file. See missingTaxonIDs.txt attached. If we cannot get the taxonomic data for these records, we should remove them.
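A minimal sketch of the removal step, assuming both files are tab-separated and that the trait records carry a `taxonID` column directly. In the actual archive the link may go through another file, so the column name and the helper names here are assumptions.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts, leaving quote characters alone."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def drop_orphan_records(records, taxa, key="taxonID"):
    """Split records into those whose taxon is known and the IDs that are missing."""
    known = {row[key] for row in taxa}
    kept = [r for r in records if r[key] in known]
    orphans = sorted({r[key] for r in records if r[key] not in known})
    return kept, orphans


# Tiny in-memory stand-ins for the taxon file and the trait records.
taxa = read_tsv(io.StringIO("taxonID\tscientificName\nT1\tAus bus\n"))
records = read_tsv(io.StringIO("taxonID\tmeasurementValue\nT1\tmarine\nT2\tterrestrial\n"))
kept, orphans = drop_orphan_records(records, taxa)
print(len(kept), orphans)
```

Returning the orphaned IDs alongside the kept records makes it easy to compare the result against a list such as missingTaxonIDs.txt.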

Higher taxa

Please remove all records for taxa that are NOT of rank species|variety|subspecies|form. There are over 90,000 of these records. Most of them are mismapped, i.e., the trait record is attached to a genus or family or worse, but the matched value is actually providing information for a species that is not picked up by the parser. Examples:
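The rank rule above could be applied as in the sketch below, assuming the taxon rows expose the rank in a Darwin Core `taxonRank` column and that rank values are spelled out in full; both are assumptions to verify against the taxon file.

```python
# Ranks to keep, taken from the request above.
KEEP_RANKS = {"species", "variety", "subspecies", "form"}


def split_by_rank(taxa, rank_field="taxonRank"):
    """Return (keep, drop): taxa at an accepted rank, and everything else."""
    keep, drop = [], []
    for row in taxa:
        # Missing or empty ranks are treated as not accepted.
        rank = (row.get(rank_field) or "").strip().lower()
        (keep if rank in KEEP_RANKS else drop).append(row)
    return keep, drop


taxa = [
    {"taxonID": "T1", "taxonRank": "species"},
    {"taxonID": "T2", "taxonRank": "Genus"},
    {"taxonID": "T3", "taxonRank": ""},
]
keep, drop = split_by_rank(taxa)
print([t["taxonID"] for t in keep], [t["taxonID"] for t in drop])
```

The taxonIDs in `drop` would then be used to remove the matching trait records.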

Misparsed or malformed names

Please remove all records for taxa that have one of the following strings in their scientific name values (case insensitive):

Please remove all records for:

This should accommodate cases where the scientificName value includes things like a subgenus name, a var./ssp./f. abbreviation or a hybrid character. Examples:

This should automatically take care of malformed names that cannot be rescued like:

There are some misparsed/malformed names that we should try to rescue:

Incorrect taxonomies

Over a thousand taxa have incorrect higher classification data leading to page mismappings. I found some misplaced taxa where the higher classification makes sense, just not for the taxa in question. But then there are also completely mixed-up higher classifications with crossovers between kingdoms or other major groups. These are not controversial or outdated taxonomies. They are just plain wrong. We should try to fix all of these or remove the higher classification data if we cannot confidently resolve them. @eliagbayani where do you get the higher classification data for these taxa?

Misplaced taxa - these are hard to spot. There may be many more. Examples:

Hierarchy mix-ups - I found a bunch of them just by scanning higher classification data in the taxon file. I am attaching a file (invalidHierarchies.txt) with the mixed-up hierarchies I found. There are probably many more that I didn't notice. Examples:

Incorrect treatment parsing at source

I stumbled upon one instance where the treatments were not properly parsed at the source, so the treatment for a given species actually contains multiple treatments for several species. When we parsed for traits, the traits then got mapped to the wrong species. I'm not sure if this is just an isolated case or what we can do about it if it is more widespread. It's difficult to detect because there are no obvious trait data conflicts as in the case of terrestrial taxa getting marine trait records. Examples:

KatjaSchulz commented 3 weeks ago

missingTaxonIDs.txt invalidHierarchies.txt

jhammock commented 3 weeks ago

Oh bother- sorry, Katja; I should have kept you better informed. Most of our large, non-branch-painted resources are currently riddled with ghost records. Jeremy was working on the delete-removed-records part of the harvest code recently but I don't think there's been any progress since the latest failed test.

Many, but not all, of your trait terminology issues above are now filtered out in most of those resource files, but they still manifest on the website. The "reef" issue remains, I think, though I want to say semi-overlapping tactics may have eliminated some of them.

You've also brought up a larger issue of strategy when we rely on textmining, and particularly, areas to skip if we already have good coverage from another source -> I've never done this in an organized way, but it's a good idea. Actually, I never did find a good source of reef checklists. It's been regionally piecemeal, even among the fishes. Nevertheless, there are probably terms we should consider eliminating from the textmined data simply because we have them from another source.

And something I've tried to give Eli time for, but we all know how that goes :) -> organizing the textmining filter code, for easier re-use among resources. It's probably time all three of us compared notes about what is in use where...

jhammock commented 3 weeks ago

I'm starting to miss jira :/

Eli, it is TreatmentBank, isn't it, for which we had to remove some records with ancestry that mixed... plants and insects? As far as I can tell, @KatjaSchulz, some records combine data for multiple species (e.g., the described species and its hosts) in that field. We filtered out one fairly narrow case, which I think was plants w/insects, but it sounds like it's worth creating a generalized filter for Incompatible Ancestors, using longer lists of names. I'm not sure how the computational weight of that check will scale with the number of names to compare...

I'm ok with just ditching records that fall into this category, unless one of you is keen on their salvage and thinks it'll be easier than I think...
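For what it's worth, a generalized Incompatible Ancestors check could be sketched as below. The group labels and the short name lists are hypothetical placeholders; the real lists would be much longer. Because each ancestry value is looked up in a dict, the cost per record depends on the number of ancestry fields rather than on the number of names in the lists.

```python
# Hypothetical lookup: each higher-taxon name maps to one major group.
GROUP_OF = {
    "Plantae": "plants", "Tracheophyta": "plants", "Magnoliopsida": "plants",
    "Animalia": "animals", "Arthropoda": "animals", "Insecta": "animals",
}

# Ancestry columns as described for taxon.tab.
ANCESTRY_FIELDS = ("kingdom", "phylum", "class", "order", "family", "genus")


def ancestor_groups(row):
    """Collect the major groups implied by a taxon's ancestry fields."""
    return {GROUP_OF[row[f]] for f in ANCESTRY_FIELDS if row.get(f) in GROUP_OF}


def has_incompatible_ancestors(row):
    """True when the ancestry fields point to more than one major group."""
    return len(ancestor_groups(row)) > 1


mixed = {"kingdom": "Plantae", "phylum": "Arthropoda", "class": "Insecta"}
clean = {"kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta"}
print(has_incompatible_ancestors(mixed), has_incompatible_ancestors(clean))
```

Names that are not in the lookup are ignored, so a record is only flagged when at least two of its ancestors are recognized and disagree.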

JRice commented 3 weeks ago

Truth! I do think I've found a viable solution to the problem, but have not had the means to fully test it yet, and now I'm bogged down with admin problems. Mmmmmmaybe I'll make some headway tomorrow, but that's seeming less likely as the day continues. :S

KatjaSchulz commented 3 weeks ago

Just to be clear. All of the issues above manifest in the resource file. None of them are due to ghost records. I started out investigating things in the graph, but then switched to the resource file when I realized that what I was seeing in the graph didn't add up. I only relied on the graph when checking page mappings.

jhammock commented 3 weeks ago

Oh, thanks for that. I'm surprised, but it could be that I'm thinking of filters implemented in other resources. We certainly sequester non-target source text (including references) somewhere, and the place name / habitat issue has come up before also. We definitely need shared access to a record of methods used in each connector. Eli, I don't want to make that more work than necessary, especially for you. Do you have any ideas how such a record could be maintained? I've never tried to navigate your connector repository, but possibly github could be useful somehow? If they don't provide a clever automated assist for this task, I'd settle for a shared notes document listing each textmined resource and a sketchy description of the filters in place.

jhammock commented 3 weeks ago

FTR, @eliagbayani is out today. Eli, no rush. :)

KatjaSchulz commented 3 weeks ago

Yes, there's no hurry. We can start implementing things gradually. I'll have another look and condense the above to an "easy things to do right now" list.

eliagbayani commented 3 weeks ago

@jhammock I have a general library (in one place, most of it at least) for such filters, but it is not in a readable plain-English format, nor is it shared in an easy-to-edit way. I can start a shared notes document (Google spreadsheet) and I will add to it everything I have in our connectors. Eventually I will also base all filter rules for our connectors on that spreadsheet. This answers the need for a common, accessible, editable place for all our filters. What do you think? Thanks.

jhammock commented 3 weeks ago

That sounds good to me!

eliagbayani commented 2 weeks ago

@KatjaSchulz TreatmentBank:
This is the EOL resource: https://content.eol.org/resources/562
This is the OpenData record: https://opendata.eol.org/dataset/treatmentbank/resource/8eaf28c4-67eb-4eb0-931e-7ac8f89a87cf
And the DwCA: https://editors.eol.org/eol_php_code/applications/content_server/resources/TreatmentBank_final.tar.gz
DwCA was last generated: Jan 19, 2024

Just an initial comment on [Incorrect taxonomies] -> [Misplaced taxa] -> [Milnesium tardigradum]: The taxon.tab has five (5) taxon entries for "Milnesium tardigradum". TreatmentBank gave each one a taxonID and its own page. Two are Tardigrada and three are Annelida. All have MoF entries. For this resource (TreatmentBank), I did not compute the higher classification using, let's say, a parentNameUsageID. I just copied the higher classification as it is. By the way, the taxon.tab doesn't have a higherClassification field, just the ancestry fields (kingdom, phylum, class, order, family, genus).

Seems we need to fix this during the creation of the DwCA. Choose only the names we want? Thanks.
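A rough sketch of how cases like this could be listed while building the DwCA, assuming the taxon rows have `scientificName` and `phylum` columns and that names are compared as exact strings. The function name is made up, and flagged names would still need review, since a name shared across phyla can also be a legitimate homonym.

```python
from collections import defaultdict


def conflicting_names(taxa, name_field="scientificName", rank_field="phylum"):
    """Map each name to its phyla when the same name appears under more than one."""
    seen = defaultdict(set)
    for row in taxa:
        value = (row.get(rank_field) or "").strip()
        if value:
            seen[row[name_field]].add(value)
    return {name: sorted(values) for name, values in seen.items() if len(values) > 1}


# Illustrative rows mirroring the case above: two Tardigrada, three Annelida.
taxa = (
    [{"scientificName": "Milnesium tardigradum", "phylum": "Tardigrada"}] * 2
    + [{"scientificName": "Milnesium tardigradum", "phylum": "Annelida"}] * 3
)
print(conflicting_names(taxa))
```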

KatjaSchulz commented 2 weeks ago

Yes, these are the files I checked.

Yes, I created the higherClassification from those fields. It's kind of the same thing. It looks like the EOL name matching algorithm uses information from these fields, resulting in failure to match when the higher classification in the fields is wrong. So it looks like TreatmentBank is the source of all these incorrect taxonomies. Before we try to mitigate this on our end, let me report the problem to them. Maybe they can fix it.

KatjaSchulz commented 2 weeks ago

https://github.com/plazi/community/issues/291

eliagbayani commented 2 weeks ago

@jhammock @KatjaSchulz Here is the list of files we used in textmining using the Pensoft annotator. I placed them all in GitHub. I stopped using Google sheets early on as it is simpler to just maintain these files in GitHub. Our connectors now also use these TSV files for the respective lists.

KatjaSchulz commented 2 weeks ago

Thanks Eli, great stuff!

jhammock commented 1 week ago

@eliagbayani as plazi is taking action on Katja's error report, I think we might hold off on tackling the taxonomic issues until it's clear what will remain.

KatjaSchulz commented 1 week ago

They are only taking care of one of the taxonomic issues: the problematic higher classification data. The other taxonomic issues (missing taxa, higher taxa, misparsed & malformed names) need to be addressed on our end. I should probably make another report for the incorrect treatment parsing at source issue. I just have the one example, but maybe they have a way to look for more.

KatjaSchulz commented 1 week ago

https://github.com/plazi/community/issues/305

jhammock commented 1 week ago

Ah, Roger that. I'll let you two determine where Eli might as well be starting.

eliagbayani commented 1 week ago

@KatjaSchulz Do I need to rescue names like this: 'Chloroclystis' (Rhinoprora) rufitincta (Warren 1898) (link), reducing it to its canonical simple form: Chloroclystis rufitincta?

Another: 'Asthena' (Asthena) argyrorrhytes Prout, 1916 (link)

KatjaSchulz commented 1 week ago

Sigh, I wish people would stop mutilating names like this. If it's easy to do, it would be great if we could rescue them. But if it makes the code too complicated, we can also skip them.
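If the rescue turns out to be worth doing, it might not need much code. The sketch below is an assumption about one way to reduce the two names quoted above to a canonical simple binomial; `canonical_simple` is a hypothetical helper, it only handles binomials, and it returns `None` for anything it cannot confidently parse so those names can be skipped.

```python
import re


def canonical_simple(name):
    """Reduce a name with a quoted genus and a subgenus to 'Genus epithet', or None."""
    # Remove straight and curly quote characters wrapped around the genus.
    cleaned = re.sub(r"[\"'‘’“”]", "", name)
    # Remove parenthesised chunks: the subgenus and any parenthesised authorship.
    cleaned = re.sub(r"\([^)]*\)", " ", cleaned)
    tokens = cleaned.split()
    # The first token must look like a genus: one capital letter, then lowercase.
    if not tokens or not re.fullmatch(r"[A-Z][a-z-]+", tokens[0]):
        return None
    # The second token must look like a species epithet: all lowercase.
    if len(tokens) > 1 and re.fullmatch(r"[a-z-]+", tokens[1]):
        return f"{tokens[0]} {tokens[1]}"
    return None


print(canonical_simple("'Chloroclystis' (Rhinoprora) rufitincta (Warren 1898)"))
# prints: Chloroclystis rufitincta
```

Names with infraspecific epithets or hybrid markers are outside what this sketch covers and would fall under the removal rules discussed earlier in the thread.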