CatalogueOfLife / general

The Catalogue of Life
How to detect chresonyms? #39

Open mdoering opened 6 years ago

mdoering commented 6 years ago

Some resources, e.g. the Reptile DB, contain many chresonyms for a name which the CoL would like to exclude. Manually flagging these names is very time-consuming and not really feasible at this scale. What rules can we apply to discover the real name and flag chresonyms so they can be discarded in the assembly process?

Mesibov commented 6 years ago

Hi, Markus.

You're very brave to try to tackle chresonyms on a "rules" basis, because a rule in the usual sense would have to refer to what librarians for many decades have called an "authority file" - an agreed standard listing, in this case of accepted names. Taxonomy has no authority file.

An alternative might be to allow variably sized clusters of related names, with the relations between them specified. Graph databasing is best for this, but it can also be done in a table. One of the related names could be designated "accepted" as an extra property, but that designation might change with taxonomic opinion. "Orthochresonymy" and "heterochresonymy" are relations and IMO more easily definable and changeable than "orthochresonym" and "heterochresonym" as entities.
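A minimal sketch of the "cluster of related names" idea as a plain relation table, in Python. Everything here is hypothetical: the class names, the relation labels used as strings, and the `Aus bus` example data are made up for illustration and do not come from any existing schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One row of the relation table: subject --kind--> target."""
    subject: str
    kind: str
    target: str


@dataclass
class NameCluster:
    """A variably sized cluster of related names plus the relations between them."""
    names: set[str] = field(default_factory=set)
    relations: list[Relation] = field(default_factory=list)
    accepted: str | None = None  # an opinion, so it is allowed to change

    def relate(self, subject: str, kind: str, target: str) -> None:
        """Record a relation and make sure both names belong to the cluster."""
        self.names.update((subject, target))
        self.relations.append(Relation(subject, kind, target))

    def usages_of(self, target: str, kind: str) -> list[str]:
        """All names that stand in relation `kind` to `target`."""
        return [r.subject for r in self.relations
                if r.target == target and r.kind == kind]


# Made-up example data: two later usages of one original name.
cluster = NameCluster()
cluster.relate("Aus bus sensu Jones 1950", "orthochresonymy", "Aus bus Smith 1900")
cluster.relate("Aus bus sensu Brown 1975", "heterochresonymy", "Aus bus Smith 1900")
cluster.accepted = "Aus bus Smith 1900"

print(cluster.usages_of("Aus bus Smith 1900", "heterochresonymy"))
```

Because the chresonymy lives on the relation rather than on the name, changing a taxonomic opinion means editing one row or the `accepted` property, not reclassifying the names themselves.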

You're also very brave to deal with the Reptile Database at all. Peter Uetz has put a lot of love and effort into it, but "Database" is a misnomer. He offers a checklist as an Excel file with numerous separate data items crowded into individual spreadsheet cells. The "dump" (the latest one I looked at is Dec 2014) contains two "tab-separated" text files which are structurally a mess (tabs and linefeeds) and which have an astonishingly high content of control and replacement character gibberish. These files are completely unusable without hours of rebuilding.

mjy commented 6 years ago

@mdoering I think we've hit this exact issue and worked a little on it over the last couple of months. It arose as we are working on moving existing Species Files into TaxonWorks. The latter uses a graph representation, as @Mesibov notes would be useful, to store all its nomenclature. The former has a good number of rules, but it also allowed for free-text nomenclators.

We have rules that facilitate matching nomenclators (species epithets as strings) against the "authority file", i.e. assertions that have been successfully translated into the graph. These sit as a middle layer between @dimus's Biodiversity gem and the TaxonWorks model. I strongly suspect that you could greatly narrow down the list of chresonyms using a similar approach. Without the full semantics of a graph I think @Mesibov is right in many aspects, but you could eliminate a lot of manual work because you'll be treating each GSD as the authority, and playing it off against itself.

The middle layer library is here: https://github.com/SpeciesFileGroup/taxonworks/blob/development/lib/vendor/biodiversity.rb. I suspect it would be relatively trivial for you to translate it, given what you seem to have available as documented in the new CoL API.

For reference the graph model is documented here- https://github.com/SpeciesFileGroup/taxonworks_doc/blob/master/concepts/TaxonWorksNomenclature.pdf. Sooooon we'll be translating all that into API doc like you've nicely done.

Mesibov commented 6 years ago

Hi, Matt.

Great to see the library in Ruby, I think I'm allergic to Java. The nomenclature graph also looks good, although I'm not sure how TaxonNameRelationship works? For Markus' benefit, mine (and anyone else interested), could you have a go at listing the non-overlapping relationships that might exist between names? (And why do I have the sneaking suspicion that Rich Pyle did this 20 years ago...?)

mjy commented 6 years ago

@Mesibov All the semantics come from NOMEN- https://github.com/SpeciesFileGroup/nomen, which we've completely hidden away in the interfaces. TaxonNameRelationships are object properties in OWL. Briefly, whenever you see an epithet, or any relation between monomials/protonyms you use a TNR to define that relationship (it really is a graph). TaxonNameClassifications are assertions (attributes) on a monomial/protonym.

Rich Pyle's work was definitely referenced in NOMEN, but we've worked out more technicalities (I think). Our model is also a "true" graph: nodes, edges, attributes on nodes. From my understanding we can traverse various aspects of this graph to reproduce Rich's model.

I should add that if you want to see it in action I'd be happy to set you up with a sandbox account.

mdoering commented 6 years ago

We based a lot of the CoL+ models on TCS especially the name relationship types: https://github.com/tdwg/tcs/blob/master/TCS101/v101.xsd#L683

See also the very useful guide with lots of examples: https://github.com/tdwg/tcs/blob/master/TCS101/UserGuidev_1.3.pdf

I wonder how well these relations map to NOMEN

mjy commented 6 years ago

To preface the spewing below: your question was about chresonyms, which to me clearly fall in the domain of nomenclature, not TCS. YMMV.

My knee-jerk reaction is that NOMEN has nothing to do with taxon concepts: it's about the rules of nomenclature, and therefore NOMEN is completely orthogonal to TCS (the intro to TCS clearly demarcates these worlds). NOMEN allows you to make a set of assertions and, in theory, infer with them; those assertions are not about biological concepts. Since TCS101 is about concepts, the two worlds of assertions do not overlap. If one wants to infer the existence of concepts based on assertions that reference NOMEN, that's up to them, but those inferences need to be recorded as such, likely as referenced in TCS.

I think you likely made the right decision to adopt TCS. While the CoL uses words like "synonym", it is a catalog of taxa (more specifically "species"). To my knowledge it doesn't claim to represent names; rather, its tips are assertions of the existence of a biologically meaningful entity. Attempting to overload TCS with rules of nomenclature in the context of the CoL may lead to "Bad Things".

In TW we'll be focusing on adopting Nico Franz' approach to managing assertions about the relationships between biological entities (taxon concepts), primarily because at its heart it's a logical model that facilitates inference (as, in theory, does NOMEN). https://docs.google.com/document/d/1GpTJwrNoXjfV88Bupf4Lhx7JwzEFCrBdIVlJ0232zs8/edit?usp=sharing

Inasmuch as the relationships between TCS and Euler have been worked out, we'll support both worlds.

mdoering commented 6 years ago

Quick notes, @mjy. TCS clearly also deals with <TaxonName>, not just <TaxonConcept>s, even if it might do so in less detail than NOMEN. IPNI is actually exposed as TCS. Secondly I think chresonyms clearly fall into the taxon domain, they are concepts disguised as names.

mjy commented 6 years ago

@mdoering Point taken. @proceps and I will try and spend some time reconciling <TaxonName> with NOMEN.

"Secondly I think chresonyms clearly fall into the taxon domain, they are concepts disguised as names."

:) If I had a dime for the number of times I've heard this said for all references to Names in general.

To me the heart of the issue is this: what can I possibly say about the biology of the taxon if all I know is Aus bus vs Aus (Bus) bus? Can we know if we are talking about a plant, animal, mouse or virus? Is bus the same thing? Who knows? We can guess, and as long as we annotate our data with "this is a guess" I'm OK with that representation. A chresonym (== Combination in TW) is a node without a name; it is defined by reference to other protonyms. It has its own unique ID that is independent of any of the protonym IDs. One or more OTUs can be linked to that chresonym. Those linkages are assertions by the curator. Perhaps they can be inferred by inspecting the name string alone, but these inferences are (very) limited in their "power", and the persistence of these types of inferences is not treated in NOMEN.

Really, I don't think it matters which perspective one takes, just make sure you have concepts in one box, names in the other, and clearly indicate where/how the two are linked. Practically this means that at the level of persistence there are unique IDs for concepts that must have no dependencies on names, i.e. the system must allow one to describe a concept without a taxon name.
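To make the "two boxes" point concrete, here is a rough Python sketch. It is not the TaxonWorks or NOMEN model; all class and field names are assumptions invented for illustration. Names and concepts get independent IDs, and the only connection between them is an explicit link that records who asserted it and whether it is a guess.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


def new_id() -> str:
    """Opaque identifier that carries no information about any name."""
    return uuid.uuid4().hex


@dataclass
class Protonym:
    epithet: str
    authorship: str
    id: str = field(default_factory=new_id)


@dataclass
class Combination:
    """A node with no name string of its own: it only references protonyms."""
    protonym_ids: tuple[str, ...]
    id: str = field(default_factory=new_id)


@dataclass
class Concept:
    """A biological entity; it can exist without any taxon name."""
    description: str
    id: str = field(default_factory=new_id)


@dataclass
class Link:
    """An explicit, attributable assertion tying a concept to a name node."""
    concept_id: str
    name_id: str
    asserted_by: str
    is_guess: bool = False


def names_for(concept: Concept, links: list[Link]) -> list[str]:
    """IDs of all name nodes linked to the given concept."""
    return [link.name_id for link in links if link.concept_id == concept.id]


genus = Protonym("Aus", "Smith 1900")        # made-up example data
species = Protonym("bus", "Smith 1900")
combo = Combination((genus.id, species.id))

unnamed = Concept("undescribed morphospecies 7")
named = Concept("the taxon a curator associates with Aus bus")
links = [Link(named.id, combo.id, asserted_by="curator", is_guess=True)]

print(names_for(unnamed, links))  # a concept with no name attached
print(names_for(named, links) == [combo.id])
```

The design choice that matters is that `Concept.id` is generated without looking at any name, so a concept can be stored and referenced before (or without ever) being linked to one.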

mjy commented 6 years ago

Had some conversations with others here. In retrospect I think the concept of chresonym I chased in the thread here was subtly different, and at times off-topic to what @mdoering is looking to resolve. So, take my spewing with a grain of salt.

In general I think my issue was that the way we (TW) represent data more or less eliminates the need to address this problem, but I was not thinking about the many ways others represent their data in a less refined (and perhaps therefore more easily conflicting) manner. It's those data that you're looking to resolve: true unknowns, or synonyms in the broad sense (like why are there 4 different authors for the same Aus bus). The conclusion remains somewhat similar: the quality of the solution will be bound by the nature of what you actually know (the quality and scope of your protonyms), but the approach to resolving the problem is more nuanced than I was thinking.

dremsen commented 6 years ago

My introduction to chresonyms was first with the COL reptile data from Uetz but I got my teeth into them with Hershkovitz’ Catalog of Living Whales.

See https://www.biodiversitylibrary.org/item/33227#page/43/mode/1up https://www.biodiversitylibrary.org/item/33227#page/51/mode/1up

For purposes of improving recall, I wanted the synonymy (broad sense) but this required working out how to handle and model this both syntactically and semantically. In order to not offend the author I also wanted to be able to recreate the fidelity of the original. I subsequently came across many zoological catalogs that followed this general format.

mdoering commented 6 years ago

When looking at a synonymy full of chresonyms, which we see as real input to the CoL, I think we can flag likely chresonyms in many cases: http://reptile-database.reptarium.cz/species?genus=Aspidoscelis&species=tigris

You can easily recognize chresonyms here by the markup using the dash before the authorship. Unfortunately this is gone when the data gets to us.

The idea would be to first identify clear real names and then mark all "homonyms" with different authorships as chresonyms. If they were real later homonyms they would be heterotypic and unlikely to appear in the synonymy. In the example above the accepted name and its basionym can be identified as clear good names. This leaves the rest of the entire first block to be potential chresonyms.

Also, chresonyms never have a basionym authorship in brackets. So if there is a cluster of identical canonical names which includes a name with brackets, all others are likely chresonyms:

Aspidoscelis tigris aethiops (COPE 1900)
Cnemidophorus tesselatus aethiops COPE 1900: 582
Cnemidophorus tigris aethiops SMITH & BURGER 1949: 282
Cnemidophorus tigris aethiops ZWEIFEL & NORRIS 1955
Cnemidophorus tigris aethiops MASLIN & SECOY 1986
Aspidoscelis tigris aethiops REEDER et al. 2002
Aspidoscelis tigris aethiops LINER & CASAS-ANDREU 2008
Cnemidophorus tigris aethiops SMITH & TAYLOR 1950: 189

You can spot Aspidoscelis tigris aethiops (COPE 1900) as a real name and therefore also its basionym Cnemidophorus tesselatus aethiops COPE 1900, leaving Cnemidophorus tigris aethiops SMITH & BURGER 1949 as a chresonym and all the other aethiops names too.
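The bracket rule above could be sketched roughly as follows in Python. This is only an illustration under several assumptions that are not part of the CoL code: authors are written in upper case (as in the Reptile DB strings above), names are clustered by their terminal epithet so that the basionym lands in the same cluster, and a trailing `: page` is ignored when comparing authorships. The function names are hypothetical.

```python
import re


def split_name(raw):
    """Split 'Genus species AUTHOR 1900: 12' into (canonical, authorship).

    Assumes the authorship starts at the first token that opens a bracket
    or is written entirely in upper case.
    """
    tokens = raw.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("(") or tok.isupper():
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    return raw, ""


def normalise_authorship(authorship):
    """Drop brackets and a trailing ': page' so 'COPE 1900: 582' matches '(COPE 1900)'."""
    return re.sub(r":\s*\d+$", "", authorship.strip("()")).strip()


def flag_chresonyms(synonymy):
    """Map each raw name to 'name', 'likely chresonym' or 'unresolved'."""
    clusters = {}
    for raw in synonymy:
        canonical, authorship = split_name(raw)
        epithet = (canonical.split() or [""])[-1]  # assumption: cluster by terminal epithet
        clusters.setdefault(epithet, []).append((raw, authorship))

    result = {}
    for members in clusters.values():
        # Authorships that appear in brackets identify the original author and year.
        original = {normalise_authorship(a) for _, a in members if a.startswith("(")}
        for raw, authorship in members:
            if not original:
                result[raw] = "unresolved"  # no bracketed name: needs manual input
            elif normalise_authorship(authorship) in original:
                result[raw] = "name"        # the recombination or its basionym
            else:
                result[raw] = "likely chresonym"
    return result


synonymy = [
    "Aspidoscelis tigris aethiops (COPE 1900)",
    "Cnemidophorus tesselatus aethiops COPE 1900: 582",
    "Cnemidophorus tigris aethiops SMITH & BURGER 1949: 282",
    "Cnemidophorus tigris aethiops ZWEIFEL & NORRIS 1955",
    "Cnemidophorus tigris aethiops MASLIN & SECOY 1986",
    "Aspidoscelis tigris aethiops REEDER et al. 2002",
    "Aspidoscelis tigris aethiops LINER & CASAS-ANDREU 2008",
    "Cnemidophorus tigris aethiops SMITH & TAYLOR 1950: 189",
]
flags = flag_chresonyms(synonymy)
for raw in synonymy:
    print(f"{flags[raw]:17} {raw}")
```

A cluster that contains no bracketed authorship comes back as "unresolved" rather than being guessed at, so it can be queued for manual review.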

In the case of Cnemidophorus bacatus VAN DENBURGH & SLEVIN 1921 without basionym this is not possible and it needs manual input to select the one real name out of the pool of homonyms:

Cnemidophorus bacatus VAN DENBURGH & SLEVIN 1921
Cnemidophorus bacatus SMITH & TAYLOR 1950: 187
Cnemidophorus bacatus MASLIN & SECOY 1986
Aspidoscelis bacatus LINER & CASAS-ANDREU 2008
Aspidoscelis bacata JONES & LOVICH 2009
Aspidoscelis bacata ENDERSON et al. 2009
Aspidoscelis bacata JOHNSON et al. 2017