OpenTreeOfLife / reference-taxonomy

Open Tree Reference Taxonomy (OTT) tools
BSD 2-Clause "Simplified" License

Add direct links to EOL #114

Open jar398 opened 9 years ago

jar398 commented 9 years ago

Yan Wong via googlegroups.com to opentreeoflife Is there any move to producing a permanent mapping of the unique ottID to the unique Encyclopedia of Life page ID for a taxon? I've also asked this on the Encyclopedia of Life forums, in case they are thinking of placing ott IDs on their pages. At the moment I presume this is all done laboriously each time via a taxonomy name resolution service? I also assume that the ott IDs are meant to be permanently attached to taxa, regardless of the version of the OpenTree that is in use (though presumably some higher-level taxa IDs may become obsolete with further revisions to the tree).

jar398 commented 9 years ago

Jonathan A Rees to opentreeoflife I've spoken with people at EOL about how to do this. It would be possible, but is a major undertaking and we don't have the resources to do it, given the relative priority of the task for the Open Tree project. It would involve processing many, maybe all, of their classifications. The database is quite large, and the schema and semantics would have to be mastered before any code could be written.

If EOL were ever to synthesize a reference taxonomy of their own, it would be a much simpler matter to map it to OTT. But they haven't done this yet - of course they are strapped for resources too.

hyanwong commented 9 years ago

Might it be useful to go via NCBI or GBIF identifiers, where they exist? The NCBI/GBIF IDs could be got using https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-of-Life-APIs#node_info, then mapped to EoL page IDs using http://eol.org/api/docs/search_by_provider. I don't know how many of the ott leaf nodes have an associated NCBI / GBIF ID, though.

jar398 commented 9 years ago

That's a great idea. I'll look into it.

hyanwong commented 9 years ago

I've just run a quick check. If you allow NCBI, GBIF, IndexFungorum, & IRMNG, then that covers all but 707 of the 2313890 species-level leaves of the OpenTree. There's a small EoL bug which means that IRMNG isn't listed as a hierarchy provider, but I've just heard that they'll try to correct that.

hyanwong commented 9 years ago

OK - I've just heard back from EoL: strangely IRMNG is currently listed as "Extant & Habitat resource" using the API, so that means you can call http://eol.org/api/search_by_provider/1.0/12345.json?hierarchy_id=xxx to get the EoL page ID for the identifier 12345, where xxx = 1172 for NCBI identifiers, 800 for GBIF, 596 for Index Fungorum, and 1347 for IRMNG. That should cover 99.9% of OpenTree leaves.
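The lookup described above can be sketched as a small URL builder. This is only a sketch: it assumes the URL pattern and hierarchy ids quoted in this comment are still valid (the EOL API may have changed since), the names `HIERARCHY_IDS` and `eol_lookup_url` are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hierarchy ids as quoted in the comment above; they may be out of date.
HIERARCHY_IDS = {"ncbi": 1172, "gbif": 800, "if": 596, "irmng": 1347}


def eol_lookup_url(source, identifier):
    """Build the search_by_provider URL for one source-taxonomy identifier.

    `source` is a short source name (case-insensitive); an unknown
    source raises KeyError.
    """
    hierarchy_id = HIERARCHY_IDS[source.lower()]
    query = urlencode({"hierarchy_id": hierarchy_id})
    return "http://eol.org/api/search_by_provider/1.0/%s.json?%s" % (identifier, query)


# 12345 is the placeholder identifier used in the comment, not a real taxon.
print(eol_lookup_url("ncbi", 12345))
```

Fetching the resulting URL and reading the EoL page ID out of the JSON response is left out, since the response format is not described in this thread.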

hyanwong commented 9 years ago

Another problem: a taxon like https://tree.opentreeoflife.org/opentree/argus/ottol@2989884/Gnathophausia-childressi on the OpenTree has a GBIF and an IRMNG id, but if you look at those records, they are actually all from WoRMS. EoL only seems to bother saving the WoRMS ID (see http://eol.org/pages/4287771/overview), presumably to avoid duplication. So unless there is a way to convert (say) GBIF ids to WoRMS ids, which can then be used to query EoL, taxa like this may be missed.

jar398 commented 9 years ago

The next OTT version will have WoRMS in it, but that of course doesn't solve the problem for other taxonomies that feed GBIF. Unfortunately the GBIF dump doesn't provide the origin's id.

hyanwong commented 9 years ago

From a cursory inspection, WoRMS is the main culprit. Cyndy at EOL thinks that they should have the GBIF IDs too, so it may just be an EOL oversight. If most are GBIF/WoRMS issues, then it would be possible to use the GBIF API to get the original WoRMS ids, see http://lists.gbif.org/pipermail/api-users/2015-January/000116.html

jar398 commented 9 years ago

A draft OTT 2.9 with WoRMS is available here:

http://files.opentreeoflife.org/ott/ott2.9/ott2.9draft3.tgz

jar398 commented 7 years ago

Took another look at this. @hyanwong's wikidata script has stopped working, but he pointed me at EOL's identifiers.csv file. It provides concordance with NCBI, WoRMS, and Index Fungorum, which is a start, but not GBIF or IRMNG. Very easy to work with, once you've reverse engineered the hierarchy ids. I've written the code to add EOL page ids to OTT nodes as if they were source taxon ids, and will make a PR soon. Maybe this can be in OTT 3.1.

hyanwong commented 7 years ago

IRMNG is 1347. Here's what I use for EoL hierarchy ids <=> OTT:

from collections import OrderedDict
sources = OrderedDict((('ncbi', 1172), ('if', 596), ('worms', 123), ('irmng', 1347), ('gbif', 800)))

I guess that means you don't need a script from me.
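For illustration, here is a rough sketch of how that `sources` table might be used to turn rows from an EOL identifiers file into OTT-style `source:id` keys. The column layout (identifier, hierarchy id, page id) is an assumption made for this example, not the documented format of identifiers.csv, and the sample rows are made up; `eol_pages_by_source_id` is a hypothetical helper name.

```python
import csv
import io
from collections import OrderedDict

sources = OrderedDict((('ncbi', 1172), ('if', 596), ('worms', 123),
                       ('irmng', 1347), ('gbif', 800)))
# Reverse lookup: EoL hierarchy id -> OTT source prefix.
hierarchy_to_source = {hid: name for name, hid in sources.items()}


def eol_pages_by_source_id(rows):
    """Map 'source:id' strings to EOL page ids.

    `rows` yields (identifier, hierarchy_id, page_id) triples; this
    layout is assumed, so check it against the real file first.
    Rows whose hierarchy id is not in `sources` are skipped.
    """
    mapping = {}
    for identifier, hierarchy_id, page_id in rows:
        source = hierarchy_to_source.get(int(hierarchy_id))
        if source is None:
            continue
        mapping["%s:%s" % (source, identifier)] = int(page_id)
    return mapping


# Made-up sample rows, only to show the shape of the result.
sample = io.StringIO("101,1172,5001\n202,123,5002\n303,9999,5003\n")
print(eol_pages_by_source_id(csv.reader(sample)))
```

The resulting keys can then be matched against the source ids that OTT records for each taxon.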

mtholder commented 7 years ago

just chiming in to note Rod Page's recent post on wikidata parsing: http://iphylo.blogspot.com/2017/03/notes-for-wikicite-2017-wikispecies.html

hyanwong commented 7 years ago

He's not trying to parse wikidata. He's trying to parse wikispecies. A different beast altogether.

jar398 commented 7 years ago

So I have an OTT draft that has EOL ids in it.

But wow. We've got 13,000 EOL page ids that correspond to more than one OTT record. While some of these represent strains or other kinds of lumpings of closely related taxa, it looks to me as if most of them represent homonym or synonym errors either in OTT or in EOL. A gold mine! Not sure how to use this information.

E.g. EOL uses a single page id for both Annellas, but one is a fungus and the other a Cnidarian. OTT has separate records for Vemakylindrus costaricanus and Makrokylindrus costaricanus, but EOL has a single page id, so I suspect they're the same, because the epithet is the same and the genus name is similar.
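A minimal sketch of how such collisions could be listed, assuming a plain dict from OTT id to EOL page id is available. The function name `shared_eol_pages` and the ids in the example are hypothetical; this is not the code used to build the draft.

```python
from collections import defaultdict


def shared_eol_pages(ott_to_eol):
    """Return {eol_page_id: [ott_ids]} for pages claimed by more than one OTT record."""
    by_page = defaultdict(list)
    for ott_id, page_id in ott_to_eol.items():
        by_page[page_id].append(ott_id)
    # Keep only pages that more than one OTT record points at.
    return {page: sorted(otts) for page, otts in by_page.items() if len(otts) > 1}


# Made-up ids: OTT records 11 and 12 point at the same EOL page.
example = {11: 900, 12: 900, 13: 901}
print(shared_eol_pages(example))
```

Each entry in the result is a candidate for manual review: either a legitimate lumping (e.g. strains) or a homonym or synonym error on one side.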

hyanwong commented 7 years ago

Yes. For the synonym issues in EoL, you can simply take the lowest of the EoL page IDs (when two or more EoL pages are merged as synonyms, e.g. by hand or when their merging algorithm changes, the lowest id takes priority).

The problem is, of course, how to recognise EOL's mistaken synonymies. It would be nice to pass a list of these to EoL.
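The lowest-id rule for merged EoL pages could be applied along these lines. This is a sketch that assumes each OTT record has already been matched to a set of candidate EoL page ids; `collapse_to_lowest_page` is a hypothetical name, and the ids in the example are made up.

```python
def collapse_to_lowest_page(ott_to_pages):
    """Pick the lowest EoL page id for each OTT record.

    `ott_to_pages` maps an OTT id to an iterable of candidate page ids;
    records with no candidates are left out of the result.
    """
    result = {}
    for ott_id, pages in ott_to_pages.items():
        pages = list(pages)
        if pages:
            result[ott_id] = min(pages)
    return result


# Made-up ids: record 21 matched two pages that EoL has since merged.
print(collapse_to_lowest_page({21: [4002, 4001], 22: [4100], 23: []}))
```

This only handles genuine merges; it does nothing to detect the mistaken synonymies mentioned above.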

jar398 commented 4 years ago

Looks like the current EOL 'provider identifiers' file is here? https://eol.org/data/provider_ids.csv.gz https://gitter.im/EOL/eol?at=5dbae2afa03ae1584f4106c7

TonyRees commented 4 years ago

Wikidata is maintaining some or many such mappings - not sure how reliably of course, although the principle "anyone can edit" presumably applies for fixing errors, etc. For example the list of "Taxon identifiers" as it appears at the bottom of https://en.wikipedia.org/wiki/Homo is automatically generated from Wikidata (https://www.wikidata.org/wiki/Q171283), I believe (also has included IRMNG IDs for the past 18 months or so). Although I see no OpenTree IDs ??? In case this helps...

hyanwong commented 4 years ago

I asked about OpenTree IDs on wikidata and was politely rebuffed. The EoL IDs on wikidata will need reworking, as many have changed. I'm not sure if I should prompt EoL to do this, prompt wikidata, or try to hack it myself somehow...

TonyRees commented 4 years ago

Yan Wong said: "I asked about OpenTree IDs on wikidata and was politely rebuffed". In the IRMNG case, it was proposed by Andy Mabbett (Pigsonthewing), as per here: https://www.wikidata.org/wiki/Wikidata:Property_proposal/IRMNG_taxon_ID

I do not see why you should have been "politely rebuffed". It might be worth dropping a line to Andy to get his take on this. The IRMNG proposal (which was made independently of myself) got 6 supporting votes apparently, all positive... I can't remember what triggered it at all now, although I did volunteer IRMNG content to Wikispecies as a useful resource a little while earlier.

hyanwong commented 4 years ago

The discussion is at https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Taxonomy/Archive/2015/10#OpenTree_IDs

TonyRees commented 4 years ago

Hmm... I guess a new case could be made, maybe. If I were assessing it, I would look at the permanence and profile of the project (is it staying around? Is it already, or likely to be in the future, well linked in to other Biodiversity Informatics initiatives?), also what new things does it bring to the table that are not already available elsewhere (e.g. if it is not just a "dumb aggregator", where is the value-adding, or novel content generation?). These are questions I cannot answer myself, but someone might be interested in making the case. Just a thought. Cheers - Tony