OpenTreeOfLife / reference-taxonomy

Open Tree Reference Taxonomy (OTT) tools

map OTT Ids to dbpedia and ToLWeb pages #161

Open mtholder opened 9 years ago

mtholder commented 9 years ago

See also https://groups.google.com/forum/?fromgroups&hl=en#!topic/opentreeoflife/4jDcxryXWxg

jar398 commented 9 years ago

See also #114 , regarding EOL. Mapping to ToLWeb ought to be easy, if it's possible to obtain a comprehensive ToLWeb tree. Not sure how one would map to wikipedia, which doesn't provide a consistent taxonomy. Perhaps via wikidata, using lineage to disambiguate homonyms?

mtholder commented 9 years ago

wikipedia does have "taxoboxes" with structured syntax for a part of the hierarchy.

I do have a ToLWeb dump as XML, but there may be a more recent one...

kcranston commented 9 years ago

Link on the tolweb download page

lonelyjoeparker commented 9 years ago

Hi all

I've not got involved with this project directly yet, but in case it's relevant I thought I'd mention that there's a big effort here at Kew around rationalisation / disambiguation / mapping of synonyms through the International Plant Names Index data (and other sources). I think the aim is to be 'comprehensive, but not authoritative', e.g. pulling everything in but not definitively ruling on taxonomy, at least at first.

In particular there is a big project brewing here called the Plants of The World Online Portal (POWOP; still in alpha AFAIK, but it has been demoed to e.g. Doug & Pam Soltis), which I think might provide disambiguation services. @jiacona and @nickynicholson might well know more. A lot of POWOP is forked from eMonocot (github), I think.

lonelyjoeparker commented 9 years ago

(oops, make that @nickynicolson - sorry Nicky...)

hyanwong commented 9 years ago

I'm looking into this for a different project. The best way is to go via Wikidata, not wikipedia. For example, the wikidata entry for gorilla is at https://www.wikidata.org/wiki/Q36611. It doesn't have an OTT ID (see the discussion I started at https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Taxonomy#OpenTree_IDs), but it does have e.g. ITIS and NCBI IDs (as well as an EoL one, incidentally). So we should be able to take the entire wikidata JSON dump (https://www.wikidata.org/wiki/Wikidata:Database_download), use the NCBI/etc IDs to find the corresponding OTT id, then parse the JSON for wikipedia entries in different languages.
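
The OTT side of that lookup could be built from the sourceinfo column of taxonomy.tsv. A minimal sketch, assuming the usual uid | parent_uid | name | rank | sourceinfo | uniqname | flags layout with columns separated by tab-pipe-tab (not checked against the actual file here):

```python
# Sketch only: map source ids such as "ncbi:9606" to OTT ids using the
# sourceinfo column of taxonomy.tsv. The column layout is an assumption.
def build_source_to_ott(taxonomy_path="taxonomy.tsv"):
    source_to_ott = {}
    with open(taxonomy_path) as f:
        f.readline()  # skip the header row
        for line in f:
            fields = line.rstrip("\n").split("\t|\t")
            ott_id, sourceinfo = fields[0], fields[4]
            for source_id in sourceinfo.split(","):
                if source_id:
                    source_to_ott[source_id] = ott_id  # e.g. "ncbi:9606" -> OTT id
    return source_to_ott
```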

hyanwong commented 9 years ago

Update: the wikidata JSON dump is about 5GB zipped, 30GB+ unzipped. We need to run this through a JSON parser, preferably one that parses JSON streams rather than swallowing up the whole JSON into memory. I was thinking of using python to do this (see https://changelog.com/ijson-parse-streams-of-json-in-python/). Any thoughts?
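
For concreteness, a rough sketch of the streaming pass with ijson, not the actual script: it assumes the dump is the usual single JSON array of entity objects, that P685 is the "NCBI taxonomy ID" property, and that `ncbi_to_ott` is a lookup of the kind sketched above.

```python
# Stream the wikidata dump entity by entity instead of loading 30GB+ into memory.
import gzip
import ijson

def wikidata_to_ott(dump_path, ncbi_to_ott):
    with gzip.open(dump_path, "rb") as f:
        for entity in ijson.items(f, "item"):     # one wikidata entity at a time
            claims = entity.get("claims", {})
            for claim in claims.get("P685", []):  # NCBI taxonomy ID statements
                try:
                    ncbi_id = claim["mainsnak"]["datavalue"]["value"]
                except KeyError:
                    continue
                ott_id = ncbi_to_ott.get("ncbi:" + str(ncbi_id))
                if ott_id:
                    # sitelinks hold the per-language wikipedia page titles
                    sitelinks = entity.get("sitelinks", {})
                    yield ott_id, entity["id"], sitelinks
```

The dump is opened in binary mode (gzip.open here; the bz2 variant would work the same way with bz2.open), so ijson can read it incrementally.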

jar398 commented 9 years ago

On Wed, Oct 7, 2015 at 9:37 AM, Yan Wong yan@pixie.org.uk wrote:

p.s. did you see my previous post about wikidata/pedia. Not sure if I should ask to get another field into the taxonomy.tsv file for the wikidata ID for each taxon (where available), which will allow OToL to do direct queries using the mediawiki API to display (language specific) wikipedia links as well as a good number of EoL page IDs.

Maybe Karen already said this, but I think the mappings should be placed in separate files, i.e. separate from the main taxonomy file. They could go in the taxonomy tarballs, or could be distributed separately. In any case we'd want to get this information integrated with the webapp and the TNRS (API), but that's independent of how it's represented.

I think the methods used for mapping wikidata, EOL, and Tolweb will probably all be different. There's already an issue for EOL (#114), but maybe a bit later we should consider making a separate issue for Tolweb.

If you're sure wikidata is the way to go, feel free to change the issue title to reflect that.

hyanwong commented 9 years ago

Wikidata is the way to go because each taxon has a unique ID, a parsable taxon_name, and NCBI etc. IDs. That then gives us links to all the wikipedia entries, in multiple languages. No need to change the issue title, though.
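
As a concrete example of the kind of direct MediaWiki API query mentioned above (a hedged sketch, not OToL code; the function name and language list are just illustrative), wbgetentities returns the per-language sitelinks for a Q-id:

```python
# Fetch per-language wikipedia page titles for one wikidata item.
import json
import urllib.parse
import urllib.request

def wikipedia_titles(qid, languages=("en", "de", "fr")):
    params = urllib.parse.urlencode({
        "action": "wbgetentities",
        "ids": qid,
        "props": "sitelinks",
        "format": "json",
    })
    url = "https://www.wikidata.org/w/api.php?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    sitelinks = data["entities"][qid]["sitelinks"]
    return {lang: sitelinks[lang + "wiki"]["title"]
            for lang in languages if lang + "wiki" in sitelinks}

# e.g. wikipedia_titles("Q36611") -> {"en": "Gorilla", ...}
```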

If OToL ends up using GNresolver, then we might also want to put wikidata into GNresolver, I guess. (I asked wikidata about producing a DwC-A, but they aren't ready yet; that's all for another day.)

Yes, it's sensible to put these mappings in a different file.

hyanwong commented 9 years ago

I have a script now. It finds 1327315 taxa mapped to a wikidata item, out of the 3400000 or so in taxonomy.tsv. That's an excellent hit rate, I think. Will post more info soon.

jar398 commented 9 years ago

see also discussion at https://groups.google.com/d/msg/opentreeoflife/L2x3Ond16c4/X-xEcnvJCgAJ

jar398 commented 9 years ago

I think the easiest way to implement these links is to add them as new sources in the sourceids column of the taxonomy. I would rather have them be separate, but that creates a demand for more taxomachine and/or front-end tooling, which I would rather avoid.
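
Purely as an illustration of that approach (`ott_to_qid` here is a hypothetical OTT id to Q-id mapping, and the column layout is assumed as before), appending a wikidata source to an existing taxonomy.tsv row might look like:

```python
# Illustrative sketch: add "wikidata:QNNN" to the sourceinfo column of one row.
def add_wikidata_source(line, ott_to_qid):
    fields = line.rstrip("\n").split("\t|\t")
    qid = ott_to_qid.get(fields[0])
    if qid:
        new_source = "wikidata:" + qid
        # e.g. "ncbi:NNN,gbif:NNN" becomes "ncbi:NNN,gbif:NNN,wikidata:QNNN"
        fields[4] = (fields[4] + "," + new_source) if fields[4] else new_source
    return "\t|\t".join(fields) + "\n"
```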

@mtholder We should figure out what to do about Tolweb. If I have it in smasher taxonomy format, I can easily do the assignments (without adding any new taxa to OTT). Or, if you want to give me a csv, I can add the links that way. (what do the ids look like?)