Open dimus opened 9 years ago
Any ETA on this? I'm particularly interested in EoL page ID mapping to e.g. NCBI ids.
I've been busy these past few months, but starting later this month/early next month, I plan to implement a number of improvements to BioGUID, including the CSV dump. Do you have identifier links already? Or would you like me to prioritize that particular set (EOL<->NCBI)?
Well, I have a specific aim: to get a map of OpenTree of Life OTT IDs to EoL page IDs. I can already get mappings from OpenTree to NCBI, WoRMS, IRMNG, GBIF, and Index Fungorum IDs (from http://files.opentreeoflife.org/ott/), so it is the link from these sources (NCBI, IF, GBIF, etc.) to EoL pages that I am missing. But that may be too specific a request to be useful to other users.
Ah! OK. So, I would suggest that I first incorporate all the OpenTree -> NCBI, WoRMS, IRMNG, GBIF, and Index Fungorum links into BioGUID; then, if any of those other IDs are already linked to EoL, they will all likewise be linked to OpenTree. Moreover, going forward, linking EoL IDs to ANY of the other ones that OpenTree is already linked to will automatically ensure that OpenTree is linked to EoL as well. This is exactly the sort of use case that BioGUID was intended to support. In fact, I think I will make this my first priority task (i.e., harvesting the OpenTree IDs for all the ones already linked), then follow that up with an effort to link EoL to any of the others. So, where will I find the OpenTree->XXX links within http://files.opentreeoflife.org/ott/?
OK, I just downloaded/imported the OTT dataset (v2.9), and I see now where the identifiers are stored. I'll parse these out and incorporate them into BioGUID within the next couple of weeks, then look into ways of getting EoL cross-links incorporated as well.
Great! In the taxonomy.tsv file in the download link I sent, there is a column named 'sourceinfo' with entries like:
ncbi:10239,gbif:8,irmng:19
The first column, labelled uid, is the OTT (Open Tree Taxonomy) ID.
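In case it helps, here is a minimal sketch of pulling the cross-links out of taxonomy.tsv. It assumes the file has a header row containing `uid` and `sourceinfo`, and that fields are separated by `\t|\t` (falling back to a plain tab if not); both are worth checking against the actual download. The function names and the uid `123` in the sample are made up for illustration.

```python
import io


def parse_sourceinfo(sourceinfo):
    """Split 'ncbi:10239,gbif:8,irmng:19' into [('ncbi', '10239'), ...]."""
    pairs = []
    for entry in sourceinfo.split(","):
        entry = entry.strip()
        if ":" not in entry:
            continue  # skip empty or malformed entries
        prefix, _, ident = entry.partition(":")
        pairs.append((prefix, ident))
    return pairs


def read_ott_links(lines):
    """Yield (ott_id, source, source_id) triples from taxonomy.tsv lines."""
    lines = iter(lines)
    header_line = next(lines).rstrip("\n")
    # Assumption: OTT uses tab-pipe-tab as the field separator.
    sep = "\t|\t" if "\t|\t" in header_line else "\t"
    header = header_line.split(sep)
    uid_col = header.index("uid")
    src_col = header.index("sourceinfo")
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) <= max(uid_col, src_col):
            continue  # short or blank line
        for source, ident in parse_sourceinfo(fields[src_col]):
            yield fields[uid_col], source, ident


# Tiny in-memory sample; the uid 123 is invented for the example.
sample = (
    "uid\t|\tparent_uid\t|\tname\t|\trank\t|\tsourceinfo\t|\t\n"
    "123\t|\t\t|\tViruses\t|\tno rank\t|\tncbi:10239,gbif:8,irmng:19\t|\t\n"
)
links = list(read_ott_links(io.StringIO(sample)))
```

For the real file, something like `read_ott_links(open("taxonomy.tsv", encoding="utf-8"))` should stream it row by row without loading everything into memory.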
Are you also aware of the Wikidata Q IDs (e.g. https://www.wikidata.org/wiki/Q2267046)? If you can map those somehow too, then they immediately provide links to Wikipedia pages for taxa in different languages, categories of media files for taxa, etc.
I don't have links to Wikidata yet; do you know if they have a way to bulk download their identifiers mapped to genus/species/etc. and their own identifier cross-links (ITIS, EoL, WoRMS, GBIF, Dyntaxa)?
Yes: you download their JSON dump and parse it. I'll post some code to do this.
One name at a time or in bulk?
It runs on the bulk 8 GB JSON download file, which has one record per line.
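Roughly, the per-line handling looks like the sketch below (this is not the script itself, just the general shape). It assumes each line of the dump is one JSON entity followed by a trailing comma, with the array brackets on their own lines. The property IDs in `XREF_PROPS` (P685 for NCBI, P830 for EoL, P846 for GBIF, P850 for WoRMS) are from memory and should be verified on Wikidata before use; the identifier values in the sample record are invented.

```python
import json

# Assumed Wikidata property IDs for taxon cross-links; verify before use.
XREF_PROPS = {"P685": "ncbi", "P830": "eol", "P846": "gbif", "P850": "worms"}


def extract_xrefs(line):
    """Return (qid, {source: [ids]}) for one dump line, or None if no links."""
    line = line.strip().rstrip(",")
    if not line.startswith("{"):
        return None  # the '[' and ']' lines that wrap the array
    entity = json.loads(line)
    claims = entity.get("claims", {})
    xrefs = {}
    for prop, label in XREF_PROPS.items():
        for claim in claims.get(prop, []):
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # 'novalue' / 'somevalue' statements carry no ID
            xrefs.setdefault(label, []).append(snak["datavalue"]["value"])
    return (entity["id"], xrefs) if xrefs else None


# One fabricated record in the dump's shape (ID values are made up).
record = {
    "id": "Q2267046",
    "claims": {
        "P685": [{"mainsnak": {"snaktype": "value",
                               "datavalue": {"value": "111", "type": "string"}}}],
        "P830": [{"mainsnak": {"snaktype": "value",
                               "datavalue": {"value": "222", "type": "string"}}}],
    },
}
result = extract_xrefs(json.dumps(record) + ",")
```

Because each record is handled independently, the dump can be read line by line (e.g. via `gzip.open(path, "rt", encoding="utf-8")` for the gzipped file) without ever holding the whole thing in memory.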
Excellent! Link?
Just modifying my script for you.
By the way, note that there are a few entries in the OTT taxonomy.tsv file which have two (different) NCBI IDs. These are (always?) ones that have an NCBI ID immediately followed by a SILVA ID, and then, later in the list, another NCBI ID. The Open Tree people have pointed out that these are cases where the first NCBI ID has been derived indirectly, via SILVA (https://groups.google.com/d/msg/opentreeoflife/L2x3Ond16c4/CVp6msiiCgAJ), and in these cases I have noticed that the first NCBI ID is often wrong. I reckon that first NCBI ID can probably be ignored if there is an alternative.
Attached a rough Python script. I haven't tested it yet, and it will probably need debugging, but it should give you the general idea.
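One way to apply that rule when parsing, as a sketch only: treat an NCBI entry that is immediately followed by a SILVA entry as "derived via SILVA", and prefer any other NCBI entry if one exists. This encodes my heuristic above rather than anything documented by Open Tree, the function name `pick_ncbi_id` is hypothetical, and the IDs in the examples are invented.

```python
def pick_ncbi_id(sourceinfo):
    """Choose one NCBI ID from a sourceinfo string, or None if there is none.

    Heuristic: an 'ncbi:' entry directly followed by a 'silva:' entry is
    assumed to be derived via SILVA and is used only as a last resort.
    """
    entries = [e.strip() for e in sourceinfo.split(",") if e.strip()]
    direct, via_silva = [], []
    for i, entry in enumerate(entries):
        if not entry.startswith("ncbi:"):
            continue
        ident = entry.split(":", 1)[1]
        next_is_silva = i + 1 < len(entries) and entries[i + 1].startswith("silva:")
        (via_silva if next_is_silva else direct).append(ident)
    if direct:
        return direct[0]
    return via_silva[0] if via_silva else None


# Invented IDs, just to show the three cases.
both = pick_ncbi_id("ncbi:111,silva:AB000001,ncbi:222")   # prefers the later one
only_indirect = pick_ncbi_id("ncbi:111,silva:AB000001")   # falls back to the first
plain = pick_ncbi_id("ncbi:10239,gbif:8,irmng:19")
```

Keeping the SILVA-derived ID as a fallback, rather than dropping it outright, means taxa with no other NCBI link still get one; whether that is desirable depends on how often those indirect IDs turn out to be wrong.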
Yan
Hope the script is helpful. I just realised that I didn't get it to actually print out the Wikidata QID for each line (doh), but I guess you can fix that easily. Is there any ETA for a CSV matching file, especially one with Encyclopedia of Life <=> OpenTree IDs? I don't know whether @dimus has provided an EoL ID dump to BioGUID yet, so I don't know how feasible this is.
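To illustrate what I have in mind for the matching file, here is a sketch of composing two sets of links into an OTT <=> EoL CSV: OTT IDs linked to some third-party ID (NCBI, GBIF, etc.), and those same third-party IDs linked to EoL page IDs. The input shapes, column names, and all ID values are assumptions for the example; I have no idea how BioGUID stores its links internally.

```python
import csv
import io


def compose_links(ott_to_source, source_to_eol):
    """Join OTT->source links with source->EoL links on (source, source_id).

    ott_to_source: iterable of (ott_id, source, source_id) triples.
    source_to_eol: dict mapping (source, source_id) -> eol_page_id.
    Returns sorted, de-duplicated (ott_id, eol_page_id, via) rows.
    """
    rows = set()
    for ott_id, source, source_id in ott_to_source:
        eol_id = source_to_eol.get((source, source_id))
        if eol_id is not None:
            rows.add((ott_id, eol_id, f"{source}:{source_id}"))
    return sorted(rows)


def write_csv(rows, handle):
    """Write the matched rows with a header line."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ott_id", "eol_page_id", "via"])
    writer.writerows(rows)


# Invented IDs: OTT 123 reaches EoL 999 through its NCBI link only.
ott_links = [("123", "ncbi", "10239"), ("123", "gbif", "8"), ("456", "gbif", "77")]
eol_links = {("ncbi", "10239"): "999"}
matched = compose_links(ott_links, eol_links)

out = io.StringIO()
write_csv(matched, out)
csv_text = out.getvalue()
```

Keeping the `via` column records which intermediate identifier produced each match, which should make it easier to spot conflicts where two routes lead to different EoL pages.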
create csv files with pi to easily find unassigned
CSV dump can be used by others to experiment with distributed approach to matching IDs