Create a csv dump of IDs - Githubissues

GlobalNamesArchitecture / BioGUID

BioGUID is a service for indexing and cross-linking identifiers for data objects within the realm of Biodiversity informatics.

http://bioguid.org

Creative Commons Zero v1.0 Universal


Create a csv dump of IDs #10

Open dimus opened 9 years ago

dimus commented 9 years ago

CSV dump can be used by others to experiment with distributed approach to matching IDs

hyanwong commented 8 years ago

Any ETA on this? I'm particularly interested in EoL page ID mapping to e.g. NCBI ids.

deepreef commented 8 years ago

I've been busy these past few months, but starting later this month/early next month, I plan to implement a number of improvements to BioGUID, including the CSV dump. Do you have identifier links already? Or would you like me to prioritize that particular set (EOL<->NCBI)?

hyanwong commented 8 years ago

Well, I have a specific aim to get a map of OpenTree of Life OTT IDs to EoL page IDs. But I can already get a mapping of OpenTree -> NCBI and OpenTree -> WoRMS and OpenTree -> IRMNG and OpenTree -> GBIF and OpenTree -> Index_fungorum IDs (from http://files.opentreeoflife.org/ott/), so it is the link from these sources (NCBI, IF, GBIF, etc.) to EoL pages that I am missing. But that may be too specific a request to be useful to other users.

deepreef commented 8 years ago

Ah! OK. So, I guess I would suggest that I incorporate all the OpenTree -> NCBI and OpenTree -> WoRMS and OpenTree -> IRMNG and OpenTree -> GBIF and OpenTree -> Index_fungorum IDs into BioGUID first; then if any of those other IDs are already linked to EoL, they will all likewise be linked to OpenTree. Moreover, going forward, linking EoL IDs to ANY of the other ones that OpenTree is already linked to will automatically ensure that OpenTree is linked to EoL as well. This is exactly the sort of Use Case that BioGUID was intended to support. In fact, I think I will make this my first priority task (i.e., harvesting the OpenTree IDs to all the ones already linked), then follow that up with an effort to link EoL to any of the others. So... where will I find the OpenTree->XXX links within http://files.opentreeoflife.org/ott/?
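
The transitive linking described in this comment (if OpenTree is linked to NCBI, and NCBI is linked to EoL, then OpenTree is linked to EoL) can be illustrated with a small union-find structure. This is only a sketch of the idea, not BioGUID's actual implementation; the class name `IdLinker`, the `source:id` string convention, and the example identifiers are all hypothetical.

```python
class IdLinker:
    """Union-find over identifier strings such as 'ott:1' or 'ncbi:10'."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative of the cluster containing x."""
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def link(self, a, b):
        """Record that identifiers a and b refer to the same object."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def linked(self, a, b):
        """True if a and b are connected through any chain of links."""
        return self.find(a) == self.find(b)


# Hypothetical identifiers, for illustration only.
linker = IdLinker()
linker.link("ott:1", "ncbi:10")   # harvested from OpenTree
linker.link("eol:99", "ncbi:10")  # added later from an EoL cross-link
print(linker.linked("ott:1", "eol:99"))
```

With this structure, adding a single EoL -> NCBI link is enough to connect the EoL ID to every identifier already clustered with that NCBI ID.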

deepreef commented 8 years ago

OK, I just downloaded/imported the OTT dataset (v2.9), and I see now where the identifiers are stored. I'll parse these out and incorporate them into BioGUID within the next couple of weeks, then look into ways of getting EoL cross-links incorporated as well.

hyanwong commented 8 years ago

Great! In the taxonomy.tsv file in the download link I sent, there is a column named 'sourceinfo' with entries like:

ncbi:10239,gbif:8,irmng:19

The first column, labelled uid, is the OTT (Open Tree Taxonomy) ID.
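
As a rough sketch of how the `uid` and `sourceinfo` columns might be pulled out of taxonomy.tsv: the code below assumes the fields are separated by tab-pipe-tab and that the first line is a header naming the columns, so check both against the actual download. The function names are hypothetical, and the sample row is made up apart from the `sourceinfo` value quoted above.

```python
SEP = "\t|\t"  # assumed field separator; verify against the real file


def parse_sourceinfo(sourceinfo):
    """Turn 'ncbi:10239,gbif:8,irmng:19' into [('ncbi', '10239'), ...]."""
    pairs = []
    for entry in sourceinfo.split(","):
        entry = entry.strip()
        if not entry:
            continue
        prefix, _, ident = entry.partition(":")
        pairs.append((prefix, ident))
    return pairs


def read_ott_links(lines):
    """Yield (ott_uid, [(source, id), ...]) for each row after the header."""
    rows = iter(lines)
    header = next(rows).rstrip("\n").split(SEP)
    uid_col = header.index("uid")
    src_col = header.index("sourceinfo")
    for line in rows:
        fields = line.rstrip("\n").split(SEP)
        yield fields[uid_col], parse_sourceinfo(fields[src_col])


# Made-up two-line sample in the assumed layout.
sample = [
    "uid\t|\tparent_uid\t|\tname\t|\trank\t|\tsourceinfo\t|\t\n",
    "4\t|\t1\t|\tViruses\t|\tno rank\t|\tncbi:10239,gbif:8,irmng:19\t|\t\n",
]
for uid, links in read_ott_links(sample):
    print(uid, links)
```

On the real file, `read_ott_links` can be passed an open file handle directly, since it only iterates over lines.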

hyanwong commented 8 years ago

Are you also aware of the Wikidata Q IDs, e.g. https://www.wikidata.org/wiki/Q2267046? If you can map those somehow too, then they immediately provide links to Wikipedia pages for taxa in different languages, categories of media files for taxa, etc.

deepreef commented 8 years ago

I don't have links to Wikidata yet; do you know if they have a way to bulk download their identifiers mapped to genus/species/etc. and their own identifier cross-links (ITIS, EoL, WoRMS, GBIF, Dyntaxa)?

hyanwong commented 8 years ago

Yes: you download their JSON dump and parse it. I'll post some code to do this.

deepreef commented 8 years ago

One name at a time or in bulk?

hyanwong commented 8 years ago

It runs on the bulk 8 GB JSON download file, which has one record per line.
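
A minimal sketch of streaming a one-record-per-line dump like this, not the script attached later in the thread. It assumes the dump is a JSON array with one entity per line (each line ending in a comma, with `[` and `]` on their own lines) and that entities follow the usual Wikidata `claims` / `mainsnak` / `datavalue` layout. The property IDs are assumptions that should be checked on wikidata.org, and the function names and sample entity are hypothetical.

```python
import json

# Assumed Wikidata property IDs; verify before relying on them.
TAXON_NAME = "P225"
XREF_PROPS = {"P685": "ncbi", "P846": "gbif", "P830": "eol"}


def claim_values(entity, prop):
    """Return the plain values of all claims for one property."""
    values = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            values.append(snak["datavalue"]["value"])
    return values


def taxon_xrefs(lines):
    """Yield (qid, taxon_name, {source: [ids]}) for entities with a taxon name."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(line)
        names = claim_values(entity, TAXON_NAME)
        if not names:
            continue  # not a taxon
        xrefs = {label: claim_values(entity, prop)
                 for prop, label in XREF_PROPS.items()}
        yield entity["id"], names[0], xrefs


def _claim(value):
    """Build a minimal made-up claim for the sample below."""
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"value": value, "type": "string"}}}


# Made-up entity, for illustration only.
sample_entity = {"id": "Q1", "claims": {"P225": [_claim("Aus bus")],
                                        "P685": [_claim("123")]}}
sample_lines = ["[\n", json.dumps(sample_entity) + ",\n", "]\n"]
for qid, name, xrefs in taxon_xrefs(sample_lines):
    print(qid, name, xrefs)
```

Because the generator reads one line at a time, it can be fed a handle from `gzip.open(..., "rt")` or `bz2.open(..., "rt")` on the compressed dump without loading the whole file into memory.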

deepreef commented 8 years ago

Excellent! Link?

hyanwong commented 8 years ago

Just modifying my script for you.

By the way, note that there are a few entries in the OTT taxonomy.tsv file which have 2 (different) ncbi IDs. These are (always?) ones that have an NCBI id immediately followed by a silva ID, and then later in the list, another ncbi ID. The open tree people have pointed out that these are cases where the first NCBI id has been derived indirectly, via SILVA (https://groups.google.com/d/msg/opentreeoflife/L2x3Ond16c4/CVp6msiiCgAJ), and in these cases, I have noticed that the first ncbi ID is often wrong: I reckon that first NCBI number can probably be ignored if there is another alternative.
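
One way to apply that rule of thumb when reading `sourceinfo`, as a sketch only: if the first ncbi entry is immediately followed by a silva entry and a second ncbi entry exists, prefer the second. The function name is hypothetical and the identifiers in the examples are invented (including the form of the silva ID); this encodes the heuristic suggested above, not an official OpenTree rule.

```python
def pick_ncbi(sourceinfo):
    """Pick one NCBI ID from a sourceinfo string, or None if there is none.

    Heuristic: an ncbi entry directly followed by a silva entry is treated
    as SILVA-derived and skipped when a later ncbi entry is available.
    """
    entries = [e.strip().partition(":")
               for e in sourceinfo.split(",") if e.strip()]
    ncbi_positions = [i for i, (prefix, _, _) in enumerate(entries)
                      if prefix == "ncbi"]
    if not ncbi_positions:
        return None
    first = ncbi_positions[0]
    via_silva = (first + 1 < len(entries)
                 and entries[first + 1][0] == "silva")
    if via_silva and len(ncbi_positions) > 1:
        return entries[ncbi_positions[1]][2]
    return entries[first][2]


# Invented identifiers, for illustration only.
print(pick_ncbi("ncbi:111,silva:AB000001,ncbi:222"))
print(pick_ncbi("ncbi:10239,gbif:8,irmng:19"))
```

If the SILVA-derived ID is the only ncbi entry, it is still returned, since there is no alternative to fall back on.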

hyanwong commented 8 years ago

Attached a rough python script. I haven't tested it yet, and it will probably need debugging, but it should give you the general idea.

Yan

get_wikidata_taxonQid.py.zip

hyanwong commented 8 years ago

Hope the script is helpful. I just realised that I didn't get it to actually print out the Wikidata QID for each line (doh), but I guess you can fix that easily. Is there any ETA for a csv matching file? Especially one with Encyclopedia of Life <=> OpenTree IDs? I don't know if @dimus has provided an EoL ID dump to BioGUID yet? So don't know how possible this is?

EIonv commented 2 years ago

create csv files with pi to easily find unassigned