dump names list with URIs for sources

chinchliff commented 11 years ago

For compatibility with clients such as the phylotastic TNRS (Naim Matasci?) and other tools interested in the sources of our taxonomic names.

The specifics of the format are unclear. At a minimum, we would want to provide, for every name in OTToL:

the taxon name
the source(s)

Presumably we want this name dump to be versioned: build the list each time we (re)build the taxonomy graph, and give it the same version as that build.

jar398 commented 11 years ago

Just a thought, but I'm not sure the whole URL is the best representation. Consider http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=2759 as an example. The prefix may vary depending on NCBI's whims at the time (its URLs are not necessarily stable), or on the needs of the consuming software. It might be better to use a simple CURIE such as ncbi:2759, or a pair {"ncbi", 2759}, which can be recognized, taken apart, and put back together more easily. Then we can document what we mean by "ncbi", and software consuming our data can process the names according to its preferences.
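To make the CURIE idea concrete, here is a minimal Python sketch of taking a prefixed id apart, putting it back together, and expanding it to a URL. The function names and the `URL_TEMPLATES` table are hypothetical; the only template shown is the NCBI URL quoted above, and treating the id as its last parameter is an assumption.

```python
# Hypothetical prefix table. The "ncbi" template is the URL quoted above,
# with the id replaced by a placeholder; this is an assumption, not a
# documented NCBI contract.
URL_TEMPLATES = {
    "ncbi": "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id={id}",
}


def parse_curie(curie):
    """Split a string such as 'ncbi:2759' into ('ncbi', '2759')."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefix:id pair: {curie!r}")
    return prefix, local_id


def make_curie(prefix, local_id):
    """Put a (prefix, id) pair back together as 'prefix:id'."""
    return f"{prefix}:{local_id}"


def expand_curie(curie, templates=URL_TEMPLATES):
    """Expand a CURIE to a URL using whatever templates the consumer prefers."""
    prefix, local_id = parse_curie(curie)
    return templates[prefix].format(id=local_id)
```

A consumer that prefers a different NCBI URL would pass its own `templates` mapping; the ids in the dump stay the same.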

rhr commented 11 years ago

At the Chicago meeting we talked about having NCBI-style CSV dumps, e.g., with separate files for the nodes in the hierarchy and for the names.
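As a rough illustration of the two-file idea, the Python sketch below writes one CSV stream for nodes and one for names. The column names (`id`, `parent_id`, `name`), the `write_dumps` function, and the sample records are all made up for the example; nothing here reflects an agreed format.

```python
import csv
import io


def write_dumps(taxa, nodes_out, names_out):
    """Write one row per node and one row per name to two CSV streams."""
    nodes = csv.writer(nodes_out, lineterminator="\n")
    names = csv.writer(names_out, lineterminator="\n")
    nodes.writerow(["id", "parent_id"])  # hypothetical columns
    names.writerow(["id", "name"])       # hypothetical columns
    for taxon in taxa:
        # The root has no parent, so its parent_id cell is left empty.
        nodes.writerow([taxon["id"], taxon["parent_id"] or ""])
        for name in taxon["names"]:
            names.writerow([taxon["id"], name])


# Made-up example data, not real taxonomy records.
taxa = [
    {"id": "1", "parent_id": None, "names": ["example root"]},
    {"id": "2", "parent_id": "1", "names": ["example name 0", "example synonym 0"]},
]
nodes_buf, names_buf = io.StringIO(), io.StringIO()
write_dumps(taxa, nodes_buf, names_buf)
```

Keeping names in their own file lets a node carry any number of names or synonyms without changing the node table.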

chinchliff commented 11 years ago

I am not sure how a node dump would be the same or different from what we are talking about here.

JAR, re: the URL comments, I agree. One of the uses of this name dump will be for phylotastic to know what is in our database, so we have proposed a simple JSON format for the name dump, available here:

https://gist.github.com/4675090

And a description of this format and how we should export these data is here, under the heading "Querying Treestores" (might have to scroll down a bit to find it):

https://docs.google.com/folder/d/0Bw-1ley90MKnaXZ1VHp4SWhJeFE/edit?docId=1HJv0hrnyldsY9R5IIxU-YSjNMwKWB7KU3zaNS2qDZLs

chinchliff commented 11 years ago

I have started implementing the name dump. Currently all we have are NCBI UIDs. If/when we start storing GBIF/OTToL UIDs (or UIDs for other sources that are provided by GBIF) in the same way we store the NCBI UIDs, the dump will include those as well.

jar398 commented 11 years ago

On Wed, Jan 30, 2013 at 3:32 PM, chinchliff notifications@github.com wrote:

> I am not sure how a node dump would be the same or different from what we are talking about here.
>
> JAR, re: the URL comments, I agree. One of the uses of this name dump will be for phylotastic to know what is in our database, so we have proposed a simple JSON format for the name dump, available here:
>
> https://gist.github.com/4675090

I don't understand why we need either the treestoreId or the ottol id. I propose we use a prefixed sourceId as the only id for a name. Creating new names when unnecessary is always a bad idea since the new names have to be explained, maintained, resolved, etc. and it's always better to get someone else to do this work for you.

"names":[ { "name":"example name 0", "treestoreId":"ncbi:12345", "sourceIds":{ "ncbi":"12345", "gbif":"56789" } },

or even just

"names":[ { "name":"example name 0", "sourceIds":[ "ncbi:12345", "gbif:56789" ] },

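The two shapes of "sourceIds" above carry nearly the same information, and a short Python sketch shows how a consumer might convert between them. The function names are hypothetical. One assumption worth flagging: the mapping form holds only one id per source, so converting a list with two ids from the same source would keep just the last one.

```python
def source_ids_to_curies(source_ids):
    """Turn {"ncbi": "12345", "gbif": "56789"} into ["ncbi:12345", "gbif:56789"]."""
    return [f"{prefix}:{local_id}" for prefix, local_id in source_ids.items()]


def curies_to_source_ids(curies):
    """Turn ["ncbi:12345", "gbif:56789"] back into a prefix-to-id mapping.

    Assumes at most one id per source; a later duplicate prefix overwrites
    an earlier one.
    """
    result = {}
    for curie in curies:
        prefix, _, local_id = curie.partition(":")
        result[prefix] = local_id
    return result
```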
> And a description of this format and how we should export these data is here, under the heading "Querying Treestores" (might have to scroll down a bit to find it):
>
> https://docs.google.com/folder/d/0Bw-1ley90MKnaXZ1VHp4SWhJeFE/edit?docId=1HJv0hrnyldsY9R5IIxU-YSjNMwKWB7KU3zaNS2qDZLs


jar398 commented 11 years ago

That is, use ncbi:12345 whenever we might have been tempted to refer to an ottol id. Don't create a fresh ottol id in the first place, just re-use someone else's id as the ottol id. Inventing ids for things that already have ids serves no purpose that I can discern.

chinchliff commented 11 years ago

Makes perfect sense to me. The advantage to having a treestoreId is that it allows a simple way for clients (e.g. phylotastic) to reference names. But I can't see any reason why it shouldn't be something someone else made up. So, "ncbi:1234" format seems fine to me. Externally, these show up as strings. Internally, we can parse the bits and reference the nodes using just the source id itself.
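A minimal Python sketch of that split between the external string and the internal lookup might look like the following. The `nodes_by_source` structure and the `resolve` function are hypothetical stand-ins for however the graph actually stores nodes, and the record is made-up data.

```python
# Hypothetical internal store: source name -> source id -> node record.
nodes_by_source = {
    "ncbi": {"1234": {"name": "example name 0"}},
}


def resolve(treestore_id, nodes_by_source):
    """Look up a node from an external id string such as 'ncbi:1234'.

    Returns None when the source or the id is unknown.
    """
    source, _, source_id = treestore_id.partition(":")
    return nodes_by_source.get(source, {}).get(source_id)
```

Clients only ever see and send back the "ncbi:1234" string; the prefix and id are separated at the boundary.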

The "ottol ids" crept in there by mistake. They've been removed.

kcranston commented 11 years ago

The current version is missing the outer set of braces needed to make the format {'metadata':'everything else'}
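One way to avoid that kind of mistake is to build the dump as a single object and let a JSON library emit the braces. The Python sketch below assumes top-level "metadata" and "names" keys based on the snippets in this thread; the `build_dump` function, the "version" field, and its value are placeholders, not the actual gist format.

```python
import json


def build_dump(metadata, names):
    """Serialize the whole dump as one JSON object with outer braces."""
    return json.dumps({"metadata": metadata, "names": names}, indent=2)


# Made-up example content; the "version" field is hypothetical.
dump_text = build_dump(
    {"version": "example-version"},
    [{"name": "example name 0", "sourceIds": ["ncbi:12345", "gbif:56789"]}],
)
# Parsing the text back is a cheap check that the output is well-formed.
parsed = json.loads(dump_text)
```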

chinchliff commented 10 years ago

I think this entire thread is obsolete; smasher now handles taxonomy versioning and its output is a comprehensive dump of names, synonyms, etc. It does not seem to make sense to have this duplicated in taxomachine.

OpenTreeOfLife / taxomachine

dump names list with URIs for sources #8