OpenTreeOfLife / treemachine

Source tree graph database

Subtree methods need to return source references #163

Closed jar398 closed 8 years ago

jar398 commented 9 years ago

When someone uses the API to get phylogenetic information such as a subtree or subtended tree, it's important to relay the sources of that information, so that they can (a) check it, (b) learn more, and (c) cite it. Sources are also important for us to acknowledge the contribution (with gratitude).

This should be done compatibly, either with new methods returning both the tree and the sources, with a parameter specifying that both be returned instead of just the tree, or with separate methods that return just the sources.

josephwb commented 9 years ago

So, any source that touches any node in the returned tree? Source X may support one node in the returned tree, but reject a bunch of others. This could be confusing. For example, a microbes tree may disagree with the rooting of metazoa, but because it agrees with some trivial terminal clade it will be returned as a source for the whole tree.

Or am I making this harder than it should be? Just a list? Easy-peasy.

josephwb commented 9 years ago

Alternatively, node-specific supporting sources are possible, but could become a large file...

josephwb commented 9 years ago

Just to make it more complicated: are we interested in sources that support actual edges in the returned tree (i.e. source passes through both the parent and child node)? For sparse trees, there may be no such supporting sources (well, maybe taxonomy).

jar398 commented 9 years ago

What are we doing now for arguson?

On Fri, Feb 6, 2015 at 11:31 AM, Joseph W. Brown notifications@github.com wrote:

Just to make it more complicated: are we interested in sources that support actual edges in the returned tree (i.e. source passes through both the parent and child node)? For sparse trees, there may be no such supporting sources (well, maybe taxonomy).

jar398 commented 9 years ago

The idea is this: Any tree that's returned constitutes a set of claims about how evolution happened. The custom in science is to back up one's claims either with evidence or with a citation. So what are the publications that back up the claims? It's only necessary to give a sufficient set, not an exhaustive set. And yes, if taxonomy is all we have, that is what we say backs up the claims.

Edges are not claims; the claims are things like A and B are closer to one another than they are to C.
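To make the notion of a claim concrete, here is a minimal sketch of checking whether a tree makes the claim "A and B are closer to one another than they are to C". It is only an illustration: the nested-tuple tree representation and the function names are assumptions made for this example, not anything treemachine uses.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def mrca(tree, names):
    """Return the smallest subtree whose leaves include every name in `names`."""
    if not isinstance(tree, str):
        for child in tree:
            if names <= leaves(child):
                return mrca(child, names)
    return tree


def displays_triple(tree, a, b, c):
    """True if the tree claims a and b are closer to each other than to c."""
    ab = mrca(tree, {a, b})
    # The claim holds when c is in the tree but outside the clade joining a and b.
    return c in leaves(tree) and c not in leaves(ab)


# Hypothetical four-taxon tree: ((A,B),C),D
example = ((("A", "B"), "C"), "D")
print(displays_triple(example, "A", "B", "C"))
print(displays_triple(example, "A", "C", "B"))
```

Under this reading, a source backs up a returned tree to the extent that it displays the same claims, whether or not it passes through the same edges.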

Jonathan

kcranston commented 8 years ago

Pinging this issue again. Does the new synthesis format make it easier to return sources?

jar398 commented 8 years ago

It sure should. This is what https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-of-Life-APIs-v3#conflict-api-response-node-fields is about. The section claims to be about conflict but it is equally about support.

tm-lite has to ingest the annotations file in any case, so whenever it generates a tree, it can look up the support for every node in the subtree, finding any supported_by and partial_path_of annotations, which are marked with input trees.
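A rough sketch of that lookup, for illustration only. The in-memory shapes here are assumptions, not the real annotations file format: `children` maps a node id to its child ids, and `annotations` maps a node id to a dict from relation name to a list of input-tree ids. The node and tree ids are made up.

```python
def subtree_nodes(children, root):
    """Yield every node id in the subtree rooted at `root`, in preorder."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(children.get(node, [])))


def support_by_node(children, annotations, root,
                    relations=("supported_by", "partial_path_of")):
    """Map each annotated node in the subtree to its sorted input trees."""
    result = {}
    for node in subtree_nodes(children, root):
        node_ann = annotations.get(node, {})
        sources = set()
        for relation in relations:
            sources.update(node_ann.get(relation, []))
        if sources:
            result[node] = sorted(sources)
    return result


# Hypothetical data: n9 lies outside the requested subtree.
children = {"n1": ["n2", "n3"], "n2": ["n4", "n5"]}
annotations = {
    "n1": {"supported_by": ["s1@t1"]},
    "n2": {"supported_by": ["s2@t3"], "conflicts_with": ["s1@t1"]},
    "n3": {"terminal": ["s1@t1"]},
    "n9": {"supported_by": ["s9@t9"]},
}
print(support_by_node(children, annotations, "n1"))
```

Keeping the result keyed by node preserves the node-specific view discussed earlier in the thread; flattening it to a plain list is a separate, lossy step.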

josephwb commented 8 years ago

Please explicitly describe how you want these data presented.

josephwb commented 8 years ago

Have any design decisions been made about this? Gathering the data is easy; how do you want it returned? Arguson is a possible model.

jar398 commented 8 years ago

Design hasn't happened yet. I have assigned this issue to me and will hand it back to you when it's time to implement something.

kcranston commented 8 years ago

Pinging this issue again. It came up during the Phylotastic call today - they are returning OpenTree trees from the induced_subtree and subtree, and would like to provide a list of sources for users. For subtree with arguson, this info is already there, but not for subtree with newick, or for induced_subtree.

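Couple of design questions:

  • return the full support map, or simply a list of supporting studies? I lean slightly towards simply adding a second key to the returned json (something like supporting_studies) which returns a list of studies
  • we will need to return more than 'study_id@tree_id' for this information to be useful (implying a call to other APIs)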

jar398 commented 8 years ago

Yes, I think just a list of study ids as an additional result, and then maybe we can have a separate OTI method that takes this list as input, and returns study metadata as output?

On Thu, May 5, 2016 at 6:13 PM, Karen Cranston notifications@github.com wrote:

Pinging this issue again. It came up during the Phylotastic call today - they are returning OpenTree trees from the induced_subtree and subtree, and would like to provide a list of sources for users. For subtree with arguson, this info is already there, but not for subtree with newick, or for induced_subtree.

Couple of design questions:

  • return the full support map, or simply a list of supporting studies? I lean slightly towards simply adding a second key to the returned json (something like supporting_studies) which returns a list of studies
  • we will need to return more than 'study_id@tree_id' for this information to be useful (implying a call to other APIs)

kcranston commented 8 years ago

To clarify, which of the following are you suggesting:

  1. We return a list of studyIDs to the user, and then provide a separate (new) service that they can use to look up publication information for a list of studyIDs
  2. We perform this lookup before returning the data to the user, so that they get a list of publication references and / or DOIs with their subtree

jar398 commented 8 years ago

1.

jar398 commented 8 years ago

  1. Should the tree_of_life methods in question always return this extra information, or only when requested?
  2. Should the methods return a list of annotations, or a list of trees, or a list of studies?
    • Annotations: It's weird to return individual annotations (available through arguson) without indication of which node is annotated. Unprofitable complexity.
    • Trees: If a list of trees, we could reuse the source_id_map format, and that might simplify clients that already know how to process source id maps; but if client just wants a study list, handling a tree list is a burden.
    • Studies: List of study ids is pretty easy to process, but the client might care which tree(s) in the study matter.
  3. Which annotations should affect the result (annotation/tree/study list)?
    • supported_by - yes
    • partial_path_of - not sure. maybe not, as these only corroborate other sources ??
    • resolves - no (we don't care if the synth tree resolves a node in an input tree)
    • resolved_by - no (doesn't happen? node would have been incorporated in synth tree)
    • terminal - no, we are citing sources for the relationships they provide, not the taxa
    • conflicts_with - no

As Joseph says, the implementation is pretty straightforward once we decide exactly what we want. If it's not completely clear, can we maybe get prospective users to weigh in?

kcranston commented 8 years ago

I am mostly concerned with providing a citation list along with a subtree so that data contributors get credit. People can request arguson if they want gory details. Also, given this use case, I think we should at least consider returning more information than our internal study identifier.

As for which types of annotations get included in the list? Definitely support, and definitely not terminal, resolve*, or conflict. Not sure about partial_path... maybe not?
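The filtering and the tree-versus-study choice discussed above could be sketched as follows. This is a hedged illustration, not the implemented behaviour: the per-node annotation dicts, the function names, and the ids are hypothetical, and only the 'study_id@tree_id' form comes from the thread. By default only supported_by counts; partial_path_of can be switched on through the `relations` argument.

```python
def supporting_trees(per_node_annotations, relations=("supported_by",)):
    """Return distinct 'study_id@tree_id' strings in first-seen order."""
    seen = {}
    for node_ann in per_node_annotations:
        for relation in relations:
            for source in node_ann.get(relation, []):
                seen.setdefault(source, None)
    return list(seen)


def supporting_studies(tree_refs):
    """Collapse 'study_id@tree_id' strings to distinct study ids."""
    seen = {}
    for ref in tree_refs:
        study_id = ref.split("@", 1)[0]
        seen.setdefault(study_id, None)
    return list(seen)


# Hypothetical annotations for three nodes of a returned subtree.
node_annotations = [
    {"supported_by": ["s1@t1", "s2@t1"], "conflicts_with": ["s3@t1"]},
    {"supported_by": ["s1@t2"], "partial_path_of": ["s4@t1"]},
    {"terminal": ["s5@t1"]},
]
trees = supporting_trees(node_annotations)
print(trees)
print(supporting_studies(trees))
```

Returning the tree list and deriving the study list from it keeps both options open for clients, at the cost of one extra step for those who only want studies.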

jar398 commented 8 years ago

Treemachine doesn't have access to any 'more information'; only OTI has the DOI and reference. (well, and phylesystem.) Having a single service that returns both kinds of information could be done, but it's an architectural nightmare (errors, testing, configuration, deployment...) given the way things are designed now. Is two method calls really out of the question? They would simply be passing the list through, they wouldn't have to process it in any way. That is, I imagine a new OTI call that's specifically for this purpose.
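From the client side, the two-call pattern might look like the sketch below. Everything in it is a stand-in: `get_subtree` and `lookup_studies` are hypothetical in-memory stubs rather than real treemachine or OTI endpoints, the supporting_studies key follows the name floated earlier in the thread, and the metadata values are placeholders.

```python
# Placeholder metadata standing in for what OTI would know about each study.
FAKE_STUDY_INDEX = {
    "s1": {"reference": "Placeholder reference one", "doi": None},
    "s2": {"reference": "Placeholder reference two", "doi": None},
}


def get_subtree(node_id):
    """Stand-in for the treemachine call: a tree plus bare study ids."""
    return {"newick": "((A,B),C);", "supporting_studies": ["s1", "s2"]}


def lookup_studies(study_ids):
    """Stand-in for the proposed OTI call: study ids in, metadata out."""
    return {sid: FAKE_STUDY_INDEX.get(sid) for sid in study_ids}


def subtree_with_citations(node_id):
    """Make both calls, passing the study list through unprocessed."""
    first = get_subtree(node_id)
    metadata = lookup_studies(first["supporting_studies"])
    return {"newick": first["newick"], "citations": metadata}


print(subtree_with_citations("n1"))
```

The client never interprets the study ids; it only forwards them, which is what keeps the two services decoupled.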

jar398 commented 8 years ago

Rather than make treemachine call out to OTI, I guess it could scan phylesystem, or load a file prepared for it by some script. That would work, but again makes things more fragile (installing peyotl, rerunning the script when a new tree is deployed, etc.)

kcranston commented 8 years ago

I am going to send an email to the opentreeoflife group to see what people think. We also want to implement this through the tree browser 'download subtree' link, where requiring a second call would be really awkward. (Although, I suppose that the browser already has the supporting list, so could add that to the download fairly easily).

jimallman commented 8 years ago

Yes, or it would be easy for the tree browser to fetch the main subtree, then fetch and incorporate more information.

jar398 commented 8 years ago

Waiting to hear back from @kcranston on the outcome of the consultation.

jar398 commented 8 years ago

Since the PR was posted for a while, and is now merged, I take it that the solution that I implemented is satisfactory. I'm closing the issue.

jar398 commented 8 years ago

Follow-on issue is here: https://github.com/OpenTreeOfLife/oti/issues/54