OpenTreeOfLife / muriqui

An annotation database for decorating nodes in trees
BSD 2-Clause "Simplified" License

publish target definitions #15

Open kcranston opened 10 years ago

kcranston commented 10 years ago

The annotation DB will contain many different targets, and it would be interesting to be able to easily see / query those definitions. This could also potentially lead to re-use of definitions for the same clade - "you defined clade x as blah blah blah, and we also have these definitions that point to that same clade in the synthetic tree"

nfranz commented 10 years ago

Hi all:

To me (probably not just me), the "Open" also means (in addition to Open source, Open access), "Open (= indefinite) chain of revisions/updates of phylogenetic hypotheses". More or less.

I'm clearly not close to the end of my personal journey here, but am suggesting that the open-ended-ness can be brought out more clearly and consistently if some ways of speaking are used.

I was curious to learn last week (only) that DarwinCore defines the term "Taxon" (http://rs.tdwg.org/dwc/terms/index.htm#Taxon) as:

"The category of information pertaining to taxonomic names, taxon name usages, or taxon concepts."

Category of information..? I thought it was more like this: http://en.wikipedia.org/wiki/Taxon (= some natural entity of..stuff/processes "out there").

What I am getting at is this: there are two often intersecting contexts in which we use terms such as taxon, group, clade, monophylum, etc. The contexts do often intersect, but have slightly different flavors or domains of application. I'll call context 1 the "speak about taxa" context, and context 2 the "represent taxon information" context.

Context 1 is I think how we usually talk to each other. There is an assumption, either openly acknowledged or at least enacted (sometimes hypocritically I suppose), that taxa are natural, causally/evolutionarily sustained entities "out there", that we have some epistemic access to their identity and boundaries, and generally it's ok to talk and act that way. (even though we are still "reconstructing")

Context 2 is how I think we ought to translate the very legitimate but abbreviated conventions of context 1 into an open system in the above sense. In this second context we might want to be closer to the DwC notion (however unintuitive when [e.g.] giving an evolutionary presentation). Taking this a bit further, then, in that second context we may only wish to speak about inter-subjective (human to human) mental representations and their interconnections. "Taxon" becomes "perceived taxon", "clade" becomes "inferred clade", etc.

Why all this? Because of the "same clade". And we all understand that phrase, but this way of speaking falls into context 1, right? It might be read to make claims about identity that do not hold at a more granular level of information representation in the OT environment.

I obviously think there is some value in an exercise where one does context 2 all the way. It's not that easy to get there, but presumably easier to then scale back and converge more on context 1 again. In context 2, strictly speaking, "same" has a very narrow meaning (same identifier in the database), and "taxon" or "clade" are not needed at all.

Accordingly, "same clade in the synthetic tree" (context 1) becomes:

"phylogenetically congruent clade hypotheses, published independently (hence with different identifiers), that make non-conflicting contributions to the topology of OT version X" (context 2).

Perhaps it is apparent how context 2 may inform annotation practices that are ultimately of value.

Hopefully I also managed to express that ascribing a full reality to taxa and not actually talking like that when it comes to information representation are actually compatible notions.

Sorry if this was TL-DR.

Best, Nico

hlapp commented 10 years ago

The DarwinCore definition of "Taxon" that Nico quotes above is being revised in the round of changes approved by the executive at TDWG 2014. -hilmar

Hilmar Lapp -:- lappland.io

nfranz commented 10 years ago

Thanks, Hilmar.

Incidentally there is a related, upcoming Berkeley "BIGCB" workshop. I will try to document (blog/tweet) some of the talks and outcomes from this 3-day event.

http://taxonbytes.org/bigcb-workshop-at-uc-berkeley-tackling-the-taxon-concept-problem/

Nico

kcranston commented 10 years ago

Hi Nico, Thanks for the thoughts. I think - if I understand you correctly - that our idea of an annotation database separate from the synthetic tree supports the distinction between your contexts 1 and 2. When I say "same clade", I mean the node in a given version of the synthetic tree. The annotation targets (concepts) do not change, but their presence / absence / placement on the tree might change from version to version. So, for a particular synthetic tree, we can then ask "how many of these annotation targets map to the same node in the tree?". By being flexible with how people can define the targets, we allow different taxon concepts for different use cases.
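The question "how many of these annotation targets map to the same node?" can be illustrated with a small sketch. Everything here is an assumption made for the example: the toy tree, the target names, and the choice to define each target by a set of specifier tips resolved through their most recent common ancestor. It is not the muriqui data model.

```python
from collections import defaultdict

# Hypothetical synthetic tree version: child -> parent (the root has parent None).
PARENT = {"root": None, "n1": "root", "n2": "n1",
          "A": "n2", "B": "n2", "C": "n1", "D": "root"}

def ancestors(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def mrca(tips):
    """Most recent common ancestor of the given tips in this tree version."""
    tips = list(tips)
    common = set(ancestors(tips[0]))
    for t in tips[1:]:
        common &= set(ancestors(t))
    # Walking upward from any tip, the first shared node is the deepest one.
    return next(n for n in ancestors(tips[0]) if n in common)

# Hypothetical target definitions: target name -> specifier tips.
TARGETS = {"target_x": {"A", "C"}, "target_y": {"B", "C"}, "target_z": {"A", "B"}}

def group_targets_by_node(targets):
    """Group target names by the node they resolve to in this tree version."""
    groups = defaultdict(list)
    for name, specifiers in sorted(targets.items()):
        groups[mrca(specifiers)].append(name)
    return dict(groups)

print(group_targets_by_node(TARGETS))
```

In this toy tree, target_x and target_y land on the same node even though their definitions differ, which is the kind of overlap a published list of target definitions would expose.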

Does that make sense, or am I missing something? @jar398 might also have some input here.

mjy commented 10 years ago

Can someone provide examples of target definitions that are not defined in the phylocode? It seems to me that this real life use case is precisely that which the phylocode seeks to address.

jar398 commented 10 years ago

Example definition not defined by phylocode: anything that is character-based. For example, you may want to annotate a higher taxon in the taxonomy, e.g. Mammalia, under the assumption that (a) the character based definition is either well known or can be located in the literature, and (b) it is a clade.

I know some people don't like to admit claims like this, but they are common in biology (at least as hypotheses). But I think they qualify as an example, which is what you requested.

mjy commented 10 years ago

Some clarification- I was assuming (likely incorrectly) that target definitions were those being used in some computable manner, i.e. not asserted as in the nomenclatural pipeline. Since character data are not stored in OT I assumed this wasn't an option.

If definitions are simply user asserted annotations then why worry about target definitions, just make a generic "tagging" system and let users come up with sets of attributes (tags on tags) that they find useful?

jar398 commented 10 years ago

I find the annotation enterprise epistemologically troublesome. It would help me if I could see a list of use cases, i.e. actual claims that one would want to express and store.

I think it's very important to distinguish real-life clades, the physical / biological entities, from the information structures comprising the synthetic tree. So I am with Nico in warning against saying "clade" when you probably mean "node". If these aren't distinguished then, among other things, there is no way to speak rationally about how annotations are to be transferred from one tree to another.

To rigorously interpret a node N as a designator for a clade, you would have to do the following:

  1. Consider the set T of tip nodes of the synthetic tree.
  2. Interpret each tip node t in T as a clade cl(t).
  3. Let tips(N) = the "descendent" tips of N in the tree.
  4. Consider the clades that (a) contain cl(t) for all t in tips(N), AND (b) do not contain any of the clades cl(t) for t in T-tips(N). There could be many such clades, or there could be none. By (a) and the nature of cladeness, these clades will all either be or contain the real-life MRCA of cl(t) for t in tips(N).
  5. Select one of these clades as cl(N) = the one that N designates.

An annotation (a biological claim) expressed in relation to node N could be about any of these clades (4). The claim could even be vague on this point, saying it's about one of these clades but it's not known which. That's how a claim of an ancestral trait would be.
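Steps 1 to 4 above can be sketched in code. This is only an illustration under stated assumptions: the tree and node names are invented, and a real-life clade is modelled, very crudely, as the set of tip clades it contains (possibly including unsampled ones).

```python
# Hypothetical synthetic tree: node -> list of children; tips have no entry.
CHILDREN = {"root": ["N", "t4"], "N": ["M", "t3"], "M": ["t1", "t2"]}

def tips(node):
    """Step 3: the descendant tips of a node (a tip is its own only tip)."""
    kids = CHILDREN.get(node)
    if not kids:
        return {node}
    result = set()
    for k in kids:
        result |= tips(k)
    return result

def constraints(node, root="root"):
    """Steps 1 and 4: tips a candidate clade must contain, and tips it must not contain."""
    all_tips = tips(root)               # the set T
    inside = tips(node)                 # tips(N)
    return inside, all_tips - inside    # (a) include, (b) exclude

def is_candidate(clade_members, node):
    """Check conditions 4(a) and 4(b) for a clade modelled as a set of tip clades."""
    include, exclude = constraints(node)
    return include <= clade_members and not (exclude & clade_members)

# Two different candidates both satisfy the constraints for N (step 5 must pick one).
print(is_candidate({"t1", "t2", "t3"}, "N"))
print(is_candidate({"t1", "t2", "t3", "unsampled_lineage"}, "N"))
```

Both calls succeed, which mirrors the point that there may be many clades consistent with a node; a candidate that includes t4 or omits t3 would fail.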

The hard part comes when we update the synthetic tree, i.e. we compute a new tree2 different from tree1. There may or may not be a node in tree2 that designates the same clade according to this formula, or the same range of clades. It is possible to set up a correspondence between trees with the property that equivalent nodes can consistently be taken to designate the same clade.

The problem case where N 'becomes paraphyletic' when we go from tree1 to tree2 is pretty obvious - N is not DCC-5 equivalent to any node in tree2. (This may or may not reflect N failing to designate a clade; it could just be a loss of resolution.) But another problem case is where a new node is 'inserted' as a sibling of N, i.e. N aligns with N' and parent(N) aligns with M' and M' is not parent(N'). Can you say whether a given N-annotation applies to N' or to its parent? Probably not.

Re 4(b), you can't just say that the annotation is about the MRCA of cl(t) for t in tips(N). The placement in tree2 of a tip that's not in tree1 in or out of N' could affect the support for the annotation. Similarly a tip that's in tree1 could be missing from tree2, effectively retracting any claim that it's in or not in cl(N'). So you have to pay attention to all the tips, not just the ones under the node of interest.
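The carry-over problem can be made concrete with a rough sketch. The matching rule used here (a tree2 node matches N if it has the same descendant tips once both are restricted to the tips the two trees share) is an assumption chosen for illustration, not the correspondence described above, and the trees are invented.

```python
def tip_sets(children, root):
    """Map every internal node to its set of descendant tips; also return all tips."""
    out = {}
    def walk(node):
        kids = children.get(node, [])
        if not kids:
            return {node}
        s = set()
        for k in kids:
            s |= walk(k)
        out[node] = s
        return s
    all_tips = walk(root)
    return out, all_tips

def carry_over(node, tree1, tree2, root="root"):
    """Find tree2 nodes that agree with `node` of tree1 on the shared tips."""
    sets1, all1 = tip_sets(tree1, root)
    sets2, all2 = tip_sets(tree2, root)
    shared = all1 & all2
    want = sets1[node] & shared
    matches = sorted(n for n, s in sets2.items() if s & shared == want)
    # Tips outside the node of interest still matter, so report what changed overall.
    return {"matches": matches,
            "new_tips": sorted(all2 - all1),
            "dropped_tips": sorted(all1 - all2)}

# Hypothetical trees: tree2 inserts a node P above N2, with a new tip e.
TREE1 = {"root": ["N", "d"], "N": ["a", "b", "c"]}
TREE2 = {"root": ["P", "d"], "P": ["N2", "e"], "N2": ["a", "b", "c"]}
# Hypothetical tree in which a, b, c no longer form a group.
TREE3 = {"root": ["X", "c", "d"], "X": ["a", "b"]}

print(carry_over("N", TREE1, TREE2))
print(carry_over("N", TREE1, TREE3))
```

In the first case two nodes match, so the sketch cannot say whether an N-annotation belongs on N2 or on its parent P; in the second case nothing matches at all.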

Now maybe I'm being rabid, and the annotations ought to be attached to the synthetic tree somewhere convenient without such nit-picky regard for logical soundness. If there is a reasoned argument for every annotation (e.g. via a citation), anyone can just go look at the original argument to figure out what it's saying exactly and whether it's consistent with any particular tree hypothesis, should there be any doubt. (The evidence for a claim - placed on the current synthetic tree - might even include a particular older version of the synthetic tree.) If the support and rationale are captured formally we even have a chance of reasoning about the claim using tools.

Jonathan

nfranz commented 10 years ago

Thank you, Karen (et al.).

I tend to think that the less abstract, the easier to understand and ultimately converge. The below example is likely not perfectly suited (and not "real") but maybe gets us closer.

Suppose that OT version 12 includes a section in its topology for Senecio, a genus (concept) of the daisy family (concept). The purported Senecio clade (genus, possibly some subgenera, then species) as shown in that OT version is entirely "grounded" in a single phylogeny resource (citation) submitted through PhyloGrafter. Say, "Pelser et al. 2007" are the authors of that phylogeny reference.

So then at time 0, maybe the annotation database ought to be able to say: "Version OT12 contains a clade Senecio with subsumed clades named ..., and this information is referenced to Pelser et al. 2007. Make annotations accordingly." Do I see three identifiers here or at least three groupings of data bits that amount to the following functionality?

  1. for "Senecio" (the name),
  2. for Senecio as circumscribed by Pelser et al. 2007 (who presumably get an ID too) (the concept), and
  3. for 2. as integrated specifically into OT12?
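One way to picture the three identifiers is as three small record types. This is a sketch only; the class and field names are hypothetical and are not taken from the annotation database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxonName:
    """1. The name alone."""
    text: str

@dataclass(frozen=True)
class TaxonConcept:
    """2. The name as circumscribed by a particular source."""
    name: TaxonName
    according_to: str

@dataclass(frozen=True)
class Placement:
    """3. A concept as integrated into one version of the synthetic tree."""
    concept: TaxonConcept
    tree_version: str

senecio = TaxonName("Senecio")
senecio_pelser = TaxonConcept(senecio, "Pelser et al. 2007")
in_ot12 = Placement(senecio_pelser, "OT12")
in_ot13 = Placement(senecio_pelser, "OT13")

# Same concept carried through two tree versions, but two distinct placements.
print(in_ot12.concept == in_ot13.concept, in_ot12 == in_ot13)
```

Keeping the concept separate from its placement means an annotation can point at either level, depending on whether it is meant to survive a new tree version.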

Versioning scenario A, time 1. OT13 gets published, incorporating no changes at or under Senecio. Pelser et al. 2007 are still valid, and exclusively so. Annotations happen. I suppose they happen over "Senecio as circumscribed by Pelser et al. 2007 as integrated into OT13?". Maybe that last part (update to OT13) is not needed. Either way, I would personally prefer the cumbersome but maybe more clear context 2 speak here: OT12 and OT13 congruently include a concept of Senecio as circumscribed by Pelser et al. 2007. "Same clade" works well enough as a shorthand here though.

Versioning scenario B, time 1. OT13 gets published, but now part but not all of Pelser et al.'s 2007 phylogenetic rendering as it pertains to the concept of Senecio is replaced (at lower levels) with concepts published in Watson et al. 2015.

I think at that point, both in terms of speaking, and for the purpose of database and annotation structuring, we are increasingly in the context 2 realm, where mention of "taxon", "clade", "node" might be fruitfully adjusted to "taxon concept" (which has a taxon concept label: taxonomic/clade name + reference; see 2 above), "clade concept/hypothesis", "node concept/definition", and so forth. All of these effectively carry "according to's". At the database representation level, there is a fair bit of granularity. Lots of linking may have to happen. Much of that could likely be automated. Not all granularity and linking need be exposed very obviously to every user.

Some of this is clearly academic. As I said, we humans tend to understand each other well either way. But I think too that OT has an opportunity to improve our community's syntactics and semantics as we explore open-ended tree hypotheses assembly systems. Put differently, if an arguably paradigmatic shift from a one-off culture of publication to an open-ended but credit-aware system did not also force us to revise our ways of speaking, wouldn't that be surprising? OT could be seen as generating a new environment of phylogenetic hypothesis linking that was not there before. Developing ways of speaking for that context need not negate the value of simpler ways of speaking in the more traditional contexts (like Pelser et al. 2007 in isolation), where provenance of concepts is readily inferred ("this publication, duh") and taxon and clade names map rather directly to a tree structure that is less of a composite than higher-level sections of OT versions might likely be.

Hopefully this is a productive post on a topic that is of great concern to me.

Nico

mjy commented 10 years ago

" It would help me if I could see a list of use cases, i.e. actual claims that one would want to express and store." <- Yes, this is likely the only way to really resolve this.

IMO there are 2 worlds. 1) clades are asserted to exist vs. 2) clades are calculated from data. IMO OT is completely within 1.

When you seek to persist clades/nodes across OTs, then you must ask, are you in world 1), or world 2)?

For argument's sake I claim that unless you begin to calculate on data (or, more broadly annotations on nodes), you will always be in 1). Regardless of who states what about concept T at time X, if you can't recalculate based on the data, you're stuck with an assertion. This was the basis of my original observation in this thread, i.e. what then can you do that is not doable as defined in the phylocode (or maybe the phylocode doesn't work, but let's assume it does)?

If OT agrees to happily exist in world 1) (which is just fine), then there are many things that are easily done without overcomplicating things. Entomologists don't think to themselves, "I've got this nagging doubt, humans just might be insects!". For many (all?) practical purposes they never, ever have to do this. They do real work, on a daily basis, without ever worrying about the definition of insects expanding to include humans. If they are doing an insect phylogeny they also don't have to worry about birds, lizards, fish, or squirrels. This suggests to me that there are real clades, that can be represented as nodes, in the OT, and that these can persist across versions. Do we need a robust logical framework for this level of assertion/claim? I love the idea, but maybe it's over-engineering at some level.

M

nfranz commented 10 years ago

Thanks, Matt.

I think the point about the right amount of engineering is always well taken. (and my views in this thread are myopic on this one issue of a much larger undertaking)

Whether clade concepts are asserted (with data being "somewhere else") or inferred from data (provided and analyzed "right there in the same system") - one could still ask in each case how provenance might be tracked, at coarse or fine levels of granularity. Provenance can apply to direct evidence as well: http://onlinelibrary.wiley.com/doi/10.1111/j.1095-8312.2007.00847.x/abstract

I think (but may well be ignorant) the following example is challenging for the PhyloCode approach. You have two clade concepts, each with three children, at time 0:

0.clade1 with three children: 0.clade1_child1, 0.clade1_child2, 0.clade1_child3.

Then also 0.clade2 with three children: 0.clade2_child4, 0.clade2_child5, 0.clade2_child6.

Suppose that the PhyloCode node-based identity of 0.clade1 is set as the most recent common ancestor of 0.clade1_child1 and 0.clade1_child3.

Similarly, the identity of 0.clade2 is set as the 0.clade2_child4 and 0.clade2_child6 intersection (node).

At time = 1, new evidence/interpretation indicates that the respective phylogenetic positions of child2 and child5 should be "inverted"; so we obtain:

1.clade1 with children: 1.clade1_child1, 1.clade1_child5, 1.clade1_child3.

1.clade2 with children: 1.clade2_child4, 1.clade2_child2, 1.clade2_child6.

I believe under the PhyloCode application the clade definitions do not change. But we do have two taxonomies here at t = 0 versus t = 1 whose taxonomic content appears non-congruent at a more granular level. We could even presume that the stated synapomorphies of 0.clade1/1.clade1 and 0.clade2/1.clade2 are "the same" (= identical text strings; I would prefer to say they have congruent intensions). In the eyes of time = 1, the properties of child2 and child5 at time = 0 had been misdescribed.
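The example can be worked through in a few lines. This is a sketch under assumptions: the trees are encoded as plain dictionaries, the time prefixes are dropped from the labels, and a node-based definition is modelled simply as "the smallest node containing both specifiers".

```python
# The two hypothetical taxonomies from the example, at time 0 and time 1.
TREE_T0 = {"root": ["clade1", "clade2"],
           "clade1": ["child1", "child2", "child3"],
           "clade2": ["child4", "child5", "child6"]}
TREE_T1 = {"root": ["clade1", "clade2"],
           "clade1": ["child1", "child5", "child3"],
           "clade2": ["child4", "child2", "child6"]}

def tips(tree, node):
    """Descendant tips of a node (a tip is its own only tip)."""
    kids = tree.get(node, [])
    if not kids:
        return {node}
    result = set()
    for k in kids:
        result |= tips(tree, k)
    return result

def mrca(tree, specifiers, root="root"):
    """Smallest node whose descendant tips include every specifier."""
    node = root
    while True:
        nxt = [k for k in tree.get(node, []) if specifiers <= tips(tree, k)]
        if not nxt:
            return node
        node = nxt[0]

# Node-based definition of clade1: MRCA of child1 and child3. It never changes.
DEFINITION = {"child1", "child3"}

for label, tree in (("t=0", TREE_T0), ("t=1", TREE_T1)):
    node = mrca(tree, DEFINITION)
    print(label, node, sorted(tips(tree, node)))
```

The definition resolves to a node in both trees, yet the membership differs (child2 at time 0, child5 at time 1), so anything annotated on the basis of membership needs more than the definition to carry over.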

Best, Nico

jar398 commented 10 years ago

On Sat, Nov 1, 2014 at 4:36 PM, Matt notifications@github.com wrote:

> "It would help me if I could see a list of use cases, i.e. actual claims that one would want to express and store." <- Yes, this is likely the only way to really resolve this.

> IMO there are 2 worlds. 1) clades are asserted to exist vs. 2) clades are calculated from data. IMO OT is completely within 1.

I think what you mean by 1) is real-life clades and what is true about them, and by 2) hypotheses about clades. IMO open tree, as an information artifact, is squarely about both. This is because some of what it records is true, and some of what it records is only hypothesized (could be true or not).

In case 2) sometimes there will be a clade to which the hypothesis (such as membership or traits) applies, and sometimes not.

Of course you don't have a claim that can be judged true or not unless you have semantics that tells you what your data structures and calculations mean. I think that is what we are talking about when it comes to annotations.

> When you seek to persist clades/nodes across OTs, then you must ask, are you in world 1), or world 2)?

A clade will persist out in nature, unless we go out and make it extinct. What I'm talking about is how a claim related to a node in version 1 of the tree can be related to version 2 of the tree. So it's really about 'persistence' of access to claims through editions of the tree - regardless of whether the claim, or the tree, is true or not.

I think it's better to have the annotations not expressed in terms of any node in any edition of the synthetic tree, but rather to use phylocode or 'taxon concepts' to refer to clades. But if one wanted to interpret a node in the synthetic tree as a clade, what I'm saying is that it's not at all obvious how to do this. For me the obvious interpretation would be some clade that contains the tips or samples below that node, and does not contain the tips or samples not below that node, in that edition of the tree. But there may be many such clades, or no such clade.

> For argument's sake I claim that unless you begin to calculate on data (or, more broadly annotations on nodes), you will always be in 1). Regardless of who states what about concept T at time X, if you can't recalculate based on the data, you're stuck with an assertion. This was the basis of my original observation in this thread, i.e. what then can you do that is not doable as defined in the phylocode (or maybe the phylocode doesn't work, but let's assume it does)?

I don't get this. Data would be support for a claim. So you're talking about whether claims are supported or not. Phylocode is not about data or claims, it's a way to refer to clades. You can use phylocode to make unsupported claims, or to make supported claims.

> If OT agrees to happily exist in world 1) (which is just fine), then there are many things that are easily done without overcomplicating things. Entomologists don't think to themselves, "I've got this nagging doubt, humans just might be insects!". For many (all?) practical purposes they never, ever have to do this. They do real work, on a daily basis, without ever worrying about the definition of insects expanding to include humans. If they are doing an insect phylogeny they also don't have to worry about birds, lizards, fish, or squirrels. This suggests to me that there are real clades, that can be represented as nodes, in the OT, and that these can persist across versions. Do we need a robust logical framework for this level of assertion/claim? I love the idea, but maybe it's over-engineering at some level.

The problem is not with the clear-cut cases like comparing humans to insects or lizards. The problems come when you add a primitive insect-like fossil and there is disagreement over whether it's an insect or not (i.e. whether insect annotations should apply to it); or when you have a name whose circumscription (or assignment to a clade, if there is one) 'changes' from one edition of a source taxonomy to the next; or when a newer analysis 'moves' a taxon into or out of a clade (referenced using phylocode). We have thousands of names whose definitions seriously conflict between source taxonomies - one taxonomy says that a set of tips grouped by another taxonomy is not a clade, and vice versa. (perhaps neither is right, but both can't be.) We don't have enough information to know whether in each case two different 'taxon concepts' were applied, or whether the same 'taxon concept' held but someone had a change of heart over whether a subgroup satisfied that 'taxon concept'. And when a curator assigns an OTU to a named taxon we don't know what's going through their head. It's keeping track of what to do in these cases - and harder, explaining after the fact how decisions were made so that errors can be tracked down and corrected - that requires careful thought and engineering. There will be analogous situations regarding phylogeny: you make an annotation on mrca(A,B) assuming that C is in the clade and it turns out that a better hypothesis is that C isn't in mrca(A,B), and maybe the annotation loses its support in that case. This is going to happen a lot; it's not a disaster, but it's going to be hard to know what's meant and to be transparent about what happened when we advanced to newer tree hypotheses. Without a way to explain how things ended up the way they are in the taxonomy or synthetic tree, what we're doing isn't science.

Here's another example I'm struggling with: there are currently a couple of species in OTT that are misclassified as crustaceans instead of molluscs. When we fix this problem, there will be an incompatible 'change' in the membership of Arthropoda. Does this mean that the new group should get a new identifier? - after all its identity in some sense has changed. If so, annotations and OTU mappings linked to the old id have no home in the tree. It doesn't get a new id with the current taxonomy generator, which assumes that names are tied uniquely to taxon concepts (with some exceptions), but with a more principled system where groups are defined by membership or phylogenetic hypotheses, it might. This would have an impact on OTU mappings and annotation carryover. I don't have a good answer to this one, but am working on ways to anchor the semantics of ids.

The problem of transparency for identifier semantics and annotation carryover is real, and has to be solved regardless of whether we decide to "overengineer" or not.

Jonathan


mjy commented 10 years ago

I think I likely confuse rather than add to this discussion, so I'll tiptoe away after this.

My world 1/2 distinction appears to confuse. IMO any proposed species, taxon, clade is a hypothesis, so this does not factor into my distinction b/w one and two. 1/2 is a pragmatic distinction; it's related to how hypotheses burst into existence, then get referenced later on. World 2) is about how species and clade hypotheses are ("originally") defined; it pertains to data derived directly from instances of the (ultimately hypothesized) species/clades. For example I might gather DNA, anatomical, and behavioral data and then run an algorithm on these data; based on those results I hypothesize the existence of taxa/clades. Later, in world 1) someone points to my hypothesis, assumes it's a good one, and does a new study. They do not compute on any of the data I used to define my original hypotheses. An OT in this world 2) would necessarily reference specimens, and the data directly tied to those specimens. It would then define classes (clades/taxa) that classify those specimens based on the outcomes of the analysis of the underlying data. (Supertree methods do not count as world 2; I assert this, rather than back it up.)

OT could do something similar to what happens in world two, but abstracted away from specimen data a layer or two. Annotations (= data that can be used to define clades) can be added to OTUs. Clades can be defined as classes that are bound to a quantitative calculation on those annotations, i.e. they classify OTUs. In this scenario identifiers are provided for these clades, and from tree to tree they repopulate based on the data that is available. Their definition remains the same, a calculation/algorithm, therefore there is no need to change identifiers. Transparency is not an issue: when someone asks why X is in Y, you point to the algorithm that placed X in Y. Want to tweak the calculation that defines a clade? Mint a new identifier; it's demonstrably different because it actually calculates on data.
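A minimal sketch of that scenario, with every name invented for illustration: the identifier is bound to a rule over OTU annotations, membership is recomputed for each tree version, and a changed rule gets a new identifier. The "legs" annotation is a deliberately simplistic stand-in for real data.

```python
from typing import Callable, Dict, List

Annotation = Dict[str, object]

# Hypothetical registry: clade identifier -> rule that classifies an OTU by its annotations.
CLADE_RULES: Dict[str, Callable[[Annotation], bool]] = {
    "clade:six_legged:v1": lambda a: a.get("legs") == 6,
    # A tweaked calculation is a different definition, so it gets its own identifier.
    "clade:six_legged:v2": lambda a: a.get("legs") == 6 and a.get("wings", 0) >= 0,
}

def populate(rule_id: str, otus: Dict[str, Annotation]) -> List[str]:
    """Recompute the members of a rule-defined clade from the available OTU annotations."""
    rule = CLADE_RULES[rule_id]
    return sorted(otu for otu, ann in otus.items() if rule(ann))

# Hypothetical OTU annotations available when two successive trees are built.
OTUS_TREE1 = {"beetle": {"legs": 6}, "spider": {"legs": 8}}
OTUS_TREE2 = {"beetle": {"legs": 6}, "spider": {"legs": 8}, "ant": {"legs": 6}}

print(populate("clade:six_legged:v1", OTUS_TREE1))
print(populate("clade:six_legged:v1", OTUS_TREE2))
```

The identifier stays fixed while the membership repopulates, and the answer to "why is X in Y" is the rule itself.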

A final way of thinking about it. In OT there are publications, topologies, and taxa/clades. Now you want to add annotations to the system. The problem is that the system defines taxa/clades only via reference to a publication and topology. How can you expect a system to persist annotations on clades when the system does not define those clades based on those annotations?

arlin commented 10 years ago

On Nov 2, 2014, at 11:35 AM, Nico Franz notifications@github.com wrote:

> 0.clade1 with three children: 0.clade1_child1, 0.clade1_child2, 0.clade1_child3.
>
> Then also 0.clade2 with three children: 0.clade2_child4, 0.clade2_child5, 0.clade2_child6.

Nico, I like the approach of specifying an example of how relationships change. This discussion would be helped by a set of concrete examples of attributions and changes that reflect the kinds of problems that are likely to arise (maybe the 80% rule could work here).

To me, this raises the issue of why we want to traffic in clade concepts that purport to be stable by virtue of referring to an external reality, when in reality they reflect a limited view that is likely to change in the future. SFAIK it is generally agreed in the ontology world that ontological statements may be asserted as true based on the best available knowledge, even when they are hypotheses with some uncertainty (the whole issue of describing a conceptual world of hypotheses or posterior distributions is a separate matter). The problem with clades is just that the uncertainty is high enough, and they are so likely to change, that we are all here having an explicit discussion about how knowledge can persist through these changes.

Is there some more generic way to assign attributes that sticks closer to the evidence?

Rather than pinning a label “blue” (for instance) on a clade based on some research publication, let's say that the publication assigns “blue” based on an ordered split, where the ingroup is { child1, child3 } and the outgroup is { child4, child6 }. This rule for assigning “blue” can persist, potentially through multiple tree topologies. In each case, we have to determine whether the topology is consistent with the split, and if so, how to apply the attribute.
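One possible reading of the split rule, sketched in Python under some assumptions: trees are rooted and written as nested tuples, a topology counts as consistent with the split if some clade contains the whole ingroup and none of the outgroup, and the attribute is applied to the smallest such clade. The function names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> List[FrozenSet[str]]:
    """Return the leaf set under every node of a nested-tuple tree (post-order)."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found: List[FrozenSet[str]] = []
    leaves: FrozenSet[str] = frozenset()
    for child in tree:
        sub = clades(child)
        found.extend(sub)
        leaves |= sub[-1]  # the last entry is the child's own leaf set
    found.append(leaves)
    return found


def clade_for_split(tree: Tree, ingroup: Set[str],
                    outgroup: Set[str]) -> Optional[FrozenSet[str]]:
    """Smallest clade with all of the ingroup and none of the outgroup, else None."""
    matches = [c for c in clades(tree) if ingroup <= c and not (outgroup & c)]
    return min(matches, key=len) if matches else None


ingroup, outgroup = {"child1", "child3"}, {"child4", "child6"}
t0 = (("child1", "child2", "child3"), ("child4", "child5", "child6"))
t1 = (("child1", "child5", "child3"), ("child4", "child2", "child6"))
conflicting = (("child1", "child4"), ("child3", "child6"))

print(sorted(clade_for_split(t0, ingroup, outgroup)))   # ['child1', 'child2', 'child3']
print(sorted(clade_for_split(t1, ingroup, outgroup)))   # ['child1', 'child3', 'child5']
print(clade_for_split(conflicting, ingroup, outgroup))  # None
```

In the first two topologies "blue" lands on a clade whose membership differs, while the rule itself is unchanged; in the third the split is not displayed, so the rule gives no target.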

We could back this up a further step— staying even closer to the evidence— and simply specify the method and evidence used in the research publication that attributes “blue” to clade1. We could say that “blue” is assigned by parsimony based on a particular distribution, e.g., ((((child1:blue,child2:blue),child4:purple),child5:red),child6:green). And again, we need a set of rules to know how to apply this when the topology is updated and when new members are added.
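As an illustration only, the parsimony step could be sketched with the down-pass of Fitch's algorithm. That choice is an assumption, since the comment says only "parsimony". The tree and tip states are the ones in the example string above; the function name is invented.

```python
from typing import Dict, FrozenSet, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_states(tree: Tree, tip_state: Dict[str, str]) -> FrozenSet[str]:
    """Fitch down-pass: candidate state set for the root of a binary tree."""
    if isinstance(tree, str):
        return frozenset([tip_state[tree]])
    left, right = (fitch_states(child, tip_state) for child in tree)
    shared = left & right
    # Intersection if the children agree on something, otherwise the union.
    return shared if shared else left | right


tip_state = {"child1": "blue", "child2": "blue", "child4": "purple",
             "child5": "red", "child6": "green"}
inner = ("child1", "child2")
whole = (((inner, "child4"), "child5"), "child6")

print(sorted(fitch_states(inner, tip_state)))  # ['blue']
print(sorted(fitch_states(whole, tip_state)))  # ['blue', 'green', 'purple', 'red']
```

Under this sketch only the (child1, child2) node is unambiguously "blue"; every node above it gets an ambiguous state set, which is the kind of case the rules would have to handle when the topology or membership changes.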

I don’t think there is any way to avoid the need to implement some kind of complex rule-based system, where the rules are based on phylogenetic logic.

Arlin

Suppose that the PhyloCode node-based identity of 0.clade1 is set as the most recent common ancestor of 0.clade1_child1 and 0.clade1_child3.

Similarly, the identity of 0.clade2 is set as the 0.clade2_child4 and 0.clade2_child6 intersection (node).

At time = 1, new evidence/interpretation indicates that the respective phylogenetic positions of child2 and child5 should be "inverted"; so we obtain:

1.clade1 with children: 1.clade1_child1, 1.clade1_child5, 1.clade1_child3.

1.clade2 with children: 1.clade2_child4, 1.clade2_child2, 1.clade2_child6.

I believe under the PhyloCode application the clade definitions do not change. But we do have two taxonomies here at t = 0 versus t = 1 whose taxonomic content appears non-congruent at a more granular level. We could even presume that the stated synapomorphies of 0.clade1/1.clade1 and 0.clade2/1.clade2 are "the same" (= identical text strings; I would prefer to say they have congruent intensions). In the eyes of time = 1, the properties of child2 and child5 at time = 0 had been misdescribed.
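The t = 0 versus t = 1 example can be replayed in a few lines of Python. This is a sketch with assumptions: each tree is a child-to-parent map, the time prefixes are dropped from the labels, and `root` is an invented label for the node joining the two clades.

```python
from typing import Dict, List, Set


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """The node followed by its ancestors, ending at the root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def mrca(a: str, b: str, parent: Dict[str, str]) -> str:
    """Most recent common ancestor of two nodes."""
    seen = set(path_to_root(a, parent))
    return next(n for n in path_to_root(b, parent) if n in seen)


def tips_under(node: str, parent: Dict[str, str]) -> Set[str]:
    """All tips (nodes that are nobody's parent) descending from a node."""
    internal = set(parent.values())
    return {t for t in parent
            if t not in internal and node in path_to_root(t, parent)}


t0 = {"child1": "clade1", "child2": "clade1", "child3": "clade1",
      "child4": "clade2", "child5": "clade2", "child6": "clade2",
      "clade1": "root", "clade2": "root"}
# At t = 1 the positions of child2 and child5 are swapped.
t1 = dict(t0, child2="clade2", child5="clade1")

for tree in (t0, t1):
    node = mrca("child1", "child3", tree)  # same node-based definition each time
    print(node, sorted(tips_under(node, tree)))
# clade1 ['child1', 'child2', 'child3']
# clade1 ['child1', 'child3', 'child5']
```

The definition (the MRCA of child1 and child3) resolves in both trees, but the set of tips it picks out is not the same, which is the non-congruence described above.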

Best, Nico

On Sat, Nov 1, 2014 at 1:36 PM, Matt notifications@github.com wrote:

" It would help me if I could see a list of use cases, i.e. actual claims that one would want to express and store." <- Yes, this is likely the only way to really resolve this.

IMO there are 2 worlds. 1) clades are asserted to exist vs. 2) clades are calculated from data. IMO OT is completely within 1.

When you seek to persist clades/nodes across OTs, then you must ask, are you in world 1), or world 2)?

For argument's sake I claim that unless you begin to calculate on data (or, more broadly, annotations on nodes), you will always be in 1). Regardless of who states what about concept T at time X, if you can't recalculate based on the data, you're stuck with an assertion. This was the basis of my original observation in this thread, i.e. what then can you do that is not doable as defined in the PhyloCode (or maybe the PhyloCode doesn't work, but let's assume it does)?

If OT agrees to happily exist in world 1) (which is just fine), then there are many things that are easily done without overcomplicating things. Entomologists don't think to themselves, "I've got this nagging doubt, humans just might be insects!". For many (all?) practical purposes they never, ever have to do this. They do real work, on a daily basis, without ever worrying about the definition of insects expanding to include humans. If they are doing an insect phylogeny they also don't have to worry about birds, lizards, fish, or squirrels. This suggests to me that there are real clades, that can be represented as nodes, in the OT, and that these can persist across versions. Do we need a robust logical framework for this level of assertion/claim? I love the idea, but maybe it's over-engineering at some level.

M



Arlin Stoltzfus (arlin@umd.edu) Research Biologist, NIST; Fellow, IBBR; Adj. Assoc. Prof., UMCP IBBR, 9600 Gudelsky Drive, Rockville, MD, 20850 tel: 240 314 6208; web: www.molevol.org

nfranz commented 10 years ago

Thanks, all, I am trying to keep up the momentum (as time permits).

I thought Jonathan's crustacean/mollusc example is neat. Possibly neat because it lays bare one's intuitions (should they exist) that identifiers ought to be able - to a degree - to do the following work for us:

  1. Parse out (syntactically?) new information elements entering the pre-existing OTT environment (= expand the database infrastructure in a bit-level sense), and quite finely so.
  2. Reflect identity in name (taxonomic/clade).
  3. Express, to a decent degree of resolution, taxonomic/phylogenetic equivalence, and the lack thereof.
  4. Maybe even - express, to some degree, how much has "really changed", and whether it "matters".

    I am not trying to set up a straw issue. I assume we will largely agree that, as a whole, this is asking too much from a single set of identifiers.

    But I do think that each of the above functions is tied to legitimate, or at least worth-considering, expectations. I wrote a little bit about Jonathan's example here: http://taxonbytes.org/taxonomic-concept-identification-reconciliation-open-tree-life-part-1/

    Intersecting with this issue for me is the question about the "right kind of logic". I tend to think the glass remains largely empty here. In particular, I personally do not find it readily obvious that we should have, in the context of evolving taxon/character concept hypotheses, a logic system implementation that stipulates the referent of a class or predicate as being constant in all possible worlds. There is an exchange about the OBO way of doing things by Smith and Merrill, alluded to here: http://www.applied-ontology.org/ontologicalrealism/ I tend to be in the Merrill camp; very stenographically -- domain needs require domain-specific conceptualizations of "identity". Bottom line, OT may well require new logic/representation development and implementation - I for one can't say for sure that it won't. If one tried to feed Jonathan's example into a standard Pizza-type ontology, I believe it would break the consistency for the reasoner.

    I think that leaves at least one more issue on the table - how can we express "residual congruence". This relates to Arlin's example of blue versus purple standing in for an overarching kind of evidence/partition, under which incongruent sets of taxon concepts can be variously accommodated without losing the sense of continuity. Working on this a little too (but nothing ready for showing yet). Typically this means, from a perspective of the kind of computational logic I am most familiar with, that a "coverage constraint" must be relaxed. Slides 117-131 here ( http://www.slideshare.net/taxonbytes/franz-2014-explaining-taxonomys-legacy-to-computers-how-and-why) illustrate the general point; one could substitute "PcarPeve_IC" for "male terminalia configured in a certain synapomorphic way"; and at a higher level 2006.PER and 2001.PER are congruent in spite of having non-congruent sets of children.

    Intuitively, we tend to think that property-centered definitions are more stable at higher levels. Annotations can likely bear that out. It doesn't necessarily follow (of course) that property-referencing concepts behave "fundamentally differently" from taxon concepts across OTT revisions.

Best, Nico

nfranz commented 10 years ago

http://taxonbytes.org/esa-2014-presentation-aligning-insect-phylogenies-perelleschus-and-other-cases/