CatalogueOfLife / coldp

Add tree branch length for phylogenies #55

Closed mdoering closed 3 years ago

mdoering commented 3 years ago

In order to support phylogenetic trees the main missing property is the branch or edge length of trees. It is called length in the Newick format:

Length: the length of a tree edge.

PhyloXML calls it branch_length.

I suggest adding a new field parentLength to the Taxon entity to indicate the length of the edge between a child and its parent node. It will take a number (float).

The ability to indicate a unit of the length is likely important and should be added to the metadata.yaml as a new property. parentLengthUnit might be a good matching term.
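To make the proposal concrete, here is a minimal sketch of how Newick edge lengths could land in a per-taxon parentLength column. It is an illustration under assumptions, not the ColDP importer: the function name newick_to_rows, the generated n1, n2, ... identifiers, and the ID / parentID / scientificName column names are chosen for this example, and the parser only handles unquoted labels without Newick comments.

```python
def newick_to_rows(newick):
    """Parse a simple Newick string into one row per node.

    Each row carries the node's own label and the length of the edge
    to its parent (the proposed parentLength field). Hypothetical helper:
    no quoted labels, no [comments], generated IDs.
    """
    text = newick.strip().rstrip(";")
    rows = []
    pos = 0
    counter = 0

    def parse_node(parent_id):
        nonlocal pos, counter
        counter += 1
        node_id = f"n{counter}"
        row = {"ID": node_id, "parentID": parent_id,
               "scientificName": "", "parentLength": None}
        rows.append(row)

        # An opening parenthesis starts the list of child nodes.
        if pos < len(text) and text[pos] == "(":
            pos += 1
            while True:
                parse_node(node_id)
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at {pos}")

        # Optional node label.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        row["scientificName"] = text[start:pos].strip()

        # Optional ":length", the edge between this node and its parent.
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            row["parentLength"] = float(text[start:pos])

    parse_node("")
    return rows


for row in newick_to_rows("(A:0.1,B:0.2)C:0.3;"):
    print(row)
```

The unit of these numbers is not part of the Newick string, which is why a dataset-level property such as parentLengthUnit in metadata.yaml would still be needed.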

mjy commented 3 years ago

This will have very little meaning unless it's broken out (many studies, many lengths), and linkable to a phylogenetic method. Perhaps a better solution would be to make a robust cross-link to OToL?

mdoering commented 3 years ago

One reason to have this is being able to import oToL

mjy commented 3 years ago

Still seems backwards. Would you ask oToL to import CoLDP data so they can export to CoLDP?

Edit- meh, sorry, I read "import" as "export to".

mdoering commented 3 years ago

Why does this have little meaning? Neither Newick, PhyloXML nor NEXUS embeds information on methods. That's a separate thing for which we should use metadata.

mjy commented 3 years ago

Perhaps I'm not understanding the values that will go in this field?

mjy commented 3 years ago

I'm suggesting that passing along something like '0.22' without interpretation of a) how that value was calculated, and b) the actual units of that number, is basically pointless; no two datasets can be confidently compared, can they?

mdoering commented 3 years ago

They can't, but should they? You can still generate trees for them and they make sense on their own. Plus the metadata should tell you how these numbers were created and can point to further publications. I don't see an issue.

mjy commented 3 years ago

I'm inferring that the primary use case is "I want to draw trees on CoL"?

But maybe I missed how the metadata is linked. If you're implying a general list of references with Taxon then you're asking the user to go do their own research: it's somewhere in here. That's fine, there isn't much argument against including any field you want if that is the case.

mdoering commented 3 years ago

branchLength might be a better name, as it's widely known. Docs would explain it is the branch to the parent.
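As a rough illustration of "the branch to the parent", the sketch below rebuilds a Newick string from taxon rows, writing each row's branchLength after that row's own label. The function name rows_to_newick and the ID / parentID / scientificName column names are assumptions made for this example, and labels are written unquoted, so names containing spaces or punctuation would need escaping in real use.

```python
from collections import defaultdict


def rows_to_newick(rows):
    """Serialise taxon rows to Newick (hypothetical helper).

    A row's branchLength is emitted as ':length' after that row's label,
    i.e. it describes the edge from the node up to its parent.
    """
    by_id = {}
    children = defaultdict(list)
    for row in rows:
        by_id[row["ID"]] = row
        children[row.get("parentID") or ""].append(row["ID"])

    def render(node_id):
        row = by_id[node_id]
        out = ""
        kids = children.get(node_id, [])
        if kids:
            out = "(" + ",".join(render(kid) for kid in kids) + ")"
        out += row.get("scientificName", "")
        length = row.get("branchLength")
        if length not in (None, ""):
            out += f":{float(length):g}"
        return out

    roots = children.get("", [])
    if len(roots) != 1:
        raise ValueError("expected exactly one root row")
    return render(roots[0]) + ";"


rows = [
    {"ID": "1", "parentID": "", "scientificName": "C", "branchLength": ""},
    {"ID": "2", "parentID": "1", "scientificName": "A", "branchLength": "0.1"},
    {"ID": "3", "parentID": "1", "scientificName": "B", "branchLength": "0.2"},
]
print(rows_to_newick(rows))  # prints (A:0.1,B:0.2)C;
```

The root row has no parent, so its branchLength is normally left empty.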

thomasstjerne commented 3 years ago

+1 for branchLength. Currently OTL leaves out branch lengths in the synthesized tree because they are hard to standardize across the studies. But my understanding is they are working on this for a future release.

mjy commented 3 years ago

"Currently OTL leaves out branch lengths in the synthesized tree because they are hard to standardize across the studies. But my understanding is they are working on this for a future release."

Might this suggest that the issues I allude to are real, and that CoL should not just jump in and add this field?

dhobern commented 3 years ago

@mjy - I think the issues are clear and well understood. However, it certainly makes sense for COL ChecklistBank to accommodate branch length where included for any datasets it imports. These lengths will only be meaningful in the context of that dataset, but it will be much easier to get them early rather than expect data publishers to add them later.

mjy commented 3 years ago

@dhobern Having published a re-analysis of a phylogenetic study trying to replicate the author's result, even given the full data-set in supplementary material, I feel I understand the nuances of trying to interpret this type of data. Executing that study was a nightmare; I can't imagine what use taking a single data-type from a tossed-in field provides. If you don't have metadata (what tree, there are many in most publications, what analysis method, what version of the software was used, where is the data-set that was used to calculate the tree lengths, what is the model that was used) you're not helping anyone do any meaningful analysis downstream. If you do have that data... then maybe you are a future Open Tree.

So, I guess we'll have to respectfully disagree on this one.

dhobern commented 3 years ago

Thanks @mjy - I completely agree that we should not be seduced into thinking this supports downstream analysis without at least also focusing on a number of metadata aspects. I'm only seeing this as a way to preserve and visualise the outputs from the source studies to visualise their basis for asserting a hierarchy. All further interpretation would be via redirection to the source. This is not directly to support meta-analysis or synthesis. In a sense it's just metadata to justify the supplied list. We should require at least enough additional metadata to locate the source analysis.

mdoering commented 3 years ago

And isn't branchLength just another result of a study, like the tree itself? ChecklistBank (CLB) is not COL. This is probably a common misconception. CLB primarily just tries to host datasets and represent them faithfully, but in a standardised view. And branchLength seems like an important feature of phylogenetic datasets we should preserve. It will not make it into COL in any way.

mjy commented 3 years ago

@mdoering the repo is named CatalogueOfLife/coldp. Sorry for my confusion reflecting the generalization to checklist bank. I suspect it will take quite a while for everyone to understand the broader borgification going on, including those like myself who are more aware.