CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Support Newick format #1060

Open mdoering opened 3 years ago

mdoering commented 3 years ago

Support for the Newick format in imports and exports seems a useful thing to do, especially for phylogenetic data.

thomasstjerne commented 3 years ago

Could the Newick export maybe include an additional tree with only taxon IDs as labels, instead of only names?

Background: The Newick format is somewhat limited with regard to how much metadata a node can contain, and there are various 'hacks' to include more data per node, like suffixing the ID to the underscore-delimited taxon label, as in Gavialis_gangeticus_3FFQ3. To build visualizations with proper links and maybe paging of children etc., a tree with only taxon IDs is a good solution. An additional CSV of labels (taxon names) could maybe be provided alongside, but it could also just be fetched from the API.
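To make the three label variants concrete, here is a minimal Python sketch that serialises one toy tree with ID-only labels, name-only labels, and names suffixed by IDs, plus a companion CSV of labels. The tree structure, the IDs other than 3FFQ3, and the helper names `to_newick` and `labels_csv` are all hypothetical and do not reflect how ChecklistBank implements its exports.

```python
import csv
import io

# Hypothetical toy tree: (taxon_id, name, children).
# Only the ID 3FFQ3 comes from the discussion; the others are made up.
TREE = ("CROC", "Crocodylia", [
    ("3FFQ3", "Gavialis gangeticus", []),
    ("ALLI", "Alligator mississippiensis", []),
])


def to_newick(node, label):
    """Serialise a node recursively, using label(taxon_id, name) as node label."""
    taxon_id, name, children = node
    text = label(taxon_id, name)
    if children:
        inner = ",".join(to_newick(child, label) for child in children)
        return f"({inner}){text}"
    return text


def labels_csv(node):
    """Return an id,name CSV that can accompany the ID-only tree."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "name"])

    def walk(n):
        writer.writerow([n[0], n[1]])
        for child in n[2]:
            walk(child)

    walk(node)
    return buf.getvalue()


def underscored(name):
    # Unquoted Newick labels conventionally use underscores for blanks.
    return name.replace(" ", "_")


id_tree = to_newick(TREE, lambda i, n: i) + ";"
name_tree = to_newick(TREE, lambda i, n: underscored(n)) + ";"
suffixed_tree = to_newick(TREE, lambda i, n: f"{underscored(n)}_{i}") + ";"

print(id_tree)
print(name_tree)
print(suffixed_tree)
print(labels_csv(TREE))
```

The ID-only tree stays compact and unambiguous for linking, while the CSV (or an API lookup) carries the display names.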

mdoering commented 3 years ago

That's mostly easy to do, and since we implement the extended Newick, this allows us to share more metadata per node. We can even include custom keys. I just see that exports use the simple Newick; I am changing this to extended, which includes:

Otherwise the label/tag is documented to be the name of this node/clade and maps to <name>(<clade>) in PhyloXML. This suggests to me the current behavior is better than using the taxonID?

For linking, the use of the ND tag seems best; a problem might be escaping. Labels in general cannot contain any of the following reserved chars: ()[],:; all of which we replace with underscores. I am a bit unclear whether these reserved chars are allowed inside the extended metadata, especially when single quoting is used.
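As a rough illustration of the replacement rule and the ND tag, the Python sketch below replaces each reserved character with an underscore and attaches the taxon ID in an NHX-style comment. The function names `sanitize` and `nhx_node` are hypothetical, and applying the same replacement to the metadata value is a conservative assumption, not a statement about what the exporter actually does.

```python
import re

# The reserved characters listed above: ( ) [ ] , : ;
RESERVED = re.compile(r"[()\[\],:;]")


def sanitize(text):
    """Replace every reserved Newick character with an underscore."""
    return RESERVED.sub("_", text)


def nhx_node(name, taxon_id):
    """Build one node: sanitised, underscore-delimited label plus an NHX ND tag.

    Assumption: the metadata value is sanitised the same way as the label,
    in case parsers treat reserved characters inside comments as reserved too.
    """
    label = sanitize(name).replace(" ", "_")
    return f"{label}[&&NHX:ND={sanitize(taxon_id)}]"


print(sanitize("Aus bus (L., 1758)"))
print(nhx_node("Gavialis gangeticus", "3FFQ3"))
```

Note that the replacement is lossy: a name containing reserved characters cannot be reconstructed from the exported label, which is one more argument for carrying a stable ID per node.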

Interestingly the default example in this viewer resolves NCBI taxon ids used as labels: http://etetoolkit.org/treeview/

mdoering commented 3 years ago

From testing with various parsers and tree visualisers the reserved chars are also all reserved inside the comments and thus the extended metadata...

thomasstjerne commented 3 years ago

> Interestingly the default example in this viewer resolves NCBI taxon ids used as labels: http://etetoolkit.org/treeview/

Yes, and there are no taxon names / labels in the example. These are probably given in a .tsv file or retrieved from an API. In the OTL download, both versions are given, as labelled_supertree.tre (only node IDs) and labelled_supertree_ottnames.tre (taxon names suffixed by node IDs). For the OTL import to CLB, I unsuccessfully goofed around with the tree including taxon names for a while before I switched to the ID-only version, assisted by the appropriate version of the OTL Taxonomy TSV.

mdoering commented 3 years ago

Supporting both versions is not great. We could do that, but how would imports work? Not that they are implemented yet, but I think we should be able to eat what we produce.