biocore / empress

A fast and scalable phylogenetic tree viewer for microbiome data analysis
BSD 3-Clause "New" or "Revised" License

Respect ancestors in feature metadata coloring/propagation #473

Closed fedarko closed 3 years ago

fedarko commented 3 years ago

I brought this up in https://github.com/biocore/empress/issues/471#issuecomment-749300846, but for the sake of everyone's sanity this should probably be its own issue. This isn't a very pressing issue (the only common case where it should come up is stuff like s__ for Level 7), but there are corner cases where real taxa at the same level can have the same name but different ancestors. In these cases, we should assign these values distinct colors and not treat them as "equal" when doing propagation.

As a sidenote: this discussion brings up the mildly wonky point that, currently, EMPress treats each feature metadata field (including the various levels of taxonomy) as its own independent thing, ignoring other metadata fields. This means that, for example, if you color by Level 7 (species) in a 16S dataset using the default QIIME color map, you'll probably see a lot of clades of the tree colored red because all of the tips in the clade share a species classification of s__, even if they're from different genera/families/etc:

[image: yike]

Addressing this would definitely be possible, for example by representing the values in each Level N string as the full taxonomy to that point (e.g. setting Level 7 to k__Bacteria; p__Firmicutes; c__Somecoolclass; o__Ogeezimrunningoutoftaxonomynamesiknow; f__Isanyonereadingthis; g__Himom; s__ instead of just s__) -- in some ways this is similar to a point @antgonza raised a few weeks ago in #422.
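
A minimal sketch of the "full taxonomy to that point" idea, in plain Python. This is only an illustration, not EMPress' actual splitting code: the function name `cumulative_levels` and the assumption that levels are separated by semicolons are mine.

```python
def cumulative_levels(taxonomy, sep=";"):
    """Return one value per level, each holding the full lineage up to that level.

    Hypothetical helper for illustration; not part of EMPress.
    """
    # Split on the separator and trim surrounding whitespace from each level
    parts = [p.strip() for p in taxonomy.split(sep)]
    # Level i gets every level from the root down to (and including) level i
    return ["; ".join(parts[: i + 1]) for i in range(len(parts))]


a = cumulative_levels("k__Bacteria; g__Streptococcus; s__")
b = cumulative_levels("k__Bacteria; g__Lactobacillus; s__")
print(a[-1])  # k__Bacteria; g__Streptococcus; s__
print(b[-1])  # k__Bacteria; g__Lactobacillus; s__
```

With values like these, the two s__ tips no longer compare as equal, so they would get different colors and would not be merged during propagation. The cost is that every value at every level gets longer, even the ones that were never ambiguous.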

ElDeveloper commented 3 years ago

Hierarchical metadata should certainly be treated differently. This makes sense, although I think we'll need to "special case" this to columns labeled as Level *, right? And also be very clear to the user about what's going on.

fedarko commented 3 years ago

Agreed. What should help a bit with the special casing is that the Level * columns are created by EMPress' Python code (in this module), so we should be able to pass some information about these columns specifically to the JS code saying something like "Hey this is a taxonomy metadata column, look at its ancestors". (I don't think it will be feasible to detect already-split-up taxonomy columns and try to disambiguate them -- that sounds like it would be challenging and prone to errors.)

One solution that should work, with the added benefit of only modifying the Python code: we could automatically go through the split-up taxonomy (produced by the code linked above) and identify all of the ambiguous values in each level (so both the s__ cases in Level 7, and also stuff like g__ in Level 6, Unspecified at any level, etc.). We should then be able to add text to these values (from the higher levels) to disambiguate them as needed: so, for example, for Level 7 this would probably involve prefixing each s__ with the respective genus label, producing things like g__Streptococcus; s__ as the metadata values where needed. This would cause distinct colors to be assigned to each taxonomy value, as well as prevent ambiguous values from being treated as equal during propagation. (This would increase the metadata size somewhat, but speedups like #337 should offset this at least partially.)
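
Roughly, the approach described above could look like the sketch below. It is not the actual EMPress implementation: `disambiguate_levels` is a hypothetical name, the input is assumed to be a list of already-split lineages that all have the same number of levels (e.g. padded with Unspecified), and "ambiguous" is taken to mean "the same value shows up under more than one parent".

```python
def disambiguate_levels(lineages):
    """Prefix ambiguous level values with their parent's label.

    lineages: list of equal-length lists, one list of level values per feature.
    Hypothetical helper for illustration; not part of EMPress.
    """
    result = [list(row) for row in lineages]  # copy so the input is untouched
    if not result:
        return result
    num_levels = len(result[0])
    # Go top-down so that a parent is already disambiguated when it is used
    for level in range(1, num_levels):
        # Map each value at this level to the set of parents it appears under
        parents_by_value = {}
        for row in result:
            parents_by_value.setdefault(row[level], set()).add(row[level - 1])
        for row in result:
            # More than one distinct parent means the value is ambiguous
            if len(parents_by_value[row[level]]) > 1:
                row[level] = f"{row[level - 1]}; {row[level]}"
    return result


rows = [
    ["g__Streptococcus", "s__"],
    ["g__Lactobacillus", "s__"],
    ["g__Streptococcus", "s__mutans"],
]
for row in disambiguate_levels(rows):
    print(row)
# ['g__Streptococcus', 'g__Streptococcus; s__']
# ['g__Lactobacillus', 'g__Lactobacillus; s__']
# ['g__Streptococcus', 's__mutans']
```

Only values that actually collide get a prefix, so unambiguous values like s__mutans stay short. Because the loop works top-down, a chain of ambiguous levels (say g__ under two families, each with an s__ below it) picks up as many ancestors as it needs, e.g. f__A; g__; s__.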