Barplots in the circular / rectangular layout

fedarko commented 4 years ago

This was mentioned in #97 (which has since been closed, since the focus of that was on the circular layout).

Now that the circular layout is implemented and tested, supporting visualizing tip-level feature metadata as barplots would be a really cool feature to add. This could be useful for a few different types of feature metadata, ranging from Songbird/ALDEx2/... differentials (or other "importance scores") to taxonomy annotation confidence values, etc.

fedarko commented 4 years ago

Also, it'd be cool to optionally support visualizing information passed over from Emperor as barplots -- it could be really useful to see e.g. presence information as tip-level information, while maintaining previous coloring of the tree (e.g. by feature metadata). Biologically, this would be a way of showing what particular taxa are unique to which groups of selected samples, or something along those lines.

fedarko commented 4 years ago

From doing some planning, I think there are three types of barplots that would be good to work on supporting (and potentially more if requested):

1. Assign each tip a bar of fixed length, and alternate the colors of the bars based on a feature metadata field. These could be either categorical colors (e.g. taxonomy annotations) or quantitative colors (e.g. Songbird/ALDEx2/etc. differential values, other types of feature importance scores as suggested by @shihuang047, etc.).

   Example: The "Host Class" ring in Fig. 1 of Song/Sanders et al.

2. Assign each tip a bar of fixed color, and alternate the lengths of the bars based on a (quantitative) feature metadata field.

   Example: The relative abundance barplots in Fig. 2A of Baker et al. (not exactly comparable b/c this barplot has more than one category, but the same general idea)

3. Assign each tip a bar of fixed length, and draw a stacked barplot based on this tip's sample presence information for a selected sample metadata field. (To give an idea of what this would look like, for "body site" in the moving pictures dataset, tips unique to gut samples would have a completely red bar; tips split 50/50 between left and right palm samples would have a half blue / half orange bar; and so on.)

   Example: The "Diet" ring in Fig. 1 of Song/Sanders et al., see above
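To make the first two barplot types concrete, here is a rough sketch of how per-tip colors and lengths could be derived from feature metadata. This is only an illustration: the function names (`bar_colors`, `bar_lengths`), the `max_length` parameter, and the dict-based inputs are all made up for this example and are not part of Empress.

```python
def bar_colors(categories, palette):
    """Type 1: fixed-length bars, color chosen per category.

    categories: dict mapping tip name -> categorical metadata value
    palette: list of color strings; reused cyclically if there are
             more categories than colors (an assumption of this sketch)
    """
    ordered = sorted(set(categories.values()))
    color_of = {cat: palette[i % len(palette)] for i, cat in enumerate(ordered)}
    return {tip: color_of[cat] for tip, cat in categories.items()}


def bar_lengths(values, max_length=100.0):
    """Type 2: fixed-color bars, length scaled from a quantitative field.

    Uses linear min-max scaling, so the smallest value gets length 0
    and the largest gets max_length.
    """
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        # All tips share one value; give every tip the full length.
        return {tip: max_length for tip in values}
    return {tip: (v - lo) / span * max_length for tip, v in values.items()}


colors = bar_colors(
    {"tip_a": "Mammalia", "tip_b": "Aves", "tip_c": "Mammalia"},
    ["#ff0000", "#0000ff"],
)
lengths = bar_lengths({"tip_a": -1.0, "tip_b": 0.0, "tip_c": 1.0})
```

Min-max scaling is just one option. For signed values like differentials, it maps the most negative value to a zero-length bar, so a different scaling (or a diverging color map, as in type 1) may be a better fit there.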

I imagine these are ranked roughly in order of how useful they'll be (maybe 3 and 2 could be switched around, though). So IMO it makes sense to start with the first type of barplot. (Happily, I think this will also be the easiest of the three to implement :)

Other considerations

- We would ideally allow for users to select multiple "layers" of barplots, which would allow for intricate displays as shown in the Song/Sanders et al. tree above.
- Barplots should work with either circular or rectangular layouts, since both of these guarantee that tips will be allocated some space to themselves in a consistent way (... if that makes sense, there's probably a more elegant way to phrase that).
  - That being said, it might be best to start off with implementing these for the circular layout first, since most of the figures with barplots I've seen use a circular layout.
- All of the figures above (and probably like 95% of the tree figures I've seen while working in bioinformatics, let's be real) use iTOL, so we should of course cite iTOL in the code, paper, etc. as the inspiration for this functionality.
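On the point about both layouts giving each tip its own space: a rough sketch of what the bar geometry could look like in each layout is below. Everything here is hypothetical (function names, parameters, and the choice of a four-corner polygon); it only shows that a tip's y-coordinate (rectangular) or angle (circular), plus a per-tip half-width, is enough to place a bar.

```python
import math


def rectangular_bar(y, half_height, x_start, length):
    """Corners of a bar for a tip at vertical position y.

    The bar starts at x_start (e.g. just right of the furthest tip)
    and extends `length` units to the right.
    """
    x_end = x_start + length
    return [
        (x_start, y - half_height),
        (x_end, y - half_height),
        (x_end, y + half_height),
        (x_start, y + half_height),
    ]


def circular_bar(theta, half_width, inner_radius, length):
    """Corners of a bar for a tip at angle theta (radians).

    The bar spans angles theta +/- half_width and radii from
    inner_radius to inner_radius + length. Straight edges between
    the corners only approximate the arcs, which is reasonable when
    half_width is small (i.e. many tips).
    """
    outer_radius = inner_radius + length
    a0, a1 = theta - half_width, theta + half_width
    return [
        (inner_radius * math.cos(a0), inner_radius * math.sin(a0)),
        (outer_radius * math.cos(a0), outer_radius * math.sin(a0)),
        (outer_radius * math.cos(a1), outer_radius * math.sin(a1)),
        (inner_radius * math.cos(a1), inner_radius * math.sin(a1)),
    ]


rect = rectangular_bar(y=2.0, half_height=0.5, x_start=10.0, length=3.0)
wedge = circular_bar(theta=0.0, half_width=math.pi / 2, inner_radius=1.0, length=1.0)
```

Multiple "layers" would then just be a matter of starting the next bar where the previous one ends, i.e. using the previous layer's outer radius (or `x_end`) as the next `inner_radius` (or `x_start`).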

ElDeveloper commented 4 years ago

Thanks for breaking this down @fedarko, very helpful. After thinking about this for a little bit, here are some thoughts. I had to think of it in terms of features and samples:

Feature metadata bars:
- Length defined by feature metadata variable or default size if unspecified.
- Color defined by feature metadata variable or default color - this would optionally support continuous color maps.
Sample metadata bars:
- For a categorical variable, show stacked bar chart of prevalence across samples for a metadata variable. For example percent healthy vs sick samples with each feature (we do this in the qemistree preprint - see figure 3).
- For a continuous variable, show the average value across samples for a metadata variable. For example average pH per feature. In this case color and height can be fixed (initially) but should be able to be defined by other metadata variables.
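As an illustration of the two sample metadata cases, here is a small sketch. It assumes we already know which samples each feature is present in; the names `presence_fractions` and `mean_per_tip` and the plain-dict inputs are hypothetical, not an existing API.

```python
from collections import Counter


def presence_fractions(tip_samples, sample_category):
    """Categorical case: stacked bar fractions per tip.

    tip_samples: dict mapping tip -> list of sample IDs containing it
    sample_category: dict mapping sample ID -> category (e.g. body site)
    Returns tip -> {category: fraction of that tip's samples}.
    """
    fractions = {}
    for tip, samples in tip_samples.items():
        counts = Counter(sample_category[s] for s in samples)
        total = sum(counts.values())
        # Tips present in no samples get an empty stack.
        fractions[tip] = {c: n / total for c, n in counts.items()} if total else {}
    return fractions


def mean_per_tip(tip_samples, sample_value):
    """Continuous case: average metadata value over a tip's samples.

    Tips present in no samples are omitted from the result.
    """
    means = {}
    for tip, samples in tip_samples.items():
        vals = [sample_value[s] for s in samples]
        if vals:
            means[tip] = sum(vals) / len(vals)
    return means


tip_samples = {"tip_a": ["s1", "s2"], "tip_b": ["s3", "s4"]}
body_site = {"s1": "gut", "s2": "gut", "s3": "left palm", "s4": "right palm"}
ph = {"s1": 6.0, "s2": 8.0, "s3": 5.0, "s4": 5.0}

stacks = presence_fractions(tip_samples, body_site)
avg_ph = mean_per_tip(tip_samples, ph)
```

Here `tip_a` would get a single-color bar (all gut) and `tip_b` a half/half bar, matching the body site example earlier in the thread.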

For drawing the bars, I think using shaders will be the most performant solution. I think addressing #214 should help us get started.

In both cases it sounds like we should allow multiple rings of information. In any case, I agree that we should start with the case that's easier to implement and move on from there.

I agree 🎩-tip to iTOL and other tools like Anvio, ggtree, FigTree, Topiary Explorer, and so many more ✨

biocore / empress

Barplots in the circular / rectangular layout #201
