biocore / qurro

Visualize differentially ranked features (taxa, metabolites, ...) and their log-ratios across samples
https://biocore.github.io/qurro
BSD 3-Clause "New" or "Revised" License

The output index.html is empty #251

Closed Jigyasa3 closed 4 years ago

Jigyasa3 commented 4 years ago

Hey!

I am following the Qurro tutorial and using Qurro as a standalone tool. I generated the ordination.txt file with DEICODE. But when I run the qurro command, the output index.html doesn't load anything, and I cannot change the categories or the x-axis.

$ module load python/3.7.3
$ qurro --ranks ordination.txt --table ../wood_kalo.biom --sample-metadata ../sample_metadata_kalotermitidae.tsv --output-dir qurro-plot

202 feature(s) in the BIOM table were not present in the feature rankings. These feature(s) have been removed from the visualization.
Successfully generated a visualization in the folder qurro-plot.

Screenshot of the index.html file: [image]

fedarko commented 4 years ago

Looking at the browser URL, you need to keep the index.html file in the folder Qurro produced (qurro-plot, in your case). You can move that folder around, but you can't take the index.html file out of it.

The reason for this is that the index.html file on its own is just an HTML file; all of the actual "data" and code for the visualization is stored in other files in the output directory, so just taking something out of that directory will cause problems. (This is also why the HTML file looks kind of gross in the screenshot -- it normally loads a CSS file that makes the page look pretty, but since it can't find that CSS file the page remains mostly unformatted.)
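A minimal sketch of the point above (this is not part of Qurro): to relocate a visualization, copy or move the whole output folder rather than index.html alone. The helper name `copy_visualization` and the file name `style.css` are placeholders for illustration and are not taken from Qurro's actual output.

```python
import shutil
import tempfile
from pathlib import Path


def copy_visualization(output_dir, destination):
    """Copy an entire visualization folder so index.html keeps its sibling files."""
    output_dir = Path(output_dir)
    destination = Path(destination)
    if not (output_dir / "index.html").is_file():
        raise FileNotFoundError(f"no index.html in {output_dir}")
    # copytree copies every file and subfolder, not just index.html
    shutil.copytree(output_dir, destination)
    return destination / "index.html"


# Demo with a stand-in folder; "style.css" is a placeholder, not a real Qurro file name.
workdir = Path(tempfile.mkdtemp())
src = workdir / "qurro-plot"
src.mkdir()
(src / "index.html").write_text("<html></html>")
(src / "style.css").write_text("body {}")

new_index = copy_visualization(src, workdir / "moved-plot")
print(new_index)
```

Opening the returned index.html from inside the copied folder keeps the relative paths to the other files intact.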

This should really be better documented -- I'll make a note to do that. Thanks for using Qurro!

Jigyasa3 commented 4 years ago

Hey @fedarko, thanks for pointing it out! I have an output HTML file now. I have a follow-up question:

In Qurro index.html file, I can only compare feature ids with common text. But if I want to visualize (but not check for log-ratio) more than one feature type in numerator and denominator, how do I do that?

For example, right now I can compare Numerator="Firmicutes" and Denominator="Bacilli". If I want to visualize multiple taxa like this screenshot below, how can I do that?

image

Jigyasa3 commented 4 years ago

Secondly, when I am comparing two feature IDs in numerator vs denominator, does a positive log-ratio in a group mean that the difference between these two features is more? I understand that I can only draw significant differences between groups after ANOVA or PERMANOVA. But what do positive and negative log-ratios of two feature ids mean in Qurro?

fedarko commented 4 years ago

> In Qurro index.html file, I can only compare feature ids with common text. But if I want to visualize (but not check for log-ratio) more than one feature type in numerator and denominator, how do I do that?

This isn't currently supported in Qurro, unfortunately. I thought I had an issue open reminding myself to work on this, but it looks like I didn't -- so I'm going to keep track of progress on this feature in #253. Making this functionality available will take a while, so I cannot promise it'll be ready any time soon.

That particular figure was manually colored to highlight certain features outside of Qurro in an image editor, which is one way you could display this for now (although this is an inconvenient solution, I know). You could also try loading your rank plot in the Vega Editor (you should be able to do this by clicking on the "..." button next to the rank plot and clicking "Open in Vega Editor") -- this editor will let you change up feature colors or other properties of the plot as much as you want, but the catch is you'll need to know how to work in the Vega-Lite grammar (which can be a bit tricky to get used to, speaking firsthand).

Sorry there aren't a lot of good options for this as of now.

> Secondly, when I am comparing two feature IDs in numerator vs denominator, does a positive log-ratio in a group mean that the difference between these two features is more? [...] But what do positive and negative log-ratios of two feature ids mean in Qurro?

I'm going to be a bit formal here in order to make this explanation clearer :)

So: we can write the (natural) log-ratio of two positive values A and B as ln(A / B) -- or, equivalently, as ln(A) - ln(B).

This means that if ln(A) - ln(B) > 0 (i.e. the log-ratio is positive), then ln(A) > ln(B) (or equivalently, A > B, since ln(x) is monotonically increasing). Similarly, ln(A) - ln(B) < 0 implies that A < B, and ln(A) - ln(B) = 0 implies that A = B.

What does this mean for us? When you see a sample in Qurro with a positive log-ratio, then, all you can really infer from just that is that -- for that sample -- the numerator is larger than the denominator. If you only have "two Feature IDs" selected, then the same stuff with A and B applies -- in this case A is the abundance of the numerator feature in the sample, and B is the abundance of the denominator feature in the sample.
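The sign rule above can be checked with a few lines of Python. This is only an illustration with made-up abundances; the function name `log_ratio` and the sample names are assumptions, not anything Qurro exposes.

```python
import math


def log_ratio(numerator, denominator):
    """Natural log-ratio ln(A / B) of two positive abundances."""
    if numerator <= 0 or denominator <= 0:
        raise ValueError("both abundances must be positive")
    return math.log(numerator) - math.log(denominator)


# Made-up abundances (A, B) for three samples
samples = {"s1": (40, 10), "s2": (5, 20), "s3": (7, 7)}
for name, (a, b) in samples.items():
    lr = log_ratio(a, b)
    relation = "A > B" if lr > 0 else "A < B" if lr < 0 else "A = B"
    print(f"{name}: ln(A/B) = {lr:+.3f} -> {relation}")
```

Note that the sign only says which of the two values is larger within that one sample; it says nothing about differences between groups of samples.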

> I understand that I can only draw significant differences between groups after ANOVA or PERMANOVA.

If you've selected a log-ratio to look at, you should be able to perform "normal" stats tests on the log-ratio values across samples. This means you should be able to check that (for example) the log-ratios of two groups of samples are significantly different.

This is the main reason we designed Qurro to display log-ratios in the first place -- taking the log-ratio of compositional data has a lot of useful properties mathematically. (This open-access paper, especially the second paragraph of section 3, might be a helpful reference about this subject -- it was for me :)

As an example of this in practice, the preprint you showed a screenshot of above (Baker et al. 2019) actually does significance-testing on the log-ratios generated by Qurro. Fig. 2(E)* in the preprint is an example of this -- in this case the authors used a Welch's t-test, but other options for tests are definitely possible.

* (see pages 30 and 40)
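As a rough sketch of what such a test involves, the snippet below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand for two groups of log-ratios. The group names and values are made up, and `welch_t` is a hypothetical helper; this is not the analysis from the preprint.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    # Variance of each group's mean; variance() is the sample variance (n - 1)
    se_a = variance(group_a) / n_a
    se_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Made-up log-ratios for two groups of samples
group_1 = [1.2, 0.8, 1.5, 1.1, 0.9]
group_2 = [-0.3, 0.1, -0.6, 0.2, -0.1, -0.4]
t_stat, dof = welch_t(group_1, group_2)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

In practice you would more likely call `scipy.stats.ttest_ind(group_1, group_2, equal_var=False)`, which also returns a p-value; the manual version is here only to show what is being computed.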

Jigyasa3 commented 4 years ago

Hey @fedarko

Thank you so much for the detailed reply! I really appreciate it! I have one last question (I hope): I am currently running Emperor PCoA on the DEICODE beta-diversity matrix, which gives me the two axes to extract bacterial features from. Then I run Qurro to get feature abundance and log-ratios of the features.

But I have some bacterial taxa that are positive in one axis and negative in the other (screenshot below). This doesn't affect the log-ratio of bacterial features, but if I want to show the most abundant and least abundant bacterial taxa, the visual representation changes drastically. How can I explain that?

[image]

Thanks again for replying back!

fedarko commented 4 years ago

> Thank you so much for the detailed reply! I really appreciate it!

No problem :)

> But I have some bacterial taxa that are positive in one axis and negative in the other (screenshot below). This doesn't affect the log-ratio of bacterial features, but if I want to show the most abundant and least abundant bacterial taxa, the visual representation changes drastically. How can I explain that?

If I understand this correctly, you're asking why some features classified in a given taxonomic group have positive feature loading values for an axis while other features in the same group have negative feature loading values for the same axis?

First off, it's hard to tell from your screenshot, but it looks like your features only go down to the family level? I'm not sure if this is just a fluke for the particular feature you have shown in the tooltip in your screenshot, or if you actually collapsed/binned your table by the family level somehow. (If you did collapse your table, this is not recommended when running DEICODE.)

Also, I'll admit that I'm not sure what you mean by "the most abundant and least abundant bacterial taxa." Are you asking about this relative to a given type of sample in your dataset? Or are you asking about the feature(s) with the highest or lowest feature loading values for a given axis? (I don't think you can interpret these loading values as corresponding strictly to "abundance.")

In any case, members of the same taxonomic group can have pretty different biological functions (there are lots of commonly used examples of this, some of which I guess you've probably heard of -- e.g. various E. coli strains doing different things, or some types of Bacillus being pathogenic while others aren't). So, depending on how you're choosing to select features in Qurro (are you searching for all features in the same family? genus? species? etc.), it would make sense that different members of the same taxonomic group could be associated with pretty different environments, and therefore show up in different locations on a rank plot. (There are also plenty of less-satisfying technical reasons for this sort of variation in practice -- sequencing error, bias, etc. -- which could produce weird results.)

Jigyasa3 commented 4 years ago

Hey @fedarko

Thanks for the advice! I am considering the lowest taxonomic level for bacteria now, and understand what high and low feature loadings are (it's not abundance). I have one question regarding the log-ratios output from Qurro. Along with getting feature loading of bacterial taxa for microbial functions, I am also examining the log-ratios of gene families for each microbial pathway. If the log-ratio for two gene families of interest explain only 23% of the data, is it correct to convert these log-ratios to presence/absence data for downstream analysis?

Note: I am comparing log-ratios of gene families between host sister clades using the PhyloFactor R package (as the hosts are phylogenetically related).

Sorry, this is not directly related to Qurro, but downstream processing of the Qurro data.

fedarko commented 4 years ago

> Along with getting feature loading of bacterial taxa for microbial functions, I am also examining the log-ratios of gene families for each microbial pathway.

Ok. I'm assuming this means that you reran DEICODE on one of the BIOM tables output from something like HUMAnN2, so now the "features" in the Qurro visualization you're looking at are gene families?

> If the log-ratio for two gene families of interest explain only 23% of the data, is it correct to convert these log-ratios to presence/absence data for downstream analysis?

  1. I'm assuming that by "data" in "explain only 23% of the data", you mean a certain sample metadata field.

  2. I don't know what you mean by "convert these log-ratios to presence/absence data", sorry. There are a few possibilities I think you might mean --

    • Do you mean something like encoding each sample with a valid log-ratio as 1, and encoding other samples as a 0?
    • Or doing something like that but based on whether or not a sample's log-ratio is > 0 or < 0?
    • Or something else?

(For reference, I don't think either of the 2 mentioned approaches would be useful in most cases...)

In any case, I don't have much experience in functional annotation or PhyloFactor, so I can't say for sure if these approaches would be "correct" or not. Sorry!

All Qurro is doing is taking the log-ratios of certain features (or sums of certain features) for each sample in your dataset; how you interpret these log-ratios is up to you. I was going to suggest asking this question to the PhyloFactor maintainers, but I just checked their repository and it looks like you've already done that :)
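To make "log-ratios of sums of features" concrete, here is a small sketch. It assumes that a sample whose numerator or denominator sum is zero has no valid log-ratio; the function name, feature IDs, sample IDs, and counts are all made up, and this is not Qurro's actual code.

```python
import math


def summed_log_ratio(sample_counts, numerator_ids, denominator_ids):
    """ln(sum of numerator features / sum of denominator features) for one sample.

    Returns None when either sum is zero, since the log-ratio is undefined.
    """
    top = sum(sample_counts.get(f, 0) for f in numerator_ids)
    bottom = sum(sample_counts.get(f, 0) for f in denominator_ids)
    if top == 0 or bottom == 0:
        return None
    return math.log(top / bottom)


# Made-up counts; feature and sample IDs are hypothetical
table = {
    "sample1": {"f1": 10, "f2": 30, "f3": 20},
    "sample2": {"f1": 0, "f2": 0, "f3": 15},
}
for sample, counts in table.items():
    print(sample, summed_log_ratio(counts, ["f1", "f2"], ["f3"]))
```

Here sample1 gets ln(40 / 20) and sample2 gets no value, because its numerator features are all zero.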

I'm sorry I can't be of more help -- this is around the limits of what I know right now.

Jigyasa3 commented 4 years ago

Thank you so much for replying and checking out phylofactor too! I will find something. Thank you for all the help!

Jigyasa3 commented 4 years ago

Hey @fedarko

Sorry for another message. I read the DEICODE paper (https://msystems.asm.org/content/4/1/e00016-19#sec-7) and the tutorials on Qurro. I don't understand how the feature loading figure is made in the paper (shown as a biplot), and in Qurro index.html webpage. What I mean is that the paper and the tutorial says that the log fold change of each feature is plotted. Does that mean log fold change of feature A divided by all other features?

Thanks for the help!

fedarko commented 4 years ago

> I don't understand how the feature loading figure is made in the paper (shown as a biplot), ...

The feature loading figures (aka "rank plots") in the DEICODE paper were not made using Qurro; the authors of that paper wrote custom code to plot those loadings. I think most of the code to generate the paper figures should be in this repository.

I'm not sure what you mean by "shown as a biplot," since as far as I can tell none of the figures in the DEICODE paper are biplots. There are a couple of more general ordinations, though.

> ... and in Qurro index.html webpage.

The rank plot you see in Qurro is analogous to the top Fig. 5C / 5D figures (the rank plots) in the DEICODE paper, at least to the best of my knowledge. For each feature, its y-axis value is the literal feature loading value. The x-axis placement of a feature is determined by however you choose to sort/rank the features by their loadings.
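A minimal sketch of how rank-plot coordinates follow from the description above, assuming features are sorted in ascending order of their loading on one axis. The feature IDs and loading values are made up, and this is not Qurro's actual plotting code.

```python
# Made-up loadings for one axis; feature IDs are hypothetical
loadings = {"featA": 0.42, "featB": -1.10, "featC": 0.05, "featD": -0.30}

# Sort features by loading; the position in this order is the x-axis value
ranked = sorted(loadings.items(), key=lambda item: item[1])
points = [(x, feature, y) for x, (feature, y) in enumerate(ranked)]

for x, feature, y in points:
    print(x, feature, y)
```

Choosing a different axis (or ranking column) changes both the y values and the resulting x order, which is why the same feature can sit in very different places on two rank plots.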

There are a few minor differences due to these plots being produced by different code -- some of the differences I know of off the top of my head are:

> What I mean is that the paper and the tutorial says that the log fold change of each feature is plotted. Does that mean log fold change of feature A divided by all other features?

I'm not sure what you mean here, or what paper / section you're talking about. We've started using the term "differential" to mean the log-fold change of a feature relative to a given covariate, but this is something different from a "feature loading".

Also, I can't find anything in the Qurro or DEICODE tutorials that mentions log-fold changes.

The DEICODE paper briefly mentions fold changes in 3 parts of the paper, as far as I can tell? (I just ctrl-F'd for "fold" and excluded irrelevant results.) As I understand it, the paper says that Aitchison distance inherently accounts for fold-change of features across samples (as shown in Fig. 1) -- is this what you mean? When you say "log fold change of feature A divided by all other features", it sort of sounds like you're asking about the clr or rclr functions, since their equations are sort of similar to that.
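For reference, here is a sketch of why clr resembles "feature A divided by all other features": each value is divided by the geometric mean of all features in the sample before taking the log. The `rclr` function below reflects only a general understanding of the "robust" variant (centering on the nonzero entries and treating zeros as missing) and is an assumption; check the paper's "Materials and Methods" for the actual definition.

```python
import math


def clr(values):
    """Centered log-ratio: ln(x_i) minus the mean of ln(x) over all features."""
    logs = [math.log(v) for v in values]
    center = sum(logs) / len(logs)
    return [value - center for value in logs]


def rclr(values):
    """Sketch of a 'robust' clr: center using only the nonzero entries.

    Zero entries are returned as None (treated as missing) in this sketch.
    """
    logs = [math.log(v) for v in values if v > 0]
    center = sum(logs) / len(logs)
    return [math.log(v) - center if v > 0 else None for v in values]


# Made-up abundances for one sample
print(clr([1, 2, 4]))
print(rclr([1, 0, 4]))
```

Subtracting the mean of the logs is the same as dividing by the geometric mean, so the transformed values of a sample always sum to zero.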

If you have further questions about how DEICODE works, I'd suggest reviewing the equations and text in the "Materials and Methods" section of the paper and asking questions on the DEICODE repository (or on the QIIME 2 forums).

Hope this helps!

fedarko commented 4 years ago

I'm going to close this issue for now -- if you have any other questions about Qurro, feel free to open a new issue :)