biocore / qiime

Official QIIME 1 software repository. QIIME 2 (https://qiime2.org) has succeeded QIIME 1 as of January 2018.
GNU General Public License v2.0

Loadings/Weights on OTUs from PCoA #888

Open jpetteng opened 11 years ago

jpetteng commented 11 years ago

Would be great to have the loadings associated with each OTU from a PCoA to better explore the taxonomic differences between samples/treatments. This can also be done by comparing OTU abundances within the OTU table or using otu_category_significance.py but those results may not be completely correlated with the loadings from a PCoA.

Thanks

alcedo76 commented 11 years ago

:+1:

gregcaporaso commented 11 years ago

@jpetteng and @alcedo76, have you looked at QIIME's biplots (http://qiime.org/tutorials/tutorial.html#generate-3d-bi-plots)? Do those get you what you're looking for?

jpetteng commented 11 years ago

Sorry for the delayed response.

I have not dealt much with the biplots, but they do look like they portray some information about the actual taxonomic groups driving differentiation within the multivariate plot.

If that analysis outputs the raw data used to create the plots that seems like it would be a pretty good analog to the actual loadings from the PCoA.

I am assuming that whatever function is being used to run the PCoA outputs loadings (or at least has the option to) but maybe that assumption is false and it is more complicated than that.

Thanks for giving it some consideration.

alcedo76 commented 11 years ago

I agree with @jpetteng.

Biplots are very useful and can substitute for PCoA weights. However, I was wondering if we could have access to the weight scores for each principal coordinate, and whether each OTU contributes significantly to that axis. As in principal component analysis, we could use the components as dependent or independent variables in regressions or general linear models. The reason we are not using these axes is that we do not know what each axis represents.

It is likely that at the family, genus or species level the loadings could be a mess, but for higher taxonomic groups (phylum or class, even order) they could be very useful.

Thanks a lot.

wdwvt1 commented 11 years ago

I agree, we should return the factor loadings, and it wouldn't involve much additional work, as we would just be returning some of the eigenvalues/eigenvectors from the PCoA calculation that's already occurring. This can be a 1.8 assignment for me if you like.

gregcaporaso commented 11 years ago

OK, sounds like a good plan. @wdwvt1, assigning this to you. Maybe this would make the most sense to hook up as an optional feature of principal_coordinates.py? If the user provides a biom table (so it could work with either taxa tables or OTU tables), it would write this information to a separate output file. I would recommend formatting the output file as an observation metadata file (see here), as it could ultimately be useful to add that information into a biom table (but even if not, it will be a convenient format for it).

wdwvt1 commented 11 years ago

The output format you described seems excellent -- thanks @gregcaporaso! I will examine what is currently in cogent for this and see what needs to be added where. I think it should require only a single additional function, but will report back when I learn more.

antgonza commented 11 years ago

Note that this process is pretty similar to how we calculate the biplots, which is done in make_3d_plots.py or make_emperor.py.

wdwvt1 commented 10 years ago

So after some reading and deliberation, I am not sure this is possible. When the featureXsample table is converted to sampleXsample distances using whatever metric, we lose all information about the individual features and how they contribute to the sample distances. Additional factor analysis (not implemented in QIIME) might provide a way to tell us which features are contributing if the distance metric chosen is Euclidean (i.e. when PCoA is the same as PCA). As far as I understand, the 'factor loadings' are just the eigenvalues (and the associated variance explained calculations). I think that means that this issue should be closed in favor of PCA and factor analysis inclusion for QIIME 2.0.

rob-knight commented 10 years ago

Yes that would make sense as a solution. Or you can do correlations between taxa and PCs which gets at the same issue.
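
The correlation approach mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`pearson`, `taxa_axis_correlations`) and the dictionary layout are hypothetical and are not part of QIIME, and it assumes you have already parsed the taxa table and the PCoA coordinates yourself. Pearson correlation is used for brevity; a rank-based correlation may be more appropriate for abundance data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def taxa_axis_correlations(abundances, pc_coords, sample_ids):
    """Correlate each taxon's abundance with each PC axis.

    abundances: {taxon: {sample_id: abundance}}   (hypothetical layout)
    pc_coords:  {sample_id: [pc1, pc2, ...]}      (hypothetical layout)
    Returns {taxon: [r_pc1, r_pc2, ...]}.
    """
    n_axes = len(pc_coords[sample_ids[0]])
    result = {}
    for taxon, per_sample in abundances.items():
        values = [per_sample[s] for s in sample_ids]
        result[taxon] = [
            pearson(values, [pc_coords[s][axis] for s in sample_ids])
            for axis in range(n_axes)
        ]
    return result
```

Note that these correlations are a post hoc description of each axis, not loadings in the PCA sense, and testing many taxa this way would need a multiple-comparison correction.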

bloman2 commented 9 years ago

Hello!

I realize that this conversation occurred quite a while ago at this point, but I was referred here from the QIIME google forum today as an explanation as to why discovering which OTUs are contributing to each PC axis in the beta diversity 3d plot isn't possible.

I would like to express my very high interest in making this a priority in the present development of QIIME. I really like the graphical output through Emperor, but if I can't state which OTUs are driving those Eigenvalues/Proportion Explained/PC axis then unfortunately I am inclined to ditch this technology for one that can actually provide a direct explanation for what I'm seeing in the graphic, even if that software's graphic is of significantly lower quality. Not to say that I've found such software yet...

Thanks for taking this under consideration!

antgonza commented 9 years ago

Just to be clear, the shortcoming is from PCoA and not from QIIME/Emperor. PCoA uses a distance matrix (unweighted/weighted UniFrac, Bray-Curtis, etc.), which hides the contributions of each OTU. On the other hand, PCA can generate these values because everything is done in Euclidean space (it is equivalent to PCoA on a Euclidean distance matrix, so you could use Euclidean distances in PCoA). However, using Euclidean distance to check for dissimilarities is not the best for microbial data (see PMID: 20818378).

Now, you could generate your plots using Euclidean distances in QIIME, display the results in Emperor and calculate the weight factors in another tool. They should match. Also, note that we are planning to add the final step to QIIME 2.0.
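
To illustrate why the Euclidean case is special, here is a minimal standard-library sketch, not QIIME code. It runs PCoA on Euclidean distances for the first axis only and then recovers that axis's loadings from the centered feature table. The function names are hypothetical, and power iteration is used only to keep the example dependency-free; a real analysis would use a linear algebra library and all axes. The recovery step relies on the Gower-centered matrix equalling the centered table times its transpose, which holds for Euclidean distances and not for metrics such as UniFrac or Bray-Curtis.

```python
import math


def gower_center(d2):
    """Double-center squared distances: B = -1/2 * J * D^2 * J."""
    n = len(d2)
    row_mean = [sum(r) / n for r in d2]  # equals column means (symmetric matrix)
    grand = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]


def leading_eigenpair(b, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(b)
    u = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    lam = sum(u[i] * sum(b[i][j] * u[j] for j in range(n)) for i in range(n))
    return lam, u


def pcoa_axis1_with_loadings(x):
    """x: samples-by-features table. Returns (axis-1 coords, loadings, centered table)."""
    n, p = len(x), len(x[0])
    means = [sum(row[k] for row in x) / n for k in range(p)]
    xc = [[row[k] - means[k] for k in range(p)] for row in x]
    # Squared Euclidean distances between samples: features are "hidden" from here on.
    d2 = [[sum((a - b) ** 2 for a, b in zip(x[i], x[j])) for j in range(n)]
          for i in range(n)]
    lam, u = leading_eigenpair(gower_center(d2))
    coords = [math.sqrt(lam) * ui for ui in u]
    # Euclidean case only: loadings = Xc^T u / sqrt(lambda).
    loadings = [sum(xc[i][k] * u[i] for i in range(n)) / math.sqrt(lam)
                for k in range(p)]
    return coords, loadings, xc
```

Projecting the centered table onto the returned loadings reproduces the PCoA coordinates, which is the sense in which the two results should match.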

Finally, if you want to see the contributions of OTUs to the distribution in space, biplots should be able to do that. Suggest looking at figure 1 in PMID: 23861384.
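
For readers unfamiliar with biplots, one common way to place taxa in the same space as the samples is an abundance-weighted average of the sample coordinates. The sketch below shows that idea only; it is an assumption for illustration and is not claimed to be exactly what make_3d_plots.py or Emperor do, and the function name and data layout are hypothetical.

```python
def taxa_biplot_coords(rel_abundance, sample_coords):
    """Place each taxon at the abundance-weighted mean of the sample coordinates.

    rel_abundance: {taxon: {sample_id: relative abundance}}  (hypothetical layout)
    sample_coords: {sample_id: [pc1, pc2, ...]}              (hypothetical layout)
    Taxa with zero total abundance are skipped.
    """
    n_axes = len(next(iter(sample_coords.values())))
    coords = {}
    for taxon, per_sample in rel_abundance.items():
        total = sum(per_sample.values())
        if total == 0:
            continue  # no samples contain this taxon, so it has no position
        coords[taxon] = [
            sum(weight * sample_coords[s][axis] for s, weight in per_sample.items()) / total
            for axis in range(n_axes)
        ]
    return coords
```

A taxon found in only one sample lands exactly on that sample; a taxon spread evenly across all samples lands near the centroid, which is why taxa far from the origin are the ones worth inspecting.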

bloman2 commented 9 years ago

Sincere thanks for the explanation and especially for citing the helpful resources. After reading the explanations provided by others, I wondered if there would be a way to retain or record the contributions of each OTU to the distance matrix used by UniFrac (but I am far from understanding any of the mechanics). I think that biplots will work well for now. I look forward to the 2.0 release!

antgonza commented 9 years ago

I think that's the idea of weighted (account for abundance) metrics ... thank you for your feedback and comments.