Add Canonical Correspondence Analysis (CCA)

Jorge-C commented 11 years ago

Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).

It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.

This is already implemented in the R package vegan (already used in compare_categories.py) so it should be easy to wrap this too, and really helpful.

Can I get this assigned to me? I hope to have this before the 1.8 code freeze.

wdwvt1 commented 11 years ago

I am excited for this feature addition. Is there any possibility of getting it written in python as opposed to wrapping the R package? In my experience, since most developers in the lab are far more familiar with python than with R, code maintainability and extensibility is lower when something is written in R. For instance, supervised_learning.py has had simple outstanding bugs about verbose output and some feature requests (internal discussion, no issue requests) that have long gone unresolved because people are less comfortable with R. Just my two cents, excited for this either way.

On Fri, Nov 8, 2013 at 12:33 PM, Jorge Cañardo Alastuey < notifications@github.com> wrote:

Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).

It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.

This is already implemented in the R package vegan (already used in compare_categories.py) so it should be easy to wrap this too.

Can I get this assigned to me? I hope to have this before the 1.8 code freeze.

— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183 .

josenavas commented 11 years ago

Good point @wdwvt1

@Jorge-C do you think is too much overhead to implement it on python rather than R?

antgonza commented 11 years ago

+1 both the addition and python version.

Also, the output should look like a PCoA file to be able to use it with our current visualization tools. You can use nmds.py as a reference.

ElDeveloper commented 11 years ago

+1

On Nov 8, 2013, at 1:03 PM, Will Van Treuren notifications@github.com wrote:

I am excited for this feature addition. Is there any possibility of getting it written in python as opposed to wrapping the R package? In my experience, since most developers in the lab are far more familiar with python than with R, code maintainability and extensibility is lower when something is written in R. For instance, supervised_learning.py has had simple outstanding bugs about verbose output and some feature requests (internal discussion, no issue requests) that have long gone unresolved because people are less comfortable with R. Just my two cents, excited for this either way.

On Fri, Nov 8, 2013 at 12:33 PM, Jorge Cañardo Alastuey < notifications@github.com> wrote:

Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).

It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.

This is already implemented in the R package vegan (already used in compare_categories.py) so it should be easy to wrap this too.

Can I get this assigned to me? I hope to have this before the 1.8 code freeze.

— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183 .

— Reply to this email directly or view it on GitHub.

wdwvt1 commented 11 years ago

I would be happy to help with any additional overhead that is required for a python implementation @Jorge-C.

On Fri, Nov 8, 2013 at 1:07 PM, Yoshiki Vázquez Baeza < notifications@github.com> wrote:

+1

On Nov 8, 2013, at 1:03 PM, Will Van Treuren notifications@github.com wrote:

I am excited for this feature addition. Is there any possibility of getting it written in python as opposed to wrapping the R package? In my experience, since most developers in the lab are far more familiar with python than with R, code maintainability and extensibility is lower when something is written in R. For instance, supervised_learning.py has had simple outstanding bugs about verbose output and some feature requests (internal discussion, no issue requests) that have long gone unresolved because people are less comfortable with R. Just my two cents, excited for this either way.

On Fri, Nov 8, 2013 at 12:33 PM, Jorge Cañardo Alastuey < notifications@github.com> wrote:

Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).

It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.

This is already implemented in the R package vegan (already used in compare_categories.py) so it should be easy to wrap this too.

Can I get this assigned to me? I hope to have this before the 1.8 code freeze.

— Reply to this email directly or view it on GitHub< https://github.com/qiime/qiime/issues/1183> .

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183#issuecomment-28093757 .

jairideout commented 11 years ago

:+1:

Note that compare_categories.py has db-RDA, which wraps vegan::capscale. As I understand it, these are related methods (though it sounds like CCA takes an OTU table while capscale operates on a distance matrix).

Jorge-C commented 11 years ago

Thanks everyone for the inputs.

I really like the idea of implementing it in python, and I guess the overhead of actually having to implement the method is smaller than having to write R myself and maintaining it as a group.

Jorge-C commented 10 years ago

I still need to do some cleanup but I finally got the math right! Source code will be coming soon.

Implemented stuff

Apart from CCA I also implemented Redundancy Analysis (RDA) and Canonical Analysis. The reason is that I only found one description of CCA without prerequisites, but I couldn't fully match its output to vegan's. Besides, despite simplifying the algorithm (removing some matrix inverses) it still felt like it did too much. So I had to go to Numerical Ecology (2nd edition), and it says that CCA is like RDA but changing this and that so I had to implement both... CA is a much more simple method and most of the time went into proving that some formulas weren't quite right.

Plotting

I'm not sure how the format in nmds.py would let us get the desired plots (called {bi, tri}plots, but I discussed this with @josenavas and they're not the kind of biplots in QIIME). The output should look like this:

http://www.canodraw.com/triplot.png

That can be done in fairly standard matplotlib, and in fact there's a function in cogent (in multivariate_plot.py) that can show them. The problem is that it's virtually useless because given the plot there's no way to know what species/site is represented by each point so you can't analyze anything. That could be fixed by including annotations and increasing the number of arguments (9 right now, one is a dictionary with up to five arrays which are the actual data) but the function is already long and needs over a hundred lines of auxiliary code... Before finding this I had written my own plotting function, which let's you annotate points and is shorter than just the other main routine.

Interface

I've been thinking about the interface, and for a standalone package useful in ecology, it makes a lot of sense to work in a similar way to vegan: if the matrices of observations and environmental variables are dataframes, all output is nicely named. Thus, I'd expect the input here to be pandas dataframes and still work for arrays, but then naming would not be ideal. By the way, is anyone using pandas? It's like R's dataframes, great to work with data, fast, deals with time series, groupbys... What are your thoughts for QIIME integration? I assume the contingency table would come as a .biom file, and names would be in a mapping file, right? What about the environmental matrix?

Issues on the way

Some things that slowed me down were finding answers to basic questions like "is it using the MLE variance estimator or the sample one", having to implement weighted std (I'll see if there's interest to add it in numpy), avoiding mistakes in the references and going from their numerically-terrible equations to more stable computations.

ToDo

So, as a summary of important stuff left to do:

Improve testing (thoughts on using r2py to compare some results with R's vegan? Comparing results, even in different machines, is not straightforward because signs don't matter in eigenvectors and I guess even the scaling can change)
Cleanup (I finally know what stuff is interesting to show, I need to remove columns full of ~0s by keeping track of some matrix ranks, and you don't want to see me returning locals() to be able to find what intermediate variable I needed to scale some result in some case ;))
Improve docs (how expanded should they be? I guess a small introduction to each method, some pointers on when they can be useful, and references would be enough.)
Part of the interface and QIIME integration.

Nice stuff that could also be added:

Significance tests for RDA and CCA. Explained in (Legendre & Legendre 1998, 11.3.2) and I think included in vegan.
Partial CCA/RDA (being able to remove the effect of a matrix, ie, seasonal variations). Included in vegan.
Adding scores computation for more scalings (currently 1 and 2, there's also 3 and with negative numbers). Figuring out the right numbers to compute should be quick but requires more docs to help users choose.
Another package (anacor) can also work with missing data. I don't know how different the algorithm is.

I thought about handling sparse abundance matrices but one of the first steps is to center it, which destroys sparsity. On the other hand, it could be helpful to replace scipy.sparse.svds for np.linalg.svd and only compute a handful of singular values/vectors for really big matrices (but not in general, it uses a much more naïve algorithm).

josenavas commented 10 years ago

@wdwvt1 Your input in this topic will be very useful. Thanks!

wdwvt1 commented 10 years ago

@Jorge-C and I are going to meet this week. I am not sure how helpful I will be, but I think its awesome that we will finally have this available. Looking forward to talking with you guys!

On Mon, Dec 9, 2013 at 9:24 AM, josenavas notifications@github.com wrote:

@wdwvt1 https://github.com/wdwvt1 Your input in this topic will be very useful. Thanks!

— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183#issuecomment-30146288 .

biocore / qiime