Open Jorge-C opened 11 years ago
I am excited for this feature addition. Is there any possibility of getting it written in python as opposed to wrapping the R package? In my experience, since most developers in the lab are far more familiar with python than with R, code maintainability and extensibility is lower when something is written in R. For instance, supervised_learning.py has had simple outstanding bugs about verbose output and some feature requests (internal discussion, no issue requests) that have long gone unresolved because people are less comfortable with R. Just my two cents, excited for this either way.
On Fri, Nov 8, 2013 at 12:33 PM, Jorge Cañardo Alastuey < notifications@github.com> wrote:
Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).
It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.
This is already implemented in the R package vegan (already used in compare_categories.py) so it should be easy to wrap this too.
Can I get this assigned to me? I hope to have this before the 1.8 code freeze.
— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183 .
Good point @wdwvt1
@Jorge-C do you think is too much overhead to implement it on python rather than R?
+1 both the addition and python version.
Also, the output should look like a PCoA file to be able to use it with our current visualization tools. You can use nmds.py as a reference.
+1
On Nov 8, 2013, at 1:03 PM, Will Van Treuren notifications@github.com wrote:
I am excited for this feature addition. Is there any possibility of getting it written in python as opposed to wrapping the R package? In my experience, since most developers in the lab are far more familiar with python than with R, code maintainability and extensibility is lower when something is written in R. For instance, supervised_learning.py has had simple outstanding bugs about verbose output and some feature requests (internal discussion, no issue requests) that have long gone unresolved because people are less comfortable with R. Just my two cents, excited for this either way.
On Fri, Nov 8, 2013 at 12:33 PM, Jorge Cañardo Alastuey < notifications@github.com> wrote:
Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).
It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.
This is already implemented in the R package vegan (already used in compare_categories.py) so it should be easy to wrap this too.
Can I get this assigned to me? I hope to have this before the 1.8 code freeze.
— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183 .
— Reply to this email directly or view it on GitHub.
I would be happy to help with any additional overhead that is required for a python implementation @Jorge-C.
On Fri, Nov 8, 2013 at 1:07 PM, Yoshiki Vázquez Baeza < notifications@github.com> wrote:
+1
On Nov 8, 2013, at 1:03 PM, Will Van Treuren notifications@github.com wrote:
I am excited for this feature addition. Is there any possibility of getting it written in python as opposed to wrapping the R package? In my experience, since most developers in the lab are far more familiar with python than with R, code maintainability and extensibility is lower when something is written in R. For instance, supervised_learning.py has had simple outstanding bugs about verbose output and some feature requests (internal discussion, no issue requests) that have long gone unresolved because people are less comfortable with R. Just my two cents, excited for this either way.
On Fri, Nov 8, 2013 at 12:33 PM, Jorge Cañardo Alastuey < notifications@github.com> wrote:
Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).
It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.
This is already implemented in the R package vegan (already used in compare_categories.py) so it should be easy to wrap this too.
Can I get this assigned to me? I hope to have this before the 1.8 code freeze.
— Reply to this email directly or view it on GitHub< https://github.com/qiime/qiime/issues/1183> .
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183#issuecomment-28093757 .
:+1:
Note that compare_categories.py has db-RDA, which wraps vegan::capscale
. As I understand it, these are related methods (though it sounds like CCA takes an OTU table while capscale operates on a distance matrix).
Thanks everyone for the inputs.
I really like the idea of implementing it in python, and I guess the overhead of actually having to implement the method is smaller than having to write R myself and maintaining it as a group.
I still need to do some cleanup but I finally got the math right! Source code will be coming soon.
Apart from CCA I also implemented Redundancy Analysis (RDA) and Canonical Analysis. The reason is that I only found one description of CCA without prerequisites, but I couldn't fully match its output to vegan's. Besides, despite simplifying the algorithm (removing some matrix inverses) it still felt like it did too much. So I had to go to Numerical Ecology (2nd edition), and it says that CCA is like RDA but changing this and that so I had to implement both... CA is a much more simple method and most of the time went into proving that some formulas weren't quite right.
I'm not sure how the format in nmds.py
would let us get the desired
plots (called {bi, tri}plots, but I discussed this with @josenavas and
they're not the kind of biplots in QIIME). The output should look like
this:
http://www.canodraw.com/triplot.png
That can be done in fairly standard matplotlib, and in fact there's a
function in cogent (in multivariate_plot.py
) that can show them. The
problem is that it's virtually useless because given the plot there's
no way to know what species/site is represented by each point so you
can't analyze anything. That could be fixed by including annotations
and increasing the number of arguments (9 right now, one is a
dictionary with up to five arrays which are the actual data) but the
function is already long and needs over a hundred lines of auxiliary
code... Before finding this I had written my own plotting function,
which let's you annotate points and is shorter than just the other
main routine.
I've been thinking about the interface, and for a standalone package
useful in ecology, it makes a lot of sense to work in a similar way to
vegan: if the matrices of observations and environmental variables are
dataframes, all output is nicely named. Thus, I'd expect the input
here to be pandas dataframes and still work for arrays, but then
naming would not be ideal. By the way, is anyone using pandas? It's
like R's dataframes, great to work with data, fast, deals with time
series, groupbys... What are your thoughts for QIIME integration? I
assume the contingency table would come as a .biom
file, and names
would be in a mapping file, right? What about the environmental
matrix?
Some things that slowed me down were finding answers to basic questions like "is it using the MLE variance estimator or the sample one", having to implement weighted std (I'll see if there's interest to add it in numpy), avoiding mistakes in the references and going from their numerically-terrible equations to more stable computations.
So, as a summary of important stuff left to do:
Nice stuff that could also be added:
I thought about handling sparse abundance matrices but one of the
first steps is to center it, which destroys sparsity. On the other
hand, it could be helpful to replace scipy.sparse.svds
for
np.linalg.svd
and only compute a handful of singular values/vectors
for really big matrices (but not in general, it uses a much more naïve
algorithm).
@wdwvt1 Your input in this topic will be very useful. Thanks!
@Jorge-C and I are going to meet this week. I am not sure how helpful I will be, but I think its awesome that we will finally have this available. Looking forward to talking with you guys!
On Mon, Dec 9, 2013 at 9:24 AM, josenavas notifications@github.com wrote:
@wdwvt1 https://github.com/wdwvt1 Your input in this topic will be very useful. Thanks!
— Reply to this email directly or view it on GitHubhttps://github.com/qiime/qiime/issues/1183#issuecomment-30146288 .
Canonical (or constrained) correspondence analysis is a multivariate ordination technique. It appeared in community ecology (ter Braak 86) and relates community composition to the variation in the environment (or in other factors).
It works from data on abundances or counts of individuals and environmental variables, and outputs ordination axes that maximize niche separation among species. It is better suited to extract the niches of taxa than linear multivariate methods like canonical correlation analysis because it assumes unimodal response curves.
This is already implemented in the R package vegan (already used in
compare_categories.py
) so it should be easy to wrap this too, and really helpful.Can I get this assigned to me? I hope to have this before the 1.8 code freeze.