Adding alternative ordination and clustering methods

mortonjt commented 9 years ago

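Looking at Visumap, the majority of those algorithms are found in scikit-learn or scikit-bio, namely

Ordination

SMACOF
Multidimensional Scaling (Sammon Map)
Correspondence Analysis

Classification

Linear Discriminant Analysis
Nearest Neighbor

Clustering

K-Means Clustering
Mean-Shift Clustering
Spectral Clustering
Agglomerative Clustering

Here are some methods found in Visumap that aren't in either of these packages, but would be really cool to have

Curvilinear Component Analysis
Stochastic Neighbor Embedding
Self-Organizing Map
Self-Organizing Graph

We may also want to consider the following non-linear ordination methods found in scikit-learn

Kernel PCA
Isomap
Probabilistic PCA

Here are a few more ideas that I think would be really cool to implement

Minimum spanning ellipsoids
Convex hull visualization

But before we go about turning this into a TODO list, I think we should first consider the following

Where would the preprocessing procedures live? Should they be placed in QIIME2?
If the answer to the above is yes, then we will need to redesign the current eigenvector file format to potentially hold 2 sets of eigenvectors rather than just 1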

Jorge-C commented 9 years ago

Maybe the computations should be done in skbio?

ElDeveloper commented 9 years ago

A few of these look really interesting, thanks for condensing this into an issue :+1:. The way I see it, clustering and classification could be supplementary to the ordination methods. For example, k-means on its own is entirely irrelevant unless it is put in the context of a visualization, a statistical test, or some other way of interpreting the information. The same is true for classification: it doesn't make much sense on its own either.

As for the file formats, I believe they should all be HDF5-based; this would allow for a more flexible structure than what we have with text files. But this is something that needs to be discussed with the rest of the skbio devs.

Now that we are talking about other cool methods, what I would really like to see is convex hulls and minimum spanning ellipsoids (see #18).

Regardless of the methods that we decide to include, the bigger conversation is how we are going to move forward. The Python objects will live in skbio (no questions asked); however, who provides the CLI (if there is one) is the million-dollar question. For the sake of simplicity I would say QIIME2 does this, but it's tough to say this won't change in the near future.

On (Mar-05-15|14:59), mortonjt wrote:

Looking at Visumap, the majority of those algorithms are found in scikit-learn or scikit-bio, namely

Ordination

SMACOF
Multidimensional Scaling (Sammon Map)
Correspondence Analysis

Classification

Linear Discriminant Analysis
Nearest Neighbor

Clustering

K-Means Clustering
Mean-Shift Clustering
Spectral Clustering
Agglomerative Clustering

Here are some methods found in Visumap that aren't in either of these packages, but would be really cool to have

Curvilinear Component Analysis
Stochastic Neighbor Embedding
Self-Organizing Map
Self-Organizing Graph

We may also want to consider the following non-linear ordination methods found in scikit-learn

Kernel PCA
Isomap
Probabilistic PCA

But before we go about turning this into a TODO list, I think we should consider the following

Where would the preprocessing procedures live? Should they be placed in QIIME2?
If the answer to the above is yes, then we will need to redesign the current eigenvector file format to potentially hold 2 sets of eigenvectors rather than just 1

Reply to this email directly or view it on GitHub: https://github.com/biocore/emperor/issues/356

ElDeveloper commented 9 years ago

:+1:


mortonjt commented 9 years ago

Added the minimum spanning ellipsoid and convex hull visualization to the list above.

I completely agree with making classification/clustering supplementary to ordination. If you take a look at the 3:00 mark in this Visumap demo, you can see that there are different shapes representing different groups. Maybe we can have a similar scheme for visualizing classification/clustering

As far as the file format goes, I think it would be ideal to have the following features

Eigenvectors for samples
Eigenvectors for OTUs (optional)
Eigenvalues

Here are a few ideas to spur discussion

Would we be interested in including sample/observation metadata in this file? That way, we don't have to do the map file preprocessing in Emperor
Would we want to specify sample/observation metadata colors/shapes/sizes in this file?
When jack-knifing/bootstrapping, would we want to have all of these eigenvectors/eigenvalues in a single file, or in a whole bunch of files (like it is now)?

I've just been thinking, if we can resort to having a comprehensive file format that can have all of the plotting data, metadata and visualization features, that would definitely simplify the user interface for Emperor - all you would need to specify is which program you want to run and the input file.
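To make the proposal above more concrete, here is a minimal sketch of what such a single-file record could hold. Everything in it is an assumption for illustration: the class name `OrdinationFile`, its field names, and the use of JSON, which stands in for HDF5 or BIOM only to keep the example dependency-free.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class OrdinationFile:
    """Hypothetical container; the field names are illustrative only."""

    eigenvalues: List[float]
    # sample ID -> coordinates along each axis
    sample_eigenvectors: Dict[str, List[float]]
    # OTU ID -> coordinates along each axis (optional)
    otu_eigenvectors: Optional[Dict[str, List[float]]] = None
    # sample ID -> {metadata category: value} (optional)
    sample_metadata: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def validate(self) -> None:
        """Check that every vector has one coordinate per eigenvalue."""
        n_axes = len(self.eigenvalues)
        for table in (self.sample_eigenvectors, self.otu_eigenvectors or {}):
            for identifier, coords in table.items():
                if len(coords) != n_axes:
                    raise ValueError(
                        f"{identifier}: expected {n_axes} axes, got {len(coords)}")
        unknown = set(self.sample_metadata) - set(self.sample_eigenvectors)
        if unknown:
            raise ValueError(f"metadata for unknown samples: {sorted(unknown)}")

    def to_json(self) -> str:
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "OrdinationFile":
        record = cls(**json.loads(text))
        record.validate()
        return record


example = OrdinationFile(
    eigenvalues=[0.6, 0.3],
    sample_eigenvectors={"S1": [0.1, -0.2], "S2": [-0.1, 0.2]},
    otu_eigenvectors={"OTU1": [0.5, 0.0]},
    sample_metadata={"S1": {"Treatment": "Control"}, "S2": {"Treatment": "Fast"}},
)
restored = OrdinationFile.from_json(example.to_json())
```

With a record like this, a plotting program would only need the path to one file; colors, shapes and sizes could be added as further optional fields in the same way.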

antgonza commented 9 years ago

Note that BIOM already supports sample and observation metadata (http://biom-format.org/documentation/adding_metadata.html). However, we are not taking advantage of this feature anywhere else.

mortonjt commented 9 years ago

Yup. That is a definite possibility: we could potentially use BIOM tables to store eigenvectors and the corresponding metadata. That would definitely be a path of less resistance.

ElDeveloper commented 9 years ago

I completely agree with making classification/clustering supplementary to ordination. If you take a look at the 3:00 mark in this Visumap demo, you can see that there are different shapes representing different groups. Maybe we can have a similar scheme for visualizing classification/clustering

Yes, this would be awesome! See #128.

Would we be interested in including sample/observation metadata in this file? That way, we don't have to do the map file preprocessing in Emperor

That makes perfect sense, though not a lot of people do this through BIOM, except for Phinch.

Would we want to specify sample/observation metadata colors/shapes/sizes in this file?

It would definitely be helpful! But not a requirement.

When jack-knifing/bootstrapping, would we want to have all of these eigenvectors/eigenvalues in a single file, or in a whole bunch of files (like it is now)?

I would be opposed to having them in separate files; if we keep the files very simple, I think it will be better in the long run.

Jorge-C commented 9 years ago

I have no idea if this is feasible, but a cool option could be to have emperor consume python objects somehow and avoid text files altogether.

ElDeveloper commented 9 years ago

Hmmm, I think that can be done with the Emperor object, although not many of the features exposed through the command line are available. The bare-bones implementation is available here: http://biocore.github.io/emperor/build/html/generated/emperor.core.Emperor.html#emperor.core.Emperor

from emperor import Emperor
# hack the world

The idea was to be able to have an IPython HTML repr (which is implemented and works), but then this happened: https://github.com/jupyter/nbviewer/issues/316


Jorge-C commented 9 years ago

Too bad... We should check again with the recently-released 3.0.0 IPython notebook.


mortonjt commented 9 years ago

I raised a new issue on how to handle visualization options here: https://github.com/biocore/emperor/issues/357

I've also thought about how to handle jack-knifing/bootstrapping - could we require the user to put all of the resampled samples into a single table? This would allow us to ditch the option to input a directory for input_coords. But this is definitely not a priority
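As a rough illustration of the single-table idea, the sketch below stacks jackknifed replicates into one long table and summarizes each sample. The row layout (replicate index, sample ID, coordinates) and the function names are assumptions made for this example, not an existing Emperor or skbio format.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# One replicate maps sample ID -> coordinates (hypothetical in-memory layout).
Replicate = Dict[str, List[float]]
Row = Tuple[int, str, List[float]]


def stack_replicates(replicates: List[Replicate]) -> List[Row]:
    """Flatten replicates into rows of (replicate index, sample ID, coordinates)."""
    rows: List[Row] = []
    for index, replicate in enumerate(replicates):
        for sample_id, coords in sorted(replicate.items()):
            rows.append((index, sample_id, list(coords)))
    return rows


def summarize(rows: List[Row]) -> Dict[str, Tuple[List[float], List[float]]]:
    """Per-sample mean and population standard deviation along each axis."""
    grouped: Dict[str, List[List[float]]] = {}
    for _, sample_id, coords in rows:
        grouped.setdefault(sample_id, []).append(coords)
    summary = {}
    for sample_id, coords_list in grouped.items():
        axes = list(zip(*coords_list))  # transpose: one tuple per axis
        summary[sample_id] = ([mean(a) for a in axes], [pstdev(a) for a in axes])
    return summary


replicates = [
    {"S1": [0.0, 1.0], "S2": [2.0, 2.0]},
    {"S1": [2.0, 1.0], "S2": [2.0, 4.0]},
]
rows = stack_replicates(replicates)
summary = summarize(rows)
```

The per-axis spread in `summary` is the kind of value that could drive the size of a confidence ellipsoid around each sample, without needing a directory of coordinate files as input.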

ElDeveloper commented 9 years ago

I just tried it with the latest release of the IPython notebook and it didn't work. :sad:


Jorge-C commented 9 years ago

:cry:

The problem only happens when sharing a notebook via nbviewer/nbconvert, right? I pinged the jupyter issue just in case.

ElDeveloper commented 9 years ago

Yeah, it still is, thanks for doing that!

Yoshiki Vázquez-Baeza


mortonjt commented 9 years ago

Here's another thought,

Would it be a good idea to have a publicly accessible preprocessing API in emperor? I'm thinking it would have the following benefits

Could potentially alleviate complete dependence on skbio/QIIME. We could have preprocessing functions that accept any set of eigenvectors/eigenvalues to plot, which would make it easier for users to use other dimensionality reduction methods (e.g. methods from scikit-learn)
Will prevent bloating in skbio. I think if we start porting over all of our helper functions into skbio (e.g. biplot calculation), this will bloat the skbio API, which could make it harder to maintain.
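A minimal sketch of what such a preprocessing entry point might look like, assuming plain Python sequences as input. The function name `prepare_ordination`, its parameters, and the returned keys are hypothetical and do not correspond to anything that exists in Emperor or skbio.

```python
from typing import Dict, Sequence


def prepare_ordination(sample_ids: Sequence[str],
                       coords: Sequence[Sequence[float]],
                       eigenvalues: Sequence[float],
                       n_axes: int = 3) -> Dict[str, object]:
    """Validate generic ordination results and keep the first n_axes axes."""
    if len(sample_ids) != len(coords):
        raise ValueError("one row of coordinates is required per sample")
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("sample IDs must be unique")
    if any(len(row) != len(eigenvalues) for row in coords):
        raise ValueError("each row needs one coordinate per eigenvalue")
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    n_axes = min(n_axes, len(eigenvalues))
    return {
        "sample_ids": list(sample_ids),
        "coords": [list(row[:n_axes]) for row in coords],
        # share of the total eigenvalue sum carried by each kept axis
        "percent_explained": [100.0 * v / total for v in eigenvalues[:n_axes]],
    }


plot_data = prepare_ordination(
    sample_ids=["S1", "S2"],
    coords=[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    eigenvalues=[6.0, 3.0, 1.0, 0.0],
)
```

Because the function only asks for IDs, coordinates and eigenvalues, the output of any dimensionality reduction method that yields those three things could be passed in, regardless of which package computed it.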

ElDeveloper commented 9 years ago

Do you mean something like the Emperor object in core.py? Admittedly that class needs to expand a bit more to cover the full interface provided through the CLI.


mortonjt commented 9 years ago

Exactly.

ElDeveloper commented 9 years ago

:+1:, this would also help once we start moving to click for the creation of the command line interface.


ConstantinoSchillebeeckx commented 8 years ago

+1 for methods for clustering. Currently I'm doing a pre-processing step on my mapping file using sklearn.cluster.KMeans to generate k clusters.
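For readers who want to try this kind of pre-processing, here is a rough sketch of the idea. It is not the code used above: it swaps scikit-learn's KMeans for a small pure-Python Lloyd's algorithm with farthest-point initialization so that it runs without dependencies, and the column name `KMeansCluster` and the dictionary layout of the mapping file are assumptions.

```python
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points: Sequence[Sequence[float]], k: int,
                  n_iter: int = 100) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic farthest-point initialization, starting from the first point.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels: List[int] = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


def add_cluster_column(mapping: Dict[str, Dict[str, str]],
                       coords: Dict[str, List[float]],
                       k: int,
                       column: str = "KMeansCluster") -> Dict[str, Dict[str, str]]:
    """Return a copy of the mapping rows with a cluster label added per sample."""
    sample_ids = sorted(coords)
    labels = kmeans_labels([coords[s] for s in sample_ids], k)
    by_sample = dict(zip(sample_ids, labels))
    return {s: {**row, column: f"cluster_{by_sample[s]}"}
            for s, row in mapping.items()}


coords = {"S1": [0.0, 0.1], "S2": [0.1, 0.0], "S3": [5.0, 5.1], "S4": [5.1, 5.0]}
mapping = {s: {"Treatment": "Control"} for s in coords}
clustered = add_cluster_column(mapping, coords, k=2)
```

The new column can then be used like any other metadata category, for example to color or shape points by cluster in the plot.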

biocore / emperor

Adding alternative ordination and clustering methods #356