biocore / emperor

Emperor, a tool for the analysis and visualization of large microbial ecology datasets
http://biocore.github.io/emperor/

Adding functionality for compositional biplots #344

Closed mortonjt closed 6 years ago

mortonjt commented 9 years ago

Continuing the thread from https://github.com/biocore/scikit-bio/issues/685

I think having a compositional biplot would be a really nice asset to Emperor.

The basic idea is as follows:

  1) Fill in zero values in the OTU table
  2) Perform the centred log-ratio (clr) transformation on the OTU table
  3) Perform singular value decomposition (SVD)
  4) Plot OTU eigenvectors as vectors and sample eigenvectors as points
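The four steps above can be sketched with NumPy alone. This is only a minimal illustration, not the code from the upcoming PR: the function names `replace_zeros`, `clr` and `compositional_biplot`, the multiplicative zero-replacement scheme, its default `delta`, and the extra column-centring step are all assumptions made for the example.

```python
import numpy as np


def replace_zeros(counts, delta=None):
    """Close each sample (row) to proportions and replace zeros.

    Hypothetical helper: zeros become a small value `delta`, and the
    non-zero parts are shrunk so every row still sums to 1.
    Assumes no sample is entirely zero.
    """
    x = np.asarray(counts, dtype=float)
    x = x / x.sum(axis=1, keepdims=True)
    if delta is None:
        delta = (1.0 / x.shape[1]) ** 2  # assumed default, not from the thread
    zeros = x == 0
    shrink = 1.0 - delta * zeros.sum(axis=1, keepdims=True)
    return np.where(zeros, delta, x * shrink)


def clr(x):
    """Centred log-ratio: log of each part minus the row mean of the logs."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)


def compositional_biplot(counts):
    """Return sample coordinates, OTU loadings and proportion explained."""
    z = clr(replace_zeros(counts))
    # Assumption: also centre each OTU column before the SVD.
    z = z - z.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    samples = u * s   # one row per sample, plotted as points
    otus = vt.T       # one row per OTU, plotted as arrows from the origin
    explained = s ** 2 / np.sum(s ** 2)
    return samples, otus, explained


# Toy OTU table: 3 samples (rows) x 4 OTUs (columns), with some zeros.
counts = np.array([[10, 0, 5, 5],
                   [1, 2, 3, 4],
                   [0, 0, 8, 2]])
samples, otus, explained = compositional_biplot(counts)
```

The first three columns of `samples` and `otus` would be the natural candidates for a 3D plot, with `explained` giving the share of variance on each axis.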

I'm basing this biplot construction off of page 37 here

I'm not sure exactly what the plan is concerning syncing Emperor with the skbio contingency table. But from what I can tell, the biplots function already takes in an OTU table as a parameter. And the clr() function, svd() function and the zero replacement function I have coded up are pretty lightweight at the moment.

Would anyone be down to review an upcoming PR for this?

These plots will have the following dependency

antgonza commented 9 years ago

YES! That will be awesome, if you have questions about Emperor or how to do things, just let us know.

Note that we are currently discussing the best skbio/plotting-tools division. In the meantime, I think the functionality to do those plots should be in skbio, but it should yield variables, objects or files that can be consumed by the plotting tools that do the rendering.

ElDeveloper commented 9 years ago

This is awesome, thanks @mortonjt! Let me know if you need any help!

On (Feb-16-15|22:50), mortonjt wrote:

Continuing the thread from https://github.com/biocore/scikit-bio/issues/685

I think having a compositional biplot would be a really nice asset to Emperor.

  • It can approximate correlations between OTUs
  • The OTU information can explain variability across samples
  • It completely contrasts with the existing PCoA plots on distance metrics

The basic idea is as follows:

  1) Fill in zero values in the OTU table
  2) Perform the centred log-ratio (clr) transformation on the OTU table
  3) Perform singular value decomposition (SVD)
  4) Plot OTU eigenvectors as vectors and sample eigenvectors as points

I'm basing this biplot construction off of page 37 here

I'm not sure exactly what the plan is concerning syncing Emperor with the skbio contingency table. But from what I can tell, the biplots function already takes in an OTU table as a parameter. And the clr() function, svd() function and the zero replacement functions I have are pretty lightweight at the moment.

Would it be okay if I start drafting up a PR for this new biplot in the next few days?


Reply to this email directly or view it on GitHub: https://github.com/biocore/emperor/issues/344

mortonjt commented 9 years ago

Started drafting out the biplot code.

Here's what I'm thinking are the options

I'm leaning towards option 2, partially because I'm not very familiar with the code, but also because I think option 1 may wreck the current API. Any thoughts?

ElDeveloper commented 9 years ago

I think the following is going to help us figure out what is the best path moving forward:

Do we see this as something that can be added as a complement to the existing ordination plots? Should we always have these in the context of a principal coordinates plot? It seems to me that this isn't (or shouldn't be) the case, but if it was, then we would need to have it as part of make_emperor.py.

For the second option, my main question would be: What's the output of this script? What are the options from make_emperor.py that we would need to re-use/replicate? Can we re-use the GUI or do we need something new?

I'm super excited to get this moving and to get this to the point where it is adopted by more users!

On (Feb-26-15|19:20), mortonjt wrote:

Started drafting out the biplot code.

Here's what I'm thinking are the options

  • Modify the existing code to support compositional biplots. So add an extra command line parameter in make_emperor.py. This will also probably require extra arguments in some of the existing functions (e.g. preprocess_otu_table)
  • The alternative is to create a completely different script named make_compositional_biplot.py. Rather than taking a distance matrix, it'll only take an OTU matrix, but will have similar optional arguments.

I'm leaning towards option 2, partially because I'm not very familiar with the code, but also because I think option 1 may wreck the current API. Any thoughts?



mortonjt commented 9 years ago

To be honest, I don't completely understand the math behind the biplots in qiime. I raised this issue here. If they turn out to be the same sort of plots, then I think I can enhance the interpretability of the existing biplots in the future.

The compositional biplots I'm proposing follow a completely different pipeline. If they are merged into make_emperor.py, several things would need to happen. The requirement for the --input_coords parameter will need to be scrapped (since the biplots don't require it). Some of the functions will need to be changed (e.g. preprocess_otu_table), or similar alternative functions will need to be built. I think it is very doable - but I would need to think of a way to merge it in without being too intrusive.

The output of the script will be a javascript file, just like the other PCoA plots. I think I can reuse much of the code to create these plots.

mortonjt commented 9 years ago

I should also add, eventually I think it would be a good idea to have support for tetrahedron plots where 4 of the edges would be PC components. It's also a more natural way to view compositions.

If we decide to add these plots in the more distant future, it would definitely require its own custom scripts, since the plotting utilities to develop this will be completely different.

So to answer your question @ElDeveloper , I think the compositional biplot may fit in make_emperor.py. But there are other compositional plotting tools that probably won't fit as well under the existing framework.

ElDeveloper commented 9 years ago

@mortonjt thanks, I've linked the issue to the original authors of the code, hopefully they'll be able to provide more details about the math behind the code. Originally I only ported the code from QIIME into Emperor.

@antgonza what do you think about that? We could continue to expand make_emperor.py to accommodate the compositional biplots. I'm not sure how I feel about this, we already have these types of plots:

So it would be a matter of adding a new type of plot, which doesn't seem that bad and as long as it doesn't disrupt the main command line interface too much, we should be ok**.

Maybe a good initial approach would be to outline what the script needs to do, along with the CLI options and then figure out whether we should look for a way to integrate into make_emperor.py or if we should instead just create a new type of plot.

This is kind of related to #313, if we were to add click command line interfaces, then we could have something like:

emperor pcoa -i ....
emperor pcoa-vectors .....
emperor pcoa-jackknifed ..
emperor compositional -t ....

Then each plot becomes a subcommand.
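To make the subcommand idea concrete, here is a small sketch of that layout. It deliberately uses `argparse` subparsers from the standard library as a stand-in for click (click groups play the same role), and it is not Emperor's actual interface: the subcommand names and the `-i`, `-t` and `-m` flags are borrowed from this thread, everything else is assumed for illustration.

```python
import argparse


def build_parser():
    """Hypothetical `emperor` CLI in which each plot type is a subcommand."""
    parser = argparse.ArgumentParser(prog="emperor")
    subcommands = parser.add_subparsers(dest="plot", required=True)

    # `emperor pcoa` only knows about coordinates and the mapping file.
    pcoa = subcommands.add_parser("pcoa", help="principal coordinates plot")
    pcoa.add_argument("-i", "--input_coords", required=True)
    pcoa.add_argument("-m", "--map_fp", required=True)

    # `emperor compositional` takes a taxa table instead of coordinates.
    comp = subcommands.add_parser("compositional", help="compositional biplot")
    comp.add_argument("-t", "--taxa_fp", required=True)
    comp.add_argument("-m", "--map_fp", required=True)
    return parser


args = build_parser().parse_args(
    ["compositional", "-t", "taxa.txt", "-m", "map.txt"])
print(args.plot, args.taxa_fp, args.map_fp)
```

With this layout, `emperor compositional --help` would list only `-t` and `-m`, and passing `-t` to `emperor pcoa` is rejected as an unrecognized argument.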

In any case my suggestion moving forward is to do whatever is easier and less disruptive. Once we have a working demonstration of the feature, the path that makes the most sense will be evident. Please let me know if you would like to Skype/Hangouts and talk more about this. Or if you need help with the codebase.


** The 0.9.x series of Emperor has to guarantee full compatibility with QIIME 1.9.0, so if we wanted to make big changes to make_emperor.py, we would have to make sure the changes are backward compatible. Otherwise, we need to create a 1.0 branch and develop against that.

antgonza commented 9 years ago

I think having only one script is the way to go; in the past, maintaining multiple scripts was hard.

Now, perhaps we can do something similar to --compare_plots and use it like this: if you pass a preprocess_otu_table you will get a biplot by default, but if you pass another flag it will produce one of the other compositional plots. However, to fully define the parameter combinations we will need a list of all the parameters and the inputs/outputs each of them has. @mortonjt, could you put this together? Thanks!

mortonjt commented 9 years ago

Listing the parameter combinations for both of these plots here

PCoA biplot

Required
  • -i --input_coords = Input PCoA coordinates *
  • -t --taxa_fp = summarized taxa file
  • -m --map_fp = Input map metadata file
  • --biplot_fp = Output path for taxa coordinates

Optional
  • --number_of_axes
  • --output_dir
  • --add_unique_columns
  • --add_vectors
  • --color_by
  • -n --n_taxa_to_keep
  • -x --missing_custom_axes_values
  • -o --output-dir
  • --number_of_segments
  • --pct_variation_below_one
  • --ignore_missing_samples

Compositional biplot

Required
  • -t --taxa_fp = summarized taxa file
  • -m --map_fp = Input map metadata file
  • --biplot_fp = Output path for taxa coordinates
  • --composition (True/False) *

Optional
  • --number_of_axes
  • --output_dir
  • --add_unique_columns
  • --add_vectors
  • --color_by
  • -n --n_taxa_to_keep
  • -x --missing_custom_axes_values
  • -o --output-dir
  • --number_of_segments
  • --pct_variation_below_one
  • --ignore_missing_samples

mortonjt commented 9 years ago

The only major changes to the command line are marked with a *

This may be the best way to quickly get functionality for the compositional biplots. However, I think the command-line interface will become increasingly hard to modify as more options are added. I think @ElDeveloper's suggestion of using a click interface in a 1.0 branch may be the way to go to ensure flexibility in the future. And I don't think multiple scripts will be required to do this.

antgonza commented 9 years ago

Just for clarity, the only change to the interface at this point is the addition of a flag, right? Also, at this point we do not know exactly which other changes will be needed, or do we? BTW, it is not clear to me how click will make things easier, but I agree on the change ...

mortonjt commented 9 years ago

Right now, I'm writing up a minimal example to generate the 3D compositional biplots - so I'll have an answer about what additional options we'll need in a bit!

But from what I understand at the moment

I actually haven't used the click interface myself. But from what I read, it can easily allow for nested commands - which may ease specifying parameter options (such as specifying required options for each plot). But we can make this decision later

antgonza commented 9 years ago

Thanks for the explanations. So really we need to redefine (or expand) input_coords to make it optional when the composition flag is present.
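One way to picture that change is an option check done after parsing. The sketch below uses `argparse` purely for illustration and is not how make_emperor.py parses its options; the flag names come from the parameter listing above, while the `parse_args` helper, the use of `--composition` as a plain switch, and the error messages are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser: -i is required only without --composition."""
    parser = argparse.ArgumentParser(prog="make_emperor.py")
    parser.add_argument("-i", "--input_coords", default=None)
    parser.add_argument("-t", "--taxa_fp", default=None)
    parser.add_argument("-m", "--map_fp", required=True)
    parser.add_argument("--composition", action="store_true")
    args = parser.parse_args(argv)

    if args.composition:
        # The compositional biplot is computed from the taxa table alone.
        if args.taxa_fp is None:
            parser.error("--composition requires -t/--taxa_fp")
    elif args.input_coords is None:
        # Every other plot still needs PCoA coordinates.
        parser.error("-i/--input_coords is required unless "
                     "--composition is given")
    return args


args = parse_args(["-m", "map.txt", "-t", "taxa.txt", "--composition"])
print(args.composition, args.input_coords)
```

Existing invocations that pass `-i` keep working unchanged under this scheme, which matters for the backward-compatibility constraint mentioned earlier in the thread.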

mortonjt commented 9 years ago

Sorry about that - accidental mouse click ...

Anyways, I think it would be extremely beneficial to add in arrows in both of the biplots. It can yield information about correlation between taxa and variability explained by the taxa.

So, I'll first submit a PR to add this feature in. Then adding in compositional biplots should be really easy.

antgonza commented 9 years ago

Agree!!! :+1:

ElDeveloper commented 9 years ago

I think having only one script is the way to go; in the past, maintaining multiple scripts was hard.

Yes, this was a huge problem with compare_3d_plots.py and make_3d_plots.py.

BTW, it is not clear to me how click will make things easier, but I agree on the change ...

The thing that click would buy us would be clarity. Commands would then be organized in sub-commands which is what other complex command line tools do, and this in turn would only present relevant flags for the sub-command you are using. For example you would only be presented with the -t option if you called emperor biplot --help or emperor compositional-biplot --help.

Anyways, I think it would be extremely beneficial to add in arrows in both of the biplots. It can yield information about correlation between taxa and variability explained by the taxa.

I see you submitted a PR, let's continue that conversation there. THANKS!! :clap:

mortonjt commented 9 years ago

Cool. Now that we've got the taxa vectors functionality merged, we'll need to think about the best way to incorporate the compositional biplot.

I've been thinking - would it be a good idea to put these compositional biplot calculations into QIIME/QIIME2?

That way it would be a preprocessing step, and if Emperor happens to crash, you wouldn't have to regenerate your data.

If we decide that's the way to go, then we would have to rethink how we should redesign the input coordinates file. I mentioned it in this issue

ElDeveloper commented 9 years ago

Do you have any idea of how demanding these methods are in terms of memory and CPU usage? Regardless of the answer to the previous question, I think we'll need a command line interface somewhere; so far it seems like QIIME would be the best place to host these scripts (similarly to principal_coordinates.py, nmds.py).

mortonjt commented 9 years ago

The compositional biplots themselves aren't computationally expensive at all. On a 1000x100000 matrix, it takes ~15 minutes to run SVD.

However, this could become a nuisance if all of the computation takes place in Emperor, particularly if I wanted to do bootstrapping/jackknifing.

ElDeveloper commented 9 years ago

Good to know that! Indeed, that's why I believe this should be somewhere else or if we are going to have this here, it must absolutely be a separate step from the plot creation.

Jorge-C commented 9 years ago

@mortonjt I'm just curious, is that a full svd of a dense matrix? Using numpy? What lapack library?

mortonjt commented 9 years ago

Yup. It's the full SVD of a dense matrix, using NumPy. I have the multithreaded OpenBLAS library installed. But I'm beginning to doubt my memory - maybe I should actually time it ...
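For anyone who wants to time this, a rough sketch follows. It uses a much smaller random matrix than the 1000x100000 case so that it finishes quickly; the `time_svd` helper and the matrix sizes are made up for the example, and the timings will depend on the BLAS/LAPACK build NumPy is linked against. It requests the thin SVD (`full_matrices=False`), since with NumPy's default `full_matrices=True` a 1000x100000 input would produce a 100000x100000 matrix of right singular vectors, roughly 80 GB in float64.

```python
import time

import numpy as np


def time_svd(n_samples, n_otus, seed=0):
    """Time a thin SVD of a dense random matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    matrix = rng.standard_normal((n_samples, n_otus))
    start = time.perf_counter()
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    elapsed = time.perf_counter() - start
    return elapsed, u.shape, s.shape, vt.shape


elapsed, u_shape, s_shape, vt_shape = time_svd(100, 1000)
print(f"thin SVD of a 100x1000 matrix: {elapsed:.3f} s, V^T shape {vt_shape}")
```

Scaling `n_samples` and `n_otus` up in steps gives a feel for how the run time grows before committing to the full-size matrix.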

antgonza commented 9 years ago

From previous discussions:

ElDeveloper commented 6 years ago

This was fixed in #646, and the task of calculating the actual biplot has been deferred to skbio or qiime2 (where the functionality makes more sense).