gm.prcomp optimization for large numbers of species

Hello -

I'm experimenting with the phylogenetic PCA / PACA to analyze microbial communities. My datasets often have thousands of species. I have left gm.prcomp running for hours on what is actually a relatively small dataset for me, and have no idea how much longer it may take, so I started looking into ways to speed things up.

Within the gm.prcomp function, the first hurdle appears to be 'anc.BM'. I first thought the 'sapply' function within could be simply change to something like mclapply to at least parallelize things, but I tried it and it's still taking quite a while, and now takes a lot of memory. It looks to me like there is a lot of redundant computation here - for simple ancestral state prediction, if you start at the tips and move toward the root, then you should only have to visit each node once while accumulating the ancestral states, right? Whereas a function like sapply is going to process each node individually - and the nodes closer to the root have to re-calculate all the more-tipwise results.

I can probably sit and code up something myself, but I'm not sure if I'm missing something important! I'm a bit concerned that even if I put the time into speeding up the ANC portion, later parts are still going to be intractable (e.g. the naive covariance matrix inversion in phylo.mat... which might be possible to speed up with tricks like those used in MCMCglmm?).

Let me know if you have thought about these issues, have any ideas to optimize things for larger datasets, or if you'd be happy for me to try to contribute! Thank you!

Dear Ryan,

Let me give this some thought. I agree the function is not optimized and to be honest, it is now kind of old and could use some updating. I don’t think the use of the sapply function is the issue, and switching to the mclapply approach in your case might not help things too much. I think the problem is the enormous matrices involved in the calculations, both in terms of the memory needed to store them in the function environment plus make the calculations. When I originally contrived this function and its support functions, looking at options in ape and phytools as guidance, the goal was to have a matrix calculation that would be quicker than cycling through a recursive algorithm that finds the node-by-node ancestral character state estimates using PICs, variable-by-variable in a highly multivariate data set. This is what phytools::fastAnc does, using ape::ace, which relies on C to do the computational work. It is a nice approach with a single variable, but if done variable by variable, it suffers a similar problem that you ponder, why keep going to the same node multiple times to do similar calculations? (The way the algorithm works is to reroot the tree, node by node, calculate all PICs but retain just the first PIC as the ancestral state for the target node.) There is certainly a way to make an algorithm that alternatively reroots the tree at each node and then calculates the first PIC for each of many variables — that might be the best solution.

I think the anc.BM function we use is proficient to a point, but when the matrices become too large, the algebra needs to be abandoned for algorithms. Your example is elucidating this as an issue. The algorithms might also be slow but would not be as prohibitive. Ideally, writing C++ code that does what the ape::ace function does with PICs but for multivariate data would be the way to go, but that is quite an undertaking and one that would not be a simple package update. However, I think I can try to find a better solution than what we have now. It maybe will take a couple of weeks though.

Do you need to obtain the ancestral states for your analysis? Or is this just a roadblock for what you are trying to do? There might be faster ways to perform ordinations by avoiding gm.prcomp and using RRPP::ordinate, if you did not explicitly need to put a tree in the plot.

Thanks for the alert!

Mike

On Apr 15, 2024, at 5:13 PM, Ryan McMinds @.***> wrote:

Hello -

I'm experimenting with the phylogenetic PCA / PACA to analyze microbial communities. My datasets often have thousands of species. I have left gm.prcomp running for hours on what is actually a relatively small dataset for me, and have no idea how much longer it may take, so I started looking into ways to speed things up.

Within the gm.prcomp function, the first hurdle appears to be 'anc.BM'. I first thought the 'sapply' function within could be simply change to something like mclapply to at least parallelize things, but I tried it and it's still taking quite a while, and now takes a lot of memory. It looks to me like there is a lot of redundant computation here - for simple ancestral state prediction, if you start at the tips and move toward the root, then you should only have to visit each node once while accumulating the ancestral states, right? Whereas a function like sapply is going to process each node individually - and the nodes closer to the root have to re-calculate all the more-tipwise results.

I can probably sit and code up something myself, but I'm not sure if I'm missing something important! I'm a bit concerned that even if I put the time into speeding up the ANC portion, later parts are still going to be intractable (e.g. the naive covariance matrix inversion in phylo.mat... which might be possible to speed up with tricks like those used in MCMCglmm?).

Let me know if you have thought about these issues, have any ideas to optimize things for larger datasets, or if you'd be happy for me to try to contribute! Thank you!

— Reply to this email directly, view it on GitHub https://github.com/geomorphR/geomorph/issues/48, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUNU4WEJFWXAUD7OEWUMITY5Q7IRAVCNFSM6AAAAABGIC5M4OVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI2DINRTGA4TANY. You are receiving this because you are subscribed to this thread.

Hi Mike -

Thanks for your reply!

That's a good point about just using ordinate() if I don't need to ancestral states. My initial goal was just to look at sample scores and how they're affected by incorporating the phylogeny of the taxa, so that's a good start. I played with it a bit and was still having some speed issues; I wound up just doing the matrix math and svd with base R functions and it worked pretty quick (but there is definitely a chance I'm missing something important about the math...)

I do think it'd be nice to have the ancestral states and to be able to do some of the fancy plotting! But obviously with such a large phylogeny it'd need a few extra complications to filter it down, and only plot a few important nodes or something. So that's a bigger project. Still, I did find this function from Rphylopars that is orders of magnitude faster for ancestral state calculations and could be a drop-in replacement: Rphylopars::anc.recon.

In case you're curious, I am just trying to see how your methods compare/contrast with some 'traditional' microbiome PCoA methods, like using Weighted UniFrac vs. Bray Curtis dissimilarities. The phylogenetic information in a UniFrac PCoA often returns clearer patterns in our data, but I'd like to bridge the gap with more generative (and perhaps more interpretable) models of dimension reduction, like the package 'zinbwave' (which is very nice for my purposes but doesn't really have a way to incorporate phylogenetic information).

Hi Ryan,

Thanks for the tip! I was familiar with Rphylopars::anc.recon at one time but I don’t think at the time I appreciated that it used the Ho and Ané (2014) algorithm and how much better that algorithm is. I might have thought it was fast by extending the same algorithm as fastANC to multivariate data, leveraging C++. I just tried to code the algorithm in R, and although I have some calculations to fix, I can see that it will cycle much faster than what we have going. Might take a few days, but I will work it in.

As a rule we try not to have extraneous dependencies, so using anc.recon is not something we would do. Creating a similar function in C++ might be a one-day goal, but that would require better bench-testing and error-checking, first. Replacing anc.BM with a different algorithm though is low fruit to pick. I can seek to do that. It will not be as fast as anc.recon, but probably only people like you would notice the difference between C++ and R versions.

Thanks again! Mike

On Apr 16, 2024, at 5:55 PM, Ryan McMinds @.***> wrote:

Hi Mike -

Thanks for your reply!

That's a good point about just using ordinate() if I don't need to ancestral states. My initial goal was just to look at sample scores and how they're affected by incorporating the phylogeny of the taxa, so that's a good start. I played with it a bit and was still having some speed issues; I wound up just doing the matrix math and svd with base R functions and it worked pretty quick (but there is definitely a chance I'm missing something important about the math...)

I do think it'd be nice to have the ancestral states and to be able to do some of the fancy plotting! But obviously with such a large phylogeny it'd need a few extra complications to filter it down, and only plot a few important nodes or something. So that's a bigger project. Still, I did find this function from Rphylopars that is orders of magnitude faster for ancestral state calculations and could be a drop-in replacement: Rphylopars::anc.recon.

In case you're curious, I am just trying to see how your methods compare/contrast with some 'traditional' microbiome PCoA methods, like using Weighted UniFrac vs. Bray Curtis dissimilarities. The phylogenetic information in a UniFrac PCoA often returns clearer patterns in our data, but I'd like to bridge the gap with more generative (and perhaps more interpretable) models of dimension reduction, like the package 'zinbwave' (which is very nice for my purposes but doesn't really have a way to incorporate phylogenetic information).

— Reply to this email directly, view it on GitHub https://github.com/geomorphR/geomorph/issues/48#issuecomment-2059982000, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUNU4XO4JEUNAVQ2BEEPUDY5WM5ZAVCNFSM6AAAAABGIC5M4OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANJZHE4DEMBQGA. You are receiving this because you commented.

Great; thanks for the info! I do understand avoiding the clutter of dependencies...

Dear Ryan,

I fixed this time issue by recoding the function to use the Ho and Ané (2014) algorithm. You can update the RRPP/geomorph packages from github ("mlcollyer/RRPP", "geomorphR/geomorph”). I recommend updating both packages because they share some support code, and the anc.BM function was in the support code.

As an example of improvement,

library(phytools) tree <- pbtree(n = 500) Y <- fastBM(tree, nsim = 100)

system.time(ace1 <- anc.BM_old(tree, Y)) user system elapsed 6.085 0.615 6.698 system.time(ace2 <- RRPP:::anc.BM(tree, Y)) user system elapsed 0.011 0.000 0.012 all.equal(ace1, ace2) [1] TRUE

Thanks for alerting us to this issue. This was not a function we would have previously considered optimizing and if not for your example, it would have remained inefficient.

Best, Mike

On Apr 18, 2024, at 1:54 PM, Ryan McMinds @.***> wrote:

Great; thanks for the info! I do understand avoiding the clutter of dependencies...

— Reply to this email directly, view it on GitHub https://github.com/geomorphR/geomorph/issues/48#issuecomment-2064749031, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUNU4T3N256FJQ2GU57C2DY6ACGDAVCNFSM6AAAAABGIC5M4OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRUG42DSMBTGE. You are receiving this because you commented.

That's awesome! Glad I could help!

geomorphR / geomorph

gm.prcomp optimization for large numbers of species #48