adw96 / DivNet

diversity estimation under ecological networks

Speeding up run time on wide datasets #32

Closed amorris28 closed 3 years ago

amorris28 commented 5 years ago

Moved over from twitter.

I'm trying to run divnet on a dataset of 44 samples and 19,921 ASVs. No ASV appears in all samples, so I've chosen a reference ASV, stored in ref_otu, that is present in 42 of the 44 samples. I'm also leaving X = NULL (no design matrix), so I'm only trying to estimate diversity and confidence intervals for each sample. physeq is my phyloseq object. When I run this on a cluster with 28 cores and 128 GB of memory, I don't see any progress after ~30 minutes. Running locally on my 4-core, 16 GB machine, it crashes, I think because it runs out of memory. The function call is below:

asv_div <- divnet(physeq, ncores = 28, base = ref_otu)

Thank you for the help on this!
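
For readers who want to reproduce the reference-taxon choice described above, here is a minimal sketch in base R. It assumes a plain count matrix with samples in rows and taxa in columns; the toy `counts` matrix and the tie-breaking rule (highest total abundance) are illustrative assumptions, not something stated in the thread.

```r
# Toy count matrix: rows are samples, columns are taxa (hypothetical data).
counts <- matrix(
  c(5, 0, 3,
    2, 1, 0,
    7, 0, 4,
    1, 0, 9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), paste0("ASV", 1:3))
)

# Prevalence: number of samples in which each taxon has a nonzero count.
prevalence <- colSums(counts > 0)

# Keep the most prevalent taxa, then break ties by total abundance
# (the tie-break is an assumption made for this sketch).
candidates <- names(prevalence)[prevalence == max(prevalence)]
totals <- colSums(counts[, candidates, drop = FALSE])
ref_otu <- candidates[which.max(totals)]
```

With a phyloseq object, the count matrix would first have to be extracted and oriented so that taxa are in columns before applying the same steps.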

adw96 commented 5 years ago

Hi Andrew! Thanks so much again for using DivNet. Some thoughts

This is a great test case for us so thanks for bringing it to our attention! Never in my wildest dreams did I think that someone would try to run this with 20k taxa. (My imagination stops at around 5k.) I guess I need to work with more soil!

Amy

adw96 commented 5 years ago

@bryandmartin Anything you want to add?

amorris28 commented 5 years ago

Hey Amy!

I'm glad this is a helpful case for you all. I would love to use DivNet going forward, and this is not an atypical data set for our lab group, so learning how to make it work will be super helpful. I will try network='diagonal' and play with the tuning argument to see how things work. Let me know how your simulations go.

Thank you for the quick turn-around! Andrew
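
A sketch of what that adjusted call might look like, wrapped in a hypothetical helper so it can be defined without running anything. The `network = "diagonal"` value comes from the comment above; `tuning = "fast"` is an assumed preset and should be checked against `?divnet` before use. The helper name `run_divnet_diagonal` is invented for illustration.

```r
# Hypothetical wrapper; the name and defaults are illustrative only.
run_divnet_diagonal <- function(physeq, ref_otu, ncores = 1) {
  # Requires the DivNet package and a phyloseq object as input.
  if (!requireNamespace("DivNet", quietly = TRUE)) {
    stop("The DivNet package is not installed.")
  }
  DivNet::divnet(
    physeq,
    base = ref_otu,        # reference taxon, as in the original call
    network = "diagonal",  # do not estimate taxon-taxon interactions
    tuning = "fast",       # assumed preset; confirm with ?divnet
    ncores = ncores
  )
}
```

Usage would be along the lines of `asv_div <- run_divnet_diagonal(physeq, ref_otu, ncores = 4)`, with `physeq` and `ref_otu` as defined in the original post.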

adw96 commented 5 years ago

Ok a quick update (a bigger sim to come): time-vs-q.pdf

Conclusions:

I'm upscaling q (number of taxa) and will see how the trends continue.

mooreryan commented 5 years ago

I was having a similar issue to the original poster (see issue #28). Large numbers of ASVs/OTUs/taxa really aren't feasible, it seems.

I've also found that the ncores option really doesn't provide much benefit.

In a comment on a previous pull request (https://github.com/adw96/DivNet/pull/29#issuecomment-510617485), I found that the MCrow (and MCmat) functions are taking the most CPU time, so today, I started work on rewriting those functions in Rcpp. Still working some kinks out of it, but it's definitely faster.

mooreryan commented 5 years ago

This might be helpful for the original poster as well... While working on speeding up the divnet function, I made this little graph of how run time scales with the number of taxa. The dataset is the included Lee dataset.

If that trend holds for very large numbers of taxa (not sure if it actually would), then running ~20,000 ASVs would take at least a couple of hours.

(Figure: ntaxa_vs_time)
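
One way to turn timings like these into a rough extrapolation is a log-log linear fit, sketched below. The timings here are placeholders generated from an exact power law so that the fit can be checked; they are not measurements from DivNet, and the exponent and constant are arbitrary. Real (number of taxa, seconds) pairs would need to be substituted, and the extrapolation is only as good as the assumption that the trend continues.

```r
# Placeholder timings, NOT measurements: generated from an exact power law.
# Replace with real (n_taxa, seconds) pairs from your own runs.
timings <- data.frame(n_taxa = c(50, 100, 200, 400, 800))
timings$seconds <- 0.002 * timings$n_taxa^1.5

# Fit log(seconds) = a + b * log(n_taxa); b is the empirical scaling exponent.
fit <- lm(log(seconds) ~ log(n_taxa), data = timings)
exponent <- unname(coef(fit)[2])

# Extrapolate to a much larger problem, assuming the same trend still holds.
predicted_seconds <- unname(
  exp(predict(fit, newdata = data.frame(n_taxa = 20000)))
)
```

An exponent well above 1 would indicate that doubling the number of taxa more than doubles the run time, which is the kind of behaviour that makes very wide datasets expensive.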

adw96 commented 5 years ago

This is fantastic to know, @mooreryan! EM-MH algorithms are really well-suited to Rcpp but we just haven't been able to prioritise rewriting it. We would be so rapt if you were to implement it, and we would love to add you as a package coauthor/maintainer.

ch16S commented 2 years ago

Hey everyone,

Really appreciate the work everyone has done on DivNet. Amy, you mentioned that a diagonal matrix is most appropriate for a large number of taxa and a small number of samples, as you cannot reliably estimate the interactions.

Do you think this holds true if I have 1000-2000 samples? The samples are from soil, and are geographically diverse.

Cheers, Chris