Speeding up run time on wide datasets

amorris28 commented 5 years ago

Moved over from Twitter.

I'm trying to run divnet on ASVs with a dataset of 44 samples and 19,921 ASVs. No ASV appears in all samples, so I've chosen a reference ASV that is present in 42 of the 44 samples, indicated by ref_otu. I'm also leaving X = NULL with no design matrix, so I'm just trying to estimate diversity and confidence intervals for each sample. physeq is my phyloseq object. If I run this on a cluster with 28 cores and 128 GB of memory, I don't see any progress after ~30 minutes. Running locally on my 4-core, 16 GB machine, it crashes, I think because it runs out of memory. Function call below:

asv_div <- divnet(physeq, ncores = 28, base = ref_otu)
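For readers wondering how a reference ASV like ref_otu might be picked, here is a minimal sketch in base R. It assumes a plain count matrix with samples in rows and taxa in columns; the toy data and the name counts are hypothetical, and a phyloseq object would first need its OTU table extracted and oriented the same way. This is not necessarily how the original poster chose theirs.

```r
# Toy count matrix (hypothetical data): rows are samples, columns are taxa.
counts <- matrix(
  c(5, 0, 3,
    2, 0, 0,
    7, 1, 4,
    0, 0, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), c("ASV1", "ASV2", "ASV3"))
)

# Number of samples in which each taxon has a nonzero count.
prevalence <- colSums(counts > 0)

# Take the most prevalent taxon as the reference; ties go to the first one.
ref_otu <- names(which.max(prevalence))
ref_otu
```

With real data it is worth checking prevalence[ref_otu] against the number of samples, since a reference taxon missing from many samples may be a poor choice.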

Thank you for the help on this!

adw96 commented 5 years ago

Hi Andrew! Thanks so much again for using DivNet. Some thoughts:

I was concerned that if you're running in parallel then maybe the progress bar doesn't update because the cores don't talk to each other, but I confirmed that's not the case (since we parallelise over the MH steps but the progress bar updates each EM step)
I would recommend network="diagonal" for a dataset of this size. This means you're allowing overdispersion (compared to a plugin aka multinomial model) but not a network structure. This isn't just about computational expense -- it's about the reliability of the network estimates. Essentially estimating network structure on 20k variables (taxa) with 50 samples with any kind of reliability is going to be very challenging, and I don't think that it's worth doing here. In our simulations we basically found that overdispersion contributes the bulk of the variance to diversity estimation (i.e. overdispersion is more important than network structure), so I don't think you are going to lose too much anyway.
You can control the speed-precision trade-off by varying the argument tuning. The default is list(EMiter = 6, EMburn = 3, MCiter = 500, MCburn = 250). Doing fewer EMiters and MCiters reduces runtime. Perhaps try list(EMiter = 6, EMburn = 3, MCiter = 250, MCburn = 100). If you're worried that it's stalling out entirely, to check that it runs, try list(EMiter = 6, EMburn = 3, MCiter = 10, MCburn = 5). Note that we parallelise over MCiter.
I'm running a simulation now to see how DivNet scales with ncores and q (the number of taxa). I'm concerned that there might be some overhead with the parallelisation and perhaps having so many cores hurts you. I'll post my results when I get them.
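Putting the suggestions above together, a call might look roughly like the sketch below. The tuning lists are the ones quoted in this comment; the helper run_divnet, the variable names, and the choice of ncores = 3 are illustrative assumptions, and whether a given DivNet version needs further tuning entries is not something this thread confirms. The call is guarded so that the snippet does nothing unless DivNet is installed and physeq and ref_otu exist.

```r
# Tuning settings quoted in the comment above.
tuning_default <- list(EMiter = 6, EMburn = 3, MCiter = 500, MCburn = 250)
tuning_faster  <- list(EMiter = 6, EMburn = 3, MCiter = 250, MCburn = 100)
tuning_smoke   <- list(EMiter = 6, EMburn = 3, MCiter = 10,  MCburn = 5)

# Hypothetical wrapper (not part of DivNet): diagonal network, explicit tuning.
run_divnet <- function(physeq, ref_otu, tuning, ncores = 3) {
  DivNet::divnet(physeq,
                 base = ref_otu,
                 network = "diagonal",
                 tuning = tuning,
                 ncores = ncores)
}

# Only attempt the fit if the package and the user's objects are available.
if (requireNamespace("DivNet", quietly = TRUE) &&
    exists("physeq") && exists("ref_otu")) {
  smoke_fit <- run_divnet(physeq, ref_otu, tuning_smoke)   # does it run at all?
  asv_div   <- run_divnet(physeq, ref_otu, tuning_faster)  # cheaper than default
}
```

Running the smoke-test settings first gives a quick signal on whether the job is stalling before committing to a longer run.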

This is a great test case for us so thanks for bringing it to our attention! Never in my wildest dreams did I think that someone would try to run this with 20k taxa. (My imagination stops at around 5k.) I guess I need to work with more soil!

Amy

adw96 commented 5 years ago

@bryandmartin Anything you want to add?

amorris28 commented 5 years ago

Hey Amy!

I'm glad this is a helpful case for you all. I would love to use DivNet going forward and this is not an atypical data set for our lab group so getting to know how to make it work will be super helpful. I will try network='diagonal' and playing with the tuning argument to see how things work. Let me know how your simulations go.

Thank you for the quick turn-around! Andrew

adw96 commented 5 years ago

Ok a quick update (a bigger sim to come): time-vs-q.pdf

Conclusions:

no huge gains to adding many cores; 3 is about as good as 6.
no huge gain for diagonal vs naive

I'm upscaling q (number of taxa) and will see how the trends continue.

mooreryan commented 5 years ago

I was having a similar issue to the original poster (see issue #28). Large numbers of ASVs/OTUs/taxa really aren't feasible, it seems.

I've also found that the ncores option really doesn't provide much benefit.

In a comment on a previous pull request (https://github.com/adw96/DivNet/pull/29#issuecomment-510617485), I found that the MCrow (and MCmat) functions are taking the most CPU time, so today, I started work on rewriting those functions in Rcpp. Still working some kinks out of it, but it's definitely faster.

mooreryan commented 5 years ago

This might be helpful for the original poster as well... While working on speeding up the divnet function, I made this little graph of how run time scales with the number of taxa. The dataset is the included Lee dataset.

If that trend holds for very large numbers of taxa (not sure if it actually would), then running ~20,000 ASVs would take at least a couple of hours.

[Figure: ntaxa_vs_time]
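For anyone wanting to produce a similar scaling curve on their own data, here is a rough sketch of a timing harness in base R. It is not the code used for the graph above: time_by_ntaxa and toy_fit are hypothetical names, the data are simulated, and toy_fit is a cheap placeholder workload rather than DivNet. In practice fit_fn might wrap a divnet() call, and the subset would need to keep the reference taxon.

```r
# Time a fitting function on the first n taxa (columns) of a count matrix.
# `fit_fn` is a stand-in for whatever model fit is being benchmarked.
time_by_ntaxa <- function(W, sizes, fit_fn) {
  elapsed <- vapply(sizes, function(n) {
    sub <- W[, seq_len(n), drop = FALSE]
    unname(system.time(fit_fn(sub))[["elapsed"]])
  }, numeric(1))
  data.frame(ntaxa = sizes, seconds = elapsed)
}

# Simulated counts: 20 samples by 200 taxa (hypothetical data).
set.seed(1)
W <- matrix(rpois(20 * 200, lambda = 5), nrow = 20)

# Placeholder workload, NOT DivNet: covariance of log-transformed counts.
toy_fit <- function(w) cov(log(w + 1))

timings <- time_by_ntaxa(W, sizes = c(50, 100, 200), fit_fn = toy_fit)
timings
```

Plotting seconds against ntaxa for a few small sizes gives a sense of the trend, though, as noted above, it may not hold when extrapolated to very large numbers of taxa.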

adw96 commented 5 years ago

This is fantastic to know, @mooreryan! EM-MH algorithms are really well-suited to Rcpp but we just haven't been able to prioritise rewriting it. We would be so rapt if you were to implement it, and we would love to add you as a package coauthor/maintainer.

ch16S commented 2 years ago

Hey everyone,

Really appreciate the work everyone has put into divnet. Amy, you mentioned that a diagonal matrix is most appropriate for a large number of taxa and a small number of samples, as you cannot reliably estimate the interactions.

Do you think this holds true if I have 1000-2000 samples? The samples are from soil, and are geographically diverse.

Cheers, Chris

adw96 / DivNet

Speeding up run time on wide datasets #32