adw96 / DivNet

diversity estimation under ecological networks

Cannot allocate vector of size with suspiciously large phyloseq object #115

Closed. cramjaco closed this issue 2 years ago.

cramjaco commented 2 years ago

Hey Willis lab, I'm trying to run DivNet on a count table that has a perhaps suspiciously high number of unique taxa [69865 taxa, 69 samples]. Per another issue on this site that I can no longer find, I told DivNet not to bother with network methods, as follows:

divnet_asv <- divnet(pscb_oneplus, ncores = 12, tuning = "fast", network = "diagonal")
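For context, a quick sanity check of the object's dimensions with the standard phyloseq accessors (pscb_oneplus is my phyloseq object) looks like this:

```r
library(phyloseq)
ntaxa(pscb_oneplus)     # 69865
nsamples(pscb_oneplus)  # 69
```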

I'm getting the error message: Error: cannot allocate vector of size 190.8 Gb

Here 190 GB is roughly the total memory of the cluster node that I'm using.
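For scale, assuming 8-byte doubles, that failed allocation is about 2.6e10 numbers, and even a single dense taxa-by-taxa matrix at this size would need roughly 36 GB:

```r
# rough arithmetic, assuming 8-byte doubles
190.8 * 1024^3 / 8    # ~2.56e10 elements in the failed allocation
69865^2 * 8 / 1024^3  # ~36 GB for one dense 69865 x 69865 matrix
```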

I realize I have some workaround options.

In any case, I was wondering whether there are options for processing a table this large in DivNet, or if I should move on to the workarounds.

Thanks for any suggestions.

-Jacob

msmcfarlin commented 2 years ago

Hi Jacob,

I am not a DivNet developer, so take my comments with that caveat in mind. Also, I'm not sure about the network methods, so I'll leave any comment on those to a developer.

That taxa count does seem quite high for 69 samples, though without knowing your study system it's hard to say whether you should be suspicious of it. You might ask on the DADA2 page about its usage with your data set.

The error "cannot allocate vector of size..." occurs when R tries to allocate a vector larger than the memory available to it. From what you said, it sounds like the memory limit on the node you're using is 190 GB, and your data, or your data plus whatever other objects are in your R environment, pushes an allocation past that.

Some options might help here, e.g. subsetting your data into smaller sets of samples, or checking how much memory the object itself actually needs.
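For the latter, a minimal sketch with base R, using the object name from above:

```r
# size of the input phyloseq object itself (base R utils)
print(object.size(pscb_oneplus), units = "GB")

# current allocations as R sees them; also triggers a garbage collection
gc()
```

If the object itself turns out to be small, the 190.8 Gb allocation is coming from an intermediate object created during fitting rather than from your data.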

Best, -Mike

mooreryan commented 2 years ago

Yeah, I would also be suspicious of ~70,000 ASVs from 70 samples... it seems abnormally high.

But for argument's sake, let's assume you do have 70,000 good ASVs. The number of samples isn't what's causing the huge resource usage... it's the high number of taxa. Even the Rust version will take a while and use a decent amount of memory on a dataset with 70,000 taxa. If I have more than a few thousand taxa, I generally switch to the Rust version.

Alternatively, you can try collapsing your ASVs to a higher taxonomic level with tax_glom or something similar, to get the number of taxa down to a more manageable level.
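A minimal sketch with phyloseq, assuming "Genus" matches a rank name in your tax_table:

```r
library(phyloseq)
# collapse ASVs to a coarser taxonomic rank to shrink the taxa dimension
ps_genus <- tax_glom(pscb_oneplus, taxrank = "Genus")
ntaxa(ps_genus)  # typically far fewer taxa than the ASV-level table
```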

cramjaco commented 2 years ago

Thanks! Yeah, the Rust version has been crashing on me too; I'll take it up with those developers next. I'm beginning to think there may actually be 70,000 ASVs, since I found another dataset from the Chesapeake Bay that has 300k ASVs in it. I'm not a huge fan of using tax_glom, since I'd really prefer an ASV-level Shannon index rather than one at some other level.

cramjaco commented 2 years ago

Oh, wait, I'm talking to @mooreryan -- you are the divnet-rs developer. I'm having the same problem in divnet-rs: it again overruns the memory allocation that I give it on the cluster (~180 GB) and then the job gets killed. Is it worth opening an issue over on divnet-rs, or should I just not try to calculate DivNet indices on these highly "diverse" datasets?

cramjaco commented 2 years ago

Regarding @msmcfarlin's suggestion about subsetting: is it OK to run divnet or divnet-rs on each sample (or on small sets of samples) separately, assuming I'm not using the network features? That might get me around the memory issues.
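Something like this is what I have in mind (a sketch; the chunk size of 10 is arbitrary):

```r
library(phyloseq)
# split the samples into chunks of ~10 and fit DivNet on each chunk separately
ids <- sample_names(pscb_oneplus)
chunks <- split(ids, ceiling(seq_along(ids) / 10))
fits <- lapply(chunks, function(s) {
  divnet(prune_samples(s, pscb_oneplus), tuning = "fast", network = "diagonal")
})
```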

mooreryan commented 2 years ago

If you would like, feel free to open an issue on the divnet-rs GitHub and we can try to figure out what's going on there.