escamero / mirlyn


How to proceed with the 'betamatPCA' function when the phyloseq object is very large? #8

Open monis4567 opened 1 year ago

monis4567 commented 1 year ago

I used the 'mirl' function on my phyloseq object named ps05 to create a mirl object:

mirl_ps05 <- mirlyn::mirl(ps05, libsize = 1000, rep = 100, set.seed = 120)

Examining the mirl object, I can see it holds a list of 100 phyloseq objects, where each phyloseq object contains:

otu_table [7212 x 237]
tax_table [7212 x 27]
sam_data [237 x 22]

I now wish to use the 'betamatPCA' function, perhaps like this:

betamatPCA_ps05 <- mirlyn::betamatPCA(mirl_ps05, dsim = "bray")

But this appears to be impossible on my computer: RStudio keeps crashing. I assume this happens because the PCA analysis I am requesting is too heavy.

Do you have any good advice on how to proceed with 'betamatPCA' when the phyloseq object is very large? I guess this will end up being a common issue, since DNA metabarcoding very often produces very large otu tables and very large tax tables.

My otu, tax and sam tables can be obtained from these Dropbox links:

https://www.dropbox.com/scl/fi/lck08gsme1b3kiqyyl8en/table_df_tax05.csv?rlkey=n5c1qny7tdrm39la6eb42tiz0&dl=0
https://www.dropbox.com/scl/fi/r4okjl5mefg54cvixdcvt/table_df_sam05.csv?rlkey=t0trgzi77jybreslsnnvial2p&dl=0
https://www.dropbox.com/scl/fi/w3emynu0ohor0iynpcfkw/table_df_otu05.csv?rlkey=gv4g4eo7z5nj10bk2w2bzc6k7&dl=0

bjmt commented 1 year ago

Hi,

Thanks for the detailed info and data. I was able to recreate what you did and determine where the issue is. I have some comments and suggestions; hopefully they are of help to you.

First, running the mirl function with the settings you indicate takes under 4 minutes on my laptop. Then, I ran the component functions inside betamatPCA. First the mirlyn:::repotu_df function is called, which converts the list of 100 phyloseq objects into a single combined matrix. This takes about 2 seconds. Then, the data is transformed with vegan::decostand, which takes about 11 seconds. Next, it runs the vegan::vegdist function. After about 15-20 min it still wasn't finished, so I stopped it. I assume this is where RStudio crashed for you.
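For reference, this is roughly how the component steps can be run by hand to see where the time goes. It is only a sketch: the exact call signatures and arguments that betamatPCA uses internally may differ, the decostand method and the Bray-Curtis choice below are assumptions, and repotu_df is an unexported helper.

    library(vegan)

    # Combine the list of 100 rarefied phyloseq objects into one matrix
    # (repotu_df is unexported from mirlyn, hence the triple colon)
    otu_mat <- mirlyn:::repotu_df(mirl_ps05)

    # Standardise the counts; "hellinger" is only an illustrative choice,
    # not necessarily what betamatPCA applies internally
    otu_std <- vegan::decostand(otu_mat, method = "hellinger")

    # The all-by-all distance matrix is the expensive step
    # (assuming the 23700 rep-by-sample combinations are the rows here)
    system.time(
      otu_dist <- vegan::vegdist(otu_std, method = "bray")
    )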

Unfortunately, after looking at the source code for the vegan::vegdist function, I don't think I could do a much better job of improving the efficiency of the code. Simply having to calculate an all-by-all distance matrix from 23700 samples (since you are using 100 reps) means it will take a long time no matter what. So at this point I could think of two possibilities:

(1) Use a subset of OTUs instead of all 7212 (a sketch of this follows below). I tried with the top 1000 most variable OTUs only and was able to have the vegan::vegdist function finish in 6 minutes. Unfortunately, the next step in betamatPCA is to use the stats::prcomp function to do the actual PCA, but this failed due to insufficient memory.

(2) Reduce the number of reps. I tried with 10, and was able to run the entire betamatPCA function in about 2 minutes (see plot from subsequently running betamatPCAvis).
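A minimal sketch of option (1), assuming taxa are rows in your otu_table and that the mirl output can be treated as an ordinary list of phyloseq objects (object and variable names here are just placeholders):

    library(phyloseq)

    # Rank OTUs by their variance across samples in the original object
    # (assumes taxa are rows in otu_table, i.e. 7212 rows x 237 columns)
    otu <- as(otu_table(ps05), "matrix")
    top_otus <- names(sort(apply(otu, 1, var), decreasing = TRUE))[1:1000]

    # Prune every rarefied phyloseq object in the mirl output to those OTUs
    mirl_top <- lapply(mirl_ps05, function(p) prune_taxa(top_otus, p))

    betamatPCA_top <- mirlyn::betamatPCA(mirl_top, dsim = "bray")

As noted above, in my test this still failed at the prcomp step because of memory, so option (2) is the more practical route.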

In conclusion, unless you have enough memory to satisfy the stats::prcomp function when run using a 23700x23700 distance matrix, I would suggest simply reducing the number of reps. (You could either run mirl with a lower rep number, or simply give betamatPCA a random subset of 10 of the 100 phyloseq objects from the mirl output.)
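For a rough sense of scale, a dense 23700 x 23700 double-precision matrix alone is about 23700^2 * 8 bytes ≈ 4.5 GB, and prcomp will need working copies on top of that. A minimal sketch of the rep-subsetting suggestion, assuming the mirl output can be indexed like an ordinary R list:

    # Take a random subset of 10 of the 100 rarefied phyloseq objects
    set.seed(120)
    mirl_sub <- mirl_ps05[sample(seq_along(mirl_ps05), 10)]

    # Run the PCA on the reduced set
    betamatPCA_sub <- mirlyn::betamatPCA(mirl_sub, dsim = "bray")
    # ...then visualise with betamatPCAvis(); see the mirlyn docs for its arguments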

Please let me know if there was anything that wasn't clear or if you have further questions. Feel free to close the issue otherwise.
