Dear Timothy,
Your analysis and the attempts you made to solve the issue seem right to me. You may try further splitting the 44 samples into, for example, 4 groups of 11 samples. As an alternative, you could try either reducing the number of reads (e.g. setting downsampleFastq = true and maxNumReads = 10000) or performing clustering at lower identity (e.g. clusteringIdentity = 0.9); see the sketch below.
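For example, the overrides might look something like this in metontiime2.conf (just a sketch, assuming the parameters live in the params scope; the exact layout and line numbers may differ between versions):
params {
//cap the number of reads per sample (downsampling strategy)
downsampleFastq = true
maxNumReads = 10000
//alternative strategy: cluster reads at lower identity
//clusteringIdentity = 0.9
}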
I also suggest setting up Nextflow Tower. You just need to log in at the website with your GitHub credentials and create a token. Next, edit lines 64-68 of the metontiime2.conf script to set enabled = true and add your access token:
tower {
enabled = true
endpoint = '-'
accessToken = 'insert your token here'
}
After that, you will just need to log in at the Nextflow Tower website, click on "Runs" and select your username. By running the whole pipeline with a subset of the reads/samples, you will be able to track the amount of RAM used by the assignTaxonomy process and extrapolate the required amount of RAM vs. the number of reads.
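As a back-of-the-envelope illustration of that extrapolation (a sketch only, assuming roughly linear scaling of RAM with read depth; the two measurements below are placeholders, not real numbers, so substitute the values you read off the Tower "Runs" page):
//hypothetical linear extrapolation of peak RAM vs read depth (Groovy)
def reads1 = 10000; def ramGb1 = 20.0   //placeholder: peak RAM at 10k reads
def reads2 = 20000; def ramGb2 = 45.0   //placeholder: peak RAM at 20k reads
def slope = (ramGb2 - ramGb1) / (reads2 - reads1)
def predictGb = { n -> ramGb1 + slope * (n - reads1) }
println "Estimated peak RAM at 50000 reads/sample: ${predictGb(50000)} GB"
Best, SM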
Hi SM,
Thanks so much for your reply. I tried re-running with 11 samples; unfortunately, it still ran out of memory, and I received the same error 137.
At this stage, I am trying two parallel runs: 1) with just 2 samples; 2) with clusteringIdentity = 0.9. I will keep you updated.
Thanks!
Kind regards, Timothy
Hi SM,
Just an update on my "running with just 2 samples" attempt: I have managed to complete the run successfully (for the first time, yay! It also proves there is nothing wrong with the configuration). However, the memory usage for 2 samples is just way too demanding (these two samples contain 50,000 reads each). Since I have roughly 44 Nanopore-sequenced samples, may I know whether it is more sustainable to reduce the number of reads (if yes, is 10,000 the optimal number, balancing retained data against required resources?) or to perform clustering at lower identity, rather than running in smaller batches and combining them later on? Running in smaller batches sounds super time-consuming to me. Happy to hear your thoughts. Thanks!
Update: the 44-sample run with clusteringIdentity = 0.9 completed successfully with much lower resource usage, and the assignTaxonomy step consumed only 157 GB (much smaller than expected)! See the screenshot below:
Therefore, just wondering, from your point of view: is it better to reduce the number of reads or to perform clustering at lower identity if I am aiming to retain most of the data? Thanks!
Kind regards, Timothy
Dear Timothy, I would just try to analyse a couple of samples (say 2) with: 1) the full set of reads; 2) downsampling (maxNumReads = 10000); 3) clustering at lower identity (clusteringIdentity = 0.9).
I would then take a decision based on which of the two latter options looks more similar to the full-dataset analysis. For this aim, you may use the genus- or species-level counts and evaluate their pairwise correlation; see the sketch below. If you do not have time for this additional analysis, I would personally go for the downsampling strategy.
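A minimal sketch of such a check (plain Groovy, not part of MetONTiime; it assumes you have exported the genus-level count tables as two-column TSV files, and the file names are placeholders):
//compareCounts.groovy: Pearson correlation between two genus-level
//count tables (taxon<TAB>count per line; '#' lines are headers)
def loadCounts = { String path ->
  def counts = [:]
  new File(path).eachLine { line ->
    if (line.startsWith('#')) return      //skip header lines
    def cols = line.split('\t')
    counts[cols[0]] = cols[1].toDouble()
  }
  counts
}
def full = loadCounts('full_genus_counts.tsv')          //full-depth run
def test = loadCounts('downsampled_genus_counts.tsv')   //downsampled run
//paired vectors over the union of taxa (absent taxon counts as 0)
def taxa = (full.keySet() + test.keySet()) as List
def x = taxa.collect { full[it] ?: 0d }
def y = taxa.collect { test[it] ?: 0d }
def mx = x.sum() / x.size()
def my = y.sum() / y.size()
def cov = [x, y].transpose().collect { a, b -> (a - mx) * (b - my) }.sum()
def sx = Math.sqrt(x.collect { (it - mx) * (it - mx) }.sum())
def sy = Math.sqrt(y.collect { (it - my) * (it - my) }.sum())
println "Pearson r = ${cov / (sx * sy)}"
The option whose counts correlate best with the full-depth run is the one I would keep. Best, SM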
Hi SM,
Thanks so much for the suggestion. I will have a look and get back to you with my findings ;)
Kind regards, Timothy
Hi SM,
I had a look at the comparison between downSampling and clusteringIdentity. Just looking at the taxonomy bar plots, the downSampling assigned percentage is close to the original dataset; for clusteringIdentity, the feature table looked very different, with the unassigned proportion higher than the assigned proportion (the opposite of the original). Hence, I agree with you that the downSampling strategy is better than clusteringIdentity!
Thanks a lot for the help!
Kind regards, Timothy
Thank you, I'm trying downsampling now. Do you know if there is a low-memory phase followed by a high-memory phase? It's currently running, so perhaps I should wait until it's done to ask this question, but I'm seeing high CPU usage yet hardly any memory usage. --Ilana
I'd expect some kind of incremental memory usage, but I'm not sure about that. Best, SM
Is there any way to have it check only a subset of the reads in each cluster when estimating % agreement with the consensus taxonomy (line 28)? That seems to be the only reason this step is so dependent on the sampling depth. I essentially want to be able to wait and downsample after the derepSeq step. Do you think a) this is possible and b) this would be a reasonable approach? -Ilana
I am not sure I understood what you would like to do. If you want to "downsample" after the derepSeq step, in practice you need to decrease the minimum clustering identity. In this way, less similar reads will also be grouped into the same cluster, giving rise to fewer representative sequences, which are then aligned to the database in the following process. SM
Hello,
Happy New Year, and I hope you are well!
I was trying to run MetONTiime on my university's HPC with Singularity, on 44 samples (using the SILVA 138 database for 16S rRNA). Everything ran well until it reached the assignTaxonomy step, where I encountered error exit status 137 (which I assume is related to memory requirements). I used the Vsearch classifier. Please see the following error:
I have tried the following options:
However, none of them solved the issue.
May I know if you are able to help with this? Please let me know if you need the Nextflow configuration file or any other information from me.
Thanks in advance!
Kind regards, Timothy