MaestSi / MetONTIIME

A Meta-barcoding pipeline for analysing ONT data in QIIME2 framework
GNU General Public License v3.0

Vsearch Dereplicate #46

Closed: timyaro closed this issue 2 years ago

timyaro commented 2 years ago

Hello,

Sorry if this question is naive.

The VSEARCH dereplication step of metontiime.sh is taking a really long time (16 hours so far). Roughly how long should I expect it to take with the specs listed below?

Let me know if you have any suggestions or thoughts!

MaestSi commented 2 years ago

Hi timyaro, I can't give a precise estimate, but that is quite a big amount of reads (roughly 5M), so I expect this process might take a couple of days to complete. Unfortunately, pipelines based on single-read alignment are quite slow. From the "htop" command you may be able to see where temporary files are being stored, and use that file to count the number of reads (1 read -> 1 row) that have already been processed. SM
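As a minimal sketch of that monitoring idea (the temporary file path below is hypothetical; substitute whatever path htop reports for the vsearch process):

```bash
# Hypothetical path: replace with the temporary file reported by htop
TMP_FILE=/tmp/vsearch_tmp_example.txt

# One processed read per row, so the line count tracks progress
wc -l "$TMP_FILE"

# Optionally watch the count grow every 60 seconds
watch -n 60 wc -l "$TMP_FILE"
```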

MaestSi commented 2 years ago

Hi, I am closing this issue due to inactivity. In case you have any further issues, please reopen it. SM

timyaro commented 2 years ago

Hello @MaestSi . I appreciate the reply! I have reached a new obstacle.

I am currently using Ubuntu, and cluster-features-de-novo stopped with an error (it stopped by itself and I have no idea why). According to the tmp file, 2.8 million of my 3.46 million reads had been processed, and the contents are still in the tmp folder (this processing took 9 days). Is there any way for it to pick up where it left off, or do I have to restart the entire process?

I'm almost tempted to scale this up to an AWS instance with 96 vCPUs and an absurd amount of memory. Would that reduce the run time to a couple of hours instead of days? I'd rather not resort to this because of the cost.

MaestSi commented 2 years ago

Hi @timyaro, I fear there is no easy way to resume it from where it crashed, as far as I know. The error may be due to insufficient RAM; indeed, 16 GB is quite a low amount. My advice would be to re-run the analysis after randomly sampling 10k-30k reads per sample first (see the sketch below). In parallel, you may try running it on AWS, but I am not confident the reduction in run time would be that drastic. SM
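One way to do that random sampling (this is not part of MetONTIIME itself; seqtk and the file names below are assumptions for illustration) is to subsample each demultiplexed FASTQ before launching the pipeline:

```bash
# Hypothetical input file pattern; adjust to your demultiplexed reads
# seqtk sample draws a random subset of reads; -s fixes the seed for reproducibility
for fq in sample_*.fastq.gz; do
    seqtk sample -s 42 "$fq" 30000 | gzip > "subsampled_${fq}"
done
```

The subsampled files can then be used as input for a quick first pass, and the full dataset re-run later on a machine with more RAM if needed.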