Closed · BirgitRijvers closed this issue 1 year ago
Hi Birgit, thanks for your kind words.
In my opinion, if you have enough RAM available (and if your 60 samples do not have too many reads), it would be easier to analyse them all at once. My suggestion would be to take advantage of Nextflow's resource-monitoring capabilities, either using Nextflow Tower or adding -with-report metontiime_report.html to the command line. In the Resources section of the report, you will find RAM usage split by process. You may consider starting with the small toy dataset, then analysing a subset of your samples (e.g. start with 10 samples) and, if everything works, scaling up to the full dataset. Since, for a given database and classifier, the number of reads is the main factor impacting RAM usage, you may be able to find a relationship between the number of reads and memory usage, and hence the maximum number of reads that can be analysed on your machine in one shot (and, accordingly, the number of samples in your project). If you do any tests, please keep me informed; I'm interested in this too, but I have never had the time to evaluate it properly.
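A back-of-the-envelope version of that reads-vs-RAM estimate could look like the following Python sketch. The measurements are entirely made up for illustration; in practice you would read the peak RAM of the heaviest process from the Resources section of a few -with-report runs.

```python
# Hypothetical (total reads, peak RAM in GB) pairs collected from the
# Resources section of successive Nextflow reports -- illustrative only.
measurements = [
    (50_000, 4.1),
    (100_000, 7.9),
    (200_000, 15.8),
]

# Ordinary least-squares fit of RAM_GB ~ slope * reads + intercept.
n = len(measurements)
sx = sum(r for r, _ in measurements)
sy = sum(g for _, g in measurements)
sxx = sum(r * r for r, _ in measurements)
sxy = sum(r * g for r, g in measurements)

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # GB per read
intercept = (sy - slope * sx) / n                  # fixed overhead in GB

# Solve the fitted line for the read count that fits a given RAM budget.
ram_budget_gb = 20
max_reads = (ram_budget_gb - intercept) / slope
print(f"~{max_reads:,.0f} reads fit in {ram_budget_gb} GB")
```

With a handful of such data points from small runs, the fitted line gives a rough upper bound on how many samples can go into one batch on a given machine.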
Best,
SM
Hi Simone,
Analyzing all samples at once would be ideal, but most of my samples have over 200K reads. I already let the pipeline perform random downsampling, but yesterday I tried to run the pipeline on a batch of 5 samples and it took over 10 hours with Blast as the classifier. I will look into optimizing the number of CPUs and the memory for each process, so hopefully I can reduce the runtime. If not, I'll try VSEARCH as the classifier.
If I perform some experiments and find out more about the relationship between the number of reads and memory/CPU usage I will keep you posted on the results!
For now I'm monitoring what is running with htop, but I'm wondering if there is another way to monitor the progress of the pipeline?
Thank you! Birgit
Hi, yes, the way to do it is to set up Nextflow Tower. You just need to log in at the website with your GitHub credentials and create a token. Next, edit lines 62-66 of the metontiime2.conf script to set enabled = true and add your access token.
tower {
    enabled = true
    endpoint = '-'
    accessToken = 'insert your token here'
}
After that, you will just need to log in at the Nextflow Tower website, click on "Runs" and select your username. Let me know if you succeed in setting it up. SM
Hi Simone,
Thank you! I didn't know it was so easy to set up, but now I'm monitoring my most recent run with Nextflow Tower. This will be very useful in finding the best CPU and memory settings for a large batch of samples.
Best, Birgit
Going out for a drink with friends will never be the same again, thanks to real-time monitoring directly on your smartphone! Are you sure you are ready to ruin your social life in this way? :) SM
What social life? :wink:
Hi Simone,
I hope you're doing well. I've performed a few MetONTIIME runs, and I've noticed a massive time difference between using VSEARCH and BLAST as classifiers: VSEARCH finishes in about an hour, but BLAST takes roughly 16 hours with the same settings on the same samples.
I changed the default CPU and memory settings; during both runs, each process was allowed to use up to 16 CPUs and 20 GB of RAM.
VSEARCH is fast, but its results don't quite fit my project needs (checked with spike-ins and a sample of known composition). On the other hand, BLAST is too slow, and I'm on a deadline.
Do you have any ideas on how to make a run with BLAST go faster? If not, I'm afraid I have to let MetONTIIME go for my current project, but I'll definitely come back to test out some things in my free time, because I really like the pipeline!
Best, Birgit
Hi, I know Blast is a bit more accurate but much slower, unfortunately. This is because, in the QIIME2 implementation, Blast does not use an indexed database. I can understand your choice. Best, SM
Dear @BirgitRijvers, I just updated MetONTIIME (v2.1.0) so that it is based on QIIME2 v2023.9, which allows classify-consensus-blast multithreading. It is still slower than Vsearch given the same number of threads, but much faster than the single-threaded version.
P.s.: to reduce running time, you may also consider setting --clusteringIdentity to a value lower than 1 (e.g. 0.9). In this way, reads sharing high alignment identity are clustered together and only one representative sequence per cluster is aligned to the database.
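As a toy illustration of why this shrinks the workload (this is not VSEARCH's actual clustering algorithm, just a greedy sketch with a naive identity metric): reads within the identity threshold of an existing representative join that cluster, so fewer representative sequences reach the classifier.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy metric; assumes ungapped,
    roughly equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(reads, threshold):
    """Greedy clustering: each read joins the first representative it
    matches at >= threshold, otherwise it becomes a new representative."""
    representatives = []
    for read in reads:
        if not any(identity(read, rep) >= threshold for rep in representatives):
            representatives.append(read)
    return representatives

reads = [
    "ACGTACGTAC",
    "ACGTACGTAA",  # 1 mismatch vs. the first read (90% identity)
    "ACGTACGTAC",  # exact duplicate
    "TTTTGGGGCC",  # unrelated sequence
]

print(len(greedy_cluster(reads, 1.0)))  # only exact duplicates collapse
print(len(greedy_cluster(reads, 0.9)))  # near-identical reads collapse too
```

At identity 1 only the exact duplicate is absorbed (3 representatives); at 0.9 the near-identical read is absorbed as well (2 representatives), so fewer database alignments are needed, at the cost of some resolution.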
Best,
SM
Hi Simone,
Thank you, that sounds interesting! I'll look into it next week to see how much time Blast classification now takes compared to VSEARCH.
I will also play around with --clusteringIdentity to see how much effect it has on the classification results and runtime.
Enjoy your weekend, Birgit
Dear Simone,
I updated the pipeline with git pull, but now when I run the pipeline with Blast as classifier I get an error:
Error executing process > 'assignTaxonomy (1)'
Caused by:
Process `assignTaxonomy (1)` terminated with an error exit status (2)
Command executed:
mkdir -p /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy
classifier_uc=$(awk '{print toupper($0)'} <<< Blast)
if [ "$classifier_uc" == "BLAST" ]; then
qiime feature-classifier makeblastdb --i-sequences /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_sequences.qza --o-database /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/blastIndexedDb.qza
qiime feature-classifier classify-consensus-blast --i-query /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/rep-seqs.qza --i-blastdb /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/blastIndexedDb.qza --i-reference-taxonomy /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_taxonomy.qza --p-num-threads 20 --p-perc-identity 0.9 --p-query-cov 0.8 --p-maxaccepts 3 --p-min-consensus 0.7 --o-classification /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza --o-search-results /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/search_results.qza
elif [ "$classifier_uc" == "VSEARCH" ]; then
qiime feature-classifier classify-consensus-vsearch --i-query /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/rep-seqs.qza --i-reference-reads /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_sequences.qza --i-reference-taxonomy /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_taxonomy.qza --p-perc-identity 0.9 --p-query-cov 0.8 --p-maxaccepts 100 --p-maxrejects 100 --p-maxhits 3 --p-min-consensus 0.7 --p-strand 'both' --p-unassignable-label 'Unassigned' --p-threads 20 --o-classification /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza --o-search-results /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/search_results.qza
else
echo "Classifier Blast is not supported (choose between Blast and Vsearch)"
fi
qiime metadata tabulate --m-input-file /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza --o-visualization /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qzv
qiime taxa filter-table --i-table /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/table.qza --i-taxonomy /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza --p-exclude Unassigned --o-filtered-table /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/table-no-Unassigned.qza
Command exit status:
2
Command output:
(empty)
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Error: QIIME 2 plugin 'feature-classifier' has no action 'makeblastdb'.
Any ideas on what causes this error and how I can fix it?
Best, Birgit
Hi, you should also pull the updated Docker/Singularity image from DockerHub. To do that, delete the image you have in the cache, either by removing the img file or by doing something like docker rmi <image tag>.
Best,
SM
Hi Simone,
Thank you!
A run with the updated pipeline and Blast as classifier on 5 samples now took 2 hours and 17 minutes instead of 16 hours!
maxNumReads was set to 50000 and clusteringIdentity was still set to 1. Again, each process was allowed to use up to 16 CPUs and 20 GB of RAM.
The updated version that supports multithreading for Blast is definitely faster, so I will be testing out your pipeline some more on my samples 😄.
Thanks again, Birgit
Hi Simone,
Thank you for developing MetONTIIME and actively extending support to users. This can be rare to see in the bioinformatics community, so your effort is truly appreciated!
After a successful run on the provided demo data, I now want to use MetONTIIME on my own dataset comprising approximately 60 samples. I noticed your suggestion in this comment to split large batches into smaller ones, run MetONTIIME on each, and then combine the outputs with QIIME2 commands. I'm working on a Python script to automate MetONTIIME runs, so I can choose between running the samples in batches or one at a time. I'm particularly keen on exploring QIIME2's diversity analysis features, so the output data has to be merged after the pipeline completes for all samples.
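For the merging step, a sketch of the kind of helper such a driver script might contain. It only assembles the QIIME2 merge invocations (qiime feature-table merge, merge-seqs and merge-taxa); the batch directory names are hypothetical, and the per-batch file layout assumes the derepSeq/ and assignTaxonomy/ output folders seen in the logs above. Each command list would then be passed to subprocess.run() once the per-batch runs have finished.

```python
def build_merge_commands(batch_dirs, out_dir):
    """Return the three QIIME2 commands that merge per-batch feature
    tables, representative sequences, and taxonomy assignments."""
    tables = [f"{d}/derepSeq/table.qza" for d in batch_dirs]
    seqs = [f"{d}/derepSeq/rep-seqs.qza" for d in batch_dirs]
    taxa = [f"{d}/assignTaxonomy/taxonomy.qza" for d in batch_dirs]
    return [
        ["qiime", "feature-table", "merge",
         *[arg for t in tables for arg in ("--i-tables", t)],
         "--o-merged-table", f"{out_dir}/merged_table.qza"],
        ["qiime", "feature-table", "merge-seqs",
         *[arg for s in seqs for arg in ("--i-data", s)],
         "--o-merged-data", f"{out_dir}/merged_rep-seqs.qza"],
        ["qiime", "feature-table", "merge-taxa",
         *[arg for t in taxa for arg in ("--i-data", t)],
         "--o-merged-data", f"{out_dir}/merged_taxonomy.qza"],
    ]

# Hypothetical batch directories; print the commands for inspection.
for cmd in build_merge_commands(["batch1", "batch2"], "merged"):
    print(" ".join(cmd))
```

The merged table, sequences and taxonomy can then feed QIIME2's downstream diversity analyses as if everything had been run in one batch.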
Do you think it's still best to split the samples into batches, or would running them one at a time be better or more efficient in my case?
Many thanks, Birgit