MaestSi / MetONTIIME

A Meta-barcoding pipeline for analysing ONT data in QIIME2 framework
GNU General Public License v3.0

Large dataset advice #77

Closed BirgitRijvers closed 1 year ago

BirgitRijvers commented 1 year ago

Hi Simone,

Thank you for developing MetONTIIME and actively supporting its users. That is rare to see in the bioinformatics community, so your effort is truly appreciated!

After a successful run on the provided demo data, I now want to use MetONTIIME on my own dataset of approximately 60 samples. I noticed your suggestion in this comment to split large batches into smaller ones, run MetONTIIME on each, and then combine the outputs with QIIME2 commands. I'm working on a Python script to automate MetONTIIME runs, so I can choose between running the samples in batches or one at a time. I'm particularly keen on exploring QIIME2's diversity analysis features, so the output data has to be merged after the pipeline completes for all samples (a minimal sketch of what I have in mind is below).
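A sketch of the merging step, assuming each batch run produces its own table.qza and rep-seqs.qza (the batch directory names here are hypothetical):

    # merge per-batch feature tables into one (hypothetical batch1/ and batch2/ output dirs)
    qiime feature-table merge \
        --i-tables batch1/table.qza batch2/table.qza \
        --o-merged-table merged_table.qza

    # merge the corresponding representative sequences
    qiime feature-table merge-seqs \
        --i-data batch1/rep-seqs.qza batch2/rep-seqs.qza \
        --o-merged-data merged_rep-seqs.qza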

Do you think it's still best to split the samples into batches, or would running them one at a time be better or more efficient in my case?

Many thanks, Birgit

MaestSi commented 1 year ago

Hi Birgit, thanks for your kind words. In my opinion, if you have enough RAM available (and if your 60 samples do not have too many reads), it would be easier to analyse them all at once.

My suggestion would be to take advantage of Nextflow's resource monitoring capabilities, either using Nextflow Tower or adding -with-report metontiime_report.html to the command line. In the Resources section of the report you will find RAM usage split by process.

You may consider starting with the small toy dataset, then analysing a subset of your samples (e.g. start with 10 samples) and, if everything works, scaling up to the full dataset. Since, for a given database and classifier, the number of reads is the main factor impacting RAM usage, you may be able to find a relationship between the number of reads and memory usage, and hence the maximum number of reads (and, accordingly, the number of samples in your project) that can be analysed on your machine in one shot.

If you do any tests, please keep me informed; I'm interested in this too, but never had the time to evaluate it properly. Best, SM
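For example, a launch command with the report enabled might look like this (a sketch; metontiime2.nf as the main script name and the docker profile are assumptions, adapt to your actual command line):

    # the usual launch command, plus Nextflow's built-in resource report
    nextflow -c metontiime2.conf run metontiime2.nf \
        -profile docker \
        -with-report metontiime_report.html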

BirgitRijvers commented 1 year ago

Hi Simone,

Analyzing all samples at once would be ideal, but most of my samples have over 200K reads. I already let the pipeline perform random downsampling, but yesterday I tried to run the pipeline on a batch of 5 samples and it took over 10 hours with Blast as classifier. I will look into optimizing the number of CPUs and the memory for each process, so hopefully I can reduce the runtime. If not, I'll try out VSEARCH as classifier.

If I perform some experiments and find out more about the relationship between the number of reads and memory/CPU usage I will keep you posted on the results!

For now I'm monitoring what is running with htop, but I'm wondering if there is another way to monitor the progress of the pipeline?

Thank you! Birgit

MaestSi commented 1 year ago

Hi, yes, the way to do it is to set up Nextflow Tower. You just need to log in to the website with your GitHub credentials and create a token. Next, edit lines 62-66 of the metontiime2.conf file to set enabled = true and add your access token:

tower {
    enabled = true
    endpoint = '-'
    accessToken = 'insert your token here'
}

After that, you will just need to log in to the Nextflow Tower website, click on "Runs" and select your username. Let me know if you succeed in setting it up. SM

BirgitRijvers commented 1 year ago

Hi Simone,

Thank you! I didn't know it was so easy to set up, but now I'm monitoring my most recent run with Nextflow Tower. This will be very useful for finding the best CPU and memory parameters for a large batch of samples.

Best, Birgit

MaestSi commented 1 year ago

Going out for a drink with friends will never be the same again, thanks to real-time monitoring directly on your smartphone! Are you sure you are ready to ruin your social life in this way? :) SM

BirgitRijvers commented 1 year ago

What social life? :wink:

BirgitRijvers commented 1 year ago

Hi Simone,

I hope you're doing well. I've performed a few MetONTIIME runs, and I've noticed a massive time difference between using VSEARCH and BLAST as classifiers. VSEARCH finishes in about an hour, but BLAST takes roughly 16 hours with the same settings on the same samples.

I changed the default CPU and memory settings: during both runs, each process was allowed to use up to 16 CPUs and 20 GB of RAM.

VSEARCH is fast, but its results don't quite fit my project's needs (checked with spike-ins and a sample of known composition). On the other hand, BLAST is too slow, and I'm on a deadline.

Do you have any ideas on how to make a run with BLAST go quicker? If not, I'm afraid I'll have to let MetONTIIME go for my current project, but I'll definitely come back to test things out in my free time, because I really like the pipeline!

Best, Birgit

MaestSi commented 1 year ago

Hi, I know Blast is a bit more accurate but much slower, unfortunately. This is because, in the QIIME2 implementation, Blast does not use an indexed database. I understand your choice. Best, SM

MaestSi commented 1 year ago

Dear @BirgitRijvers, I just updated MetONTIIME (v2.1.0) so that it is based on QIIME2 v2023.9, which allows classify-consensus-blast multithreading. It is still slower than Vsearch given the same number of threads, but much faster than the single-threaded version.

P.S.: to reduce running time, you may also consider setting --clusteringIdentity to a value lower than 1 (e.g. 0.9). In this way, reads sharing high alignment identity are clustered together, and only one representative sequence per cluster is aligned to the database. Best, SM
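For example (a sketch; as above, the script name and profile are assumptions, adapt to your usual command line):

    # cluster reads at 90% identity so only one representative per cluster is aligned
    nextflow -c metontiime2.conf run metontiime2.nf \
        -profile docker \
        --clusteringIdentity 0.9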

BirgitRijvers commented 1 year ago

Hi Simone,

Thank you, that sounds interesting! I'll look into it next week to see how much time Blast classification now takes compared to VSEARCH.

I will also play around with --clusteringIdentity to see how much effect it has on the classification results and runtime.

Enjoy your weekend, Birgit

BirgitRijvers commented 12 months ago

Dear Simone,

I updated the pipeline with git pull, but now when I run the pipeline with Blast as classifier I get an error:

Error executing process > 'assignTaxonomy (1)'

Caused by:
  Process `assignTaxonomy (1)` terminated with an error exit status (2)

Command executed:

  mkdir -p /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy

    classifier_uc=$(awk '{print toupper($0)'} <<< Blast)

    if [ "$classifier_uc" == "BLAST" ]; then
        qiime feature-classifier makeblastdb \
            --i-sequences /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_sequences.qza \
            --o-database /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/blastIndexedDb.qza

        qiime feature-classifier classify-consensus-blast \
            --i-query /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/rep-seqs.qza \
            --i-blastdb /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/blastIndexedDb.qza \
            --i-reference-taxonomy /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_taxonomy.qza \
            --p-num-threads 20 \
            --p-perc-identity 0.9 \
            --p-query-cov 0.8 \
            --p-maxaccepts 3 \
            --p-min-consensus 0.7 \
            --o-classification /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza \
            --o-search-results /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/search_results.qza
    elif [ "$classifier_uc" == "VSEARCH" ]; then
        qiime feature-classifier classify-consensus-vsearch \
            --i-query /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/rep-seqs.qza \
            --i-reference-reads /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_sequences.qza \
            --i-reference-taxonomy /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/importDb/db_taxonomy.qza \
            --p-perc-identity 0.9 \
            --p-query-cov 0.8 \
            --p-maxaccepts 100 \
            --p-maxrejects 100 \
            --p-maxhits 3 \
            --p-min-consensus 0.7 \
            --p-strand 'both' \
            --p-unassignable-label 'Unassigned' \
            --p-threads 20 \
            --o-classification /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza \
            --o-search-results /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/search_results.qza
    else
        echo "Classifier Blast is not supported (choose between Blast and Vsearch)"
    fi

  qiime metadata tabulate \
      --m-input-file /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza \
      --o-visualization /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qzv

    qiime taxa filter-table \
        --i-table /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/table.qza \
        --i-taxonomy /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/assignTaxonomy/taxonomy.qza \
        --p-exclude Unassigned \
        --o-filtered-table /mnt/TeacherFiles/research/Birgit/MetONTIIME/2324-018/01-05_blast_50k_multi_2/derepSeq/table-no-Unassigned.qza

Command exit status:
  2

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Error: QIIME 2 plugin 'feature-classifier' has no action 'makeblastdb'.

Any ideas on what causes this error and how I can fix it?

Best, Birgit

MaestSi commented 12 months ago

Hi, you should also pull the updated Docker/Singularity image from DockerHub. To do that, delete the image you have in the cache, either by removing the image file or by running something like docker rmi <image tag>. Best, SM
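A sketch of the clean-up, assuming the Docker image tag is maestsi/metontiime (check docker images for the exact tag on your system; for Singularity, delete the cached .img/.sif file instead):

    # find the cached MetONTIIME image
    docker images | grep -i metontiime
    # remove it; the next pipeline run will pull the updated image from DockerHub
    docker rmi maestsi/metontiime:latest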

BirgitRijvers commented 12 months ago

Hi Simone,

Thank you!

A run with the updated pipeline and Blast as classifier on 5 samples now took 2 hours and 17 minutes instead of 16 hours! maxNumReads was set to 50000 and clusteringIdentity was still set to 1. Again, each process was allowed to use up to 16 CPUs and 20 GB of RAM.

The updated version that supports multithreading for Blast is definitely faster, so I will be testing out your pipeline some more on my samples 😄.

Thanks again, Birgit