NorwegianVeterinaryInstitute / Talos

A shotgun metagenomic analysis pipeline using nextflow
BSD 3-Clause "New" or "Revised" License

Talos produces too much data. Check all processes and see how to reduce the data output. #37

Closed by Thomieh73 4 years ago

Thomieh73 commented 4 years ago

Running Talos on a full air-sample dataset again produced too much data on nn9305k, which caused our quota to be blocked. This has to be fixed ASAP.

Thomieh73 commented 4 years ago

There are a few things I am going to solve here.

  1. Change the Nonpareil settings to: X = 100000, n = 2048.
  2. Change the CPU usage for the jobs with the label medium from 4 to 8.
  3. Remove publishing of intermediate results for the following processes, so that I only keep the final clean dataset:
    • run_trim
    • run_low_complex
    • remove Phix
  4. Also remove the output fastq files from Kraken2. I only need to keep the .report and .out files. With the latter I can always identify which reads were classified or unclassified. Removing the files:
    • .unclassified..fastq.gz
    • .classified..fastq.gz

The *.out file will be compressed with gzip to reduce its size.

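As a rough illustration of why the .out file is enough to recover the classified and unclassified reads: the standard Kraken2 output is tab-separated, with "C" or "U" in the first column and the read ID in the second. The sketch below is not part of Talos; the function name `split_read_ids` and the file name in the usage note are hypothetical, and it assumes the .out file was written in the default Kraken2 output format and then gzipped.

```python
import gzip


def split_read_ids(kraken_out_path):
    """Return (classified_ids, unclassified_ids) from a gzipped Kraken2 .out file.

    Assumes the default Kraken2 output format: tab-separated columns where
    column 1 is "C" (classified) or "U" (unclassified) and column 2 is the read ID.
    """
    classified, unclassified = [], []
    with gzip.open(kraken_out_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            status, read_id = fields[0], fields[1]
            if status == "C":
                classified.append(read_id)
            elif status == "U":
                unclassified.append(read_id)
    return classified, unclassified
```

Calling `split_read_ids("sample.out.gz")` (a hypothetical file name) would give two lists of read IDs, which could then be used to pull the matching reads out of the final clean fastq files if they are ever needed.
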
Thomieh73 commented 4 years ago

The steps above are done:

  1. Changed the Nonpareil settings.
  2. Changed the CPUs used for medium and small jobs. Both were doubled, to 8 and 2 respectively.
  3. Removed the publishing directories and renamed the directories that are now created.
  4. Removed the fastq output files from the Kraken run.

In addition, I noticed that the average genome size calculation takes a long time with normal datasets. That is because it uses all the reads to identify genes matching the proteins in the database. I have now set it so that MicrobeCensus only samples 5 million reads from each dataset. That reduces the run time of this step.

I have also increased the minimum time for each normal job from 1 hr to 2 hrs. That should reduce the number of jobs being killed because of the time limit. Restarts will then use 4 and then 8 hrs.
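The restart schedule described here (2, 4, then 8 hrs) looks like a doubling of the time limit on each attempt. The sketch below only illustrates that arithmetic; it is not the Talos configuration, and the function name `time_limit_hours` and the doubling rule itself are assumptions inferred from the three values given. In Nextflow, this kind of scaling is typically expressed with a dynamic directive based on `task.attempt`.

```python
def time_limit_hours(attempt, base_hours=2):
    """Time limit in hours for a given attempt (1 = first run).

    Assumes the limit doubles on every restart, which matches the
    2 -> 4 -> 8 hour sequence described in the comment above.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return base_hours * 2 ** (attempt - 1)


# First run and two restarts
print([time_limit_hours(a) for a in (1, 2, 3)])  # [2, 4, 8]
```
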

Thomieh73 commented 4 years ago

This is solved.