Talos produces too much data. Check all processes and see how to reduce the data output.

Thomieh73 commented 4 years ago

Running Talos on a full air-sample dataset again caused an overload of data on nn9305k, which in turn caused our quota to be blocked. This has to be fixed ASAP.

Thomieh73 commented 4 years ago

There are a few things I am going to solve here.

1. Change the Nonpareil settings to: X = 100000, n = 2048.
2. Change the CPU usage for the jobs with the label medium from 4 to 8.
3. Stop publishing the intermediate output of the following processes, so that only the final clean dataset is kept:
   - run_trim
   - run_low_complex
   - remove Phix
4. Remove the output fastq files from Kraken2. I only need to keep the .report and .out files. With the latter I can always identify which reads were classified and which were unclassified. The files to remove are:
   - .unclassified..fastq.gz
   - .classified..fastq.gz

   The *.out file will be compressed with gzip to reduce its size.
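
To illustrate why the fastq outputs from Kraken2 are not needed, here is a minimal Python sketch, not part of Talos itself. It assumes the standard tab-separated Kraken2 output format, where the first column is `C` (classified) or `U` (unclassified) and the second column is the read ID. The function names and the example file name are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def compress_kraken_out(out_path):
    """Gzip a Kraken2 .out file and delete the uncompressed original."""
    out_path = Path(out_path)
    gz_path = out_path.with_name(out_path.name + ".gz")
    with open(out_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    out_path.unlink()
    return gz_path


def split_read_ids(gz_path):
    """Return (classified, unclassified) read IDs from a gzipped .out file."""
    classified, unclassified = [], []
    with gzip.open(gz_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            status, read_id = fields[0], fields[1]
            if status == "C":
                classified.append(read_id)
            elif status == "U":
                unclassified.append(read_id)
    return classified, unclassified
```

With the read IDs in hand, the classified or unclassified reads can be pulled from the final clean fastq files whenever they are needed, e.g. `classified, unclassified = split_read_ids("sample.kraken2.out.gz")`.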

Thomieh73 commented 4 years ago

The steps above are now done:

1. Changed the Nonpareil settings.
2. Changed the CPUs used for medium and small jobs. Both were doubled, to 8 and 2 respectively.
3. Removed the publishing directories and renamed the directories that are now created.
4. Removed the fastq output files from the Kraken2 run.

In addition, I noticed that the average genome size calculation takes a long time with normal datasets. That is because it uses all the reads to identify genes matching the proteins in the database. I have now set it so that MicrobeCensus only samples 5 million reads from each dataset, which reduces the run time of this step.

I have also increased the minimum time for each normal job from 1 hr to 2 hrs. That should reduce the number of jobs being killed due to the time limit. Restarts will then use 4 and then 8 hrs.
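
The time limits above (2, 4, then 8 hrs) follow a doubling pattern per attempt. A minimal Python sketch of that arithmetic is shown here, only as an illustration; it is not the actual pipeline configuration, and the function name and parameters are hypothetical.

```python
def time_limit_hours(attempt, base_hours=2):
    """Time limit for a job attempt, doubling on every restart.

    attempt is 1 for the first run, 2 for the first restart, and so on.
    """
    if attempt < 1:
        raise ValueError("attempt must be 1 or higher")
    return base_hours * 2 ** (attempt - 1)


# First run and two restarts
limits = [time_limit_hours(a) for a in (1, 2, 3)]
```

Note that simply multiplying the base time by the attempt number would give 2, 4 and 6 hrs instead, so the doubling has to be written out explicitly.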

Thomieh73 commented 4 years ago

This is solved.

NorwegianVeterinaryInstitute / Talos

Talos produces too much data. Check all processes and see how to reduce the data output. #37