ctmrbio / BACTpipe

BACTpipe: An assembly and annotation pipeline for bacterial genomics
https://bactpipe.readthedocs.org
MIT License

Limit MultiQC search options #145

Closed thorellk closed 3 years ago

thorellk commented 3 years ago

We had an incident where the job executing the MultiQC process timed out and I therefore have a question. MultiQC is currently executed like this:

multiqc . --filename multiqc_report.html

This should mean that MultiQC searches the whole file tree below the directory where it is executed, but maybe this is overridden by the fact that the module input is only the path('*.json') and path('*_prokka') files? If that is the case, maybe we should limit it even further, to path('*_prokka/*txt') files, to make it run as fast as possible? Please let me know if my interpretation of how it is executed is correct.

abhi18av commented 3 years ago

Hi @thorellk ,

I think there must be some options that allow us to tweak this behavior: https://multiqc.info/docs/

If there aren't, it would be really helpful if you could list the files you'd like MultiQC to use (such as prokka/*txt etc.) in a file tree format like

- prokka
----1.txt
----2.txt

-xyz.json

Please also attach a sample MultiQC report so that I can incorporate the changes and make sure that we are still generating the information we need.

thorellk commented 3 years ago

For each sample, the files for MultiQC are: fastp/${pair_id}.fastp.json and prokka/${pair_id}_prokka/${pair_id}.txt

So basically the total input files would be covered by fastp/* and prokka/*/*.txt
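A rough sketch of how that layout could be checked per sample before handing it to MultiQC. The sample IDs and the temporary directory are made up for illustration and are not part of BACTpipe; only the two path patterns come from the description above.

```shell
set -eu

# Stand-in for the real results directory (hypothetical, for illustration only).
results_dir="$(mktemp -d)"

# Build a mock tree with two made-up sample IDs.
for pair_id in sampleA sampleB; do
    mkdir -p "$results_dir/fastp" "$results_dir/prokka/${pair_id}_prokka"
    touch "$results_dir/fastp/${pair_id}.fastp.json"
    touch "$results_dir/prokka/${pair_id}_prokka/${pair_id}.txt"
done

# Check that every sample has both files MultiQC is expected to read.
missing=0
for pair_id in sampleA sampleB; do
    for f in "fastp/${pair_id}.fastp.json" "prokka/${pair_id}_prokka/${pair_id}.txt"; do
        if [ ! -f "$results_dir/$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
done
echo "missing files: $missing"
```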

abhi18av commented 3 years ago

Cool, let me raise a PR for it soon :)

boulund commented 3 years ago

@thorellk You are correct that MultiQC executed as multiqc . will search the directory tree for files matching MultiQC's predefined search patterns to determine which files to parse and include in the report. However, just like you mentioned, since we have configured it so that the MULTIQC process only takes the fastp reports and the prokka outputs, there shouldn't be much else in the process work directory than what we want it to find (https://github.com/ctmrbio/BACTpipe/blob/master/bactpipe.nf#L90).

How many samples did you run to make it hit the time limit?

boulund commented 3 years ago

My guess (without any checking) is that it is slow because MultiQC reads through all files in the prokka output folders to see if any information in them matches. It seems we are unlucky: pretty much all files in these directories are large, but still small enough not to be ignored by MultiQC (by default it ignores files of 5MB or more, and on the test set I just ran about 88% of the files are smaller than 5MB).

But it should be easy to fix, just like you said: https://multiqc.info/docs/#choosing-where-to-scan
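One way to check a guess like this is to count how many files in the prokka folders fall under the size limit. This is a minimal sketch, assuming that 5MB means 5 × 1024 × 1024 bytes (the exact threshold MultiQC applies may differ) and using a mock directory with made-up file contents so that it is self-contained:

```shell
set -eu

# Mock prokka output folder (illustrative only): two small files, one 6 MiB file.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/10_prokka"
printf 'small\n' > "$demo_dir/10_prokka/it162.txt"
printf 'small\n' > "$demo_dir/10_prokka/it162.tsv"
dd if=/dev/zero of="$demo_dir/10_prokka/it162.gbk" bs=1048576 count=6 2>/dev/null

# Assumed limit in bytes; the "c" suffix makes find compare exact byte counts.
limit_bytes=$((5 * 1024 * 1024))

total=$(find "$demo_dir" -type f | wc -l | tr -d ' ')
below=$(find "$demo_dir" -type f -size "-${limit_bytes}c" | wc -l | tr -d ' ')
echo "$below of $total files are below the limit"
```

Pointing demo_dir at a real MultiQC work directory instead of the mock one would give the actual proportion for a run.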

This is the content of the work directory for the multiqc process for one of my test runs:

(base) [boulund@c1hitachi10 e092a9b00c8550f42ab15d59a71ab1]$ ls
10_prokka  14_prokka  1_prokka  5_prokka  9_prokka             12.json  16.json  3.json  7.json
11_prokka  15_prokka  2_prokka  6_prokka  multiqc_report_data  13.json  17.json  4.json  8.json
12_prokka  16_prokka  3_prokka  7_prokka  10.json              14.json  1.json   5.json  9.json
13_prokka  17_prokka  4_prokka  8_prokka  11.json              15.json  2.json   6.json  multiqc_report.html

(base) [boulund@c1hitachi10 e092a9b00c8550f42ab15d59a71ab1]$ ls 10_prokka
it162.err  it162.ffn  it162.fsa  it162.gff  it162.sqn  it162.tsv
it162.faa  it162.fna  it162.gbk  it162.log  it162.tbl  it162.txt

Thus:

multiqc *.json */*.txt

should minimize the risk of timeouts (and perhaps we should raise the requested time a bit).
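To see what those globs would hand to MultiQC, here is a small dry-run sketch. It recreates a cut-down version of the work directory listing above with mock, empty files (the file names are invented), and only prints the command instead of running it. The --filename option is taken from the original invocation.

```shell
set -eu

# Mock work directory mirroring the layout above (illustrative only).
work_dir="$(mktemp -d)"
cd "$work_dir"
for i in 1 2 3; do
    mkdir "${i}_prokka"
    touch "${i}.json" \
          "${i}_prokka/it${i}.txt" \
          "${i}_prokka/it${i}.gff" \
          "${i}_prokka/it${i}.log"
done

# Let the shell expand the globs; only *.json and */*.txt are selected,
# so the .gff and .log files are never passed to MultiQC.
set -- *.json */*.txt
n_inputs=$#

# Dry run: print the command rather than executing it.
echo "multiqc $* --filename multiqc_report.html"
echo "number of input paths: $n_inputs"
```

With three mock samples this passes six paths (three JSON reports and three prokka .txt summaries) instead of letting MultiQC walk every file under the current directory.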

boulund commented 3 years ago

I'm testing these changes right now. Will make a PR to develop as soon as I've verified that it works.

abhi18av commented 3 years ago

This issue is resolved by https://github.com/ctmrbio/BACTpipe/pull/158