Status: Closed (thorellk closed 3 years ago)
Hi @thorellk ,
I think there must be some options that allow us to tweak this behavior; see https://multiqc.info/docs/
If there aren't, it would be really helpful if you could list the files you'd like MultiQC to use (such as prokka/*txt etc.) in a file tree format like
- prokka
----1.txt
----2.txt
-xyz.json
Please also attach a sample MultiQC report so that we can incorporate the changes and make sure that we are still generating the information we need.
For each sample, the files for MultiQC are fastp/${pair_id}.fastp.json and prokka/${pair_id}_prokka/${pair_id}.txt
So basically all the input files would be covered by fastp/* and prokka/*/*.txt
Cool, let me raise a PR for it soon :)
@thorellk You are correct that MultiQC executed like multiqc . will look for files matching its predefined search patterns to determine which files to parse and include in the report. However, just like you mentioned, since we have configured the MULTIQC process to take only the fastp reports and the prokka outputs as input, there shouldn't be much in the process work directory other than what we want it to find (https://github.com/ctmrbio/BACTpipe/blob/master/bactpipe.nf#L90).
How many samples did you run to make it hit the time limit?
My guess (without any checking) is that it is slow because MultiQC reads through all the files in the prokka output folders to see whether anything in them matches. It seems we are unlucky: pretty much all files in these directories are large, yet small enough not to be ignored by MultiQC (by default it ignores files of 5MB+, and on the test set I just ran about 88% of the files are smaller than 5MB).
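For anyone who wants to check the same thing on their own run, here is a rough Python sketch of how the share of files below the size limit could be counted. It is not part of BACTpipe or MultiQC: the function name is made up for illustration, and the 5MB constant is just the figure quoted above, not something read from MultiQC's configuration.

```python
import tempfile
from pathlib import Path

# 5 MB, the default skip threshold quoted in the comment above (an assumption,
# not verified against MultiQC's configuration).
SIZE_LIMIT_BYTES = 5 * 1000 * 1000


def count_below_limit(root, limit=SIZE_LIMIT_BYTES):
    """Return (files_below_limit, total_files) for all regular files under root."""
    sizes = [p.stat().st_size for p in Path(root).rglob("*") if p.is_file()]
    return sum(size < limit for size in sizes), len(sizes)


# Tiny demonstration on a throwaway directory, using a 10-byte limit so that
# no large files have to be written.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / "10_prokka"
    sample_dir.mkdir()
    (sample_dir / "it162.txt").write_text("short")   # 5 bytes, below the limit
    (sample_dir / "it162.gbk").write_text("x" * 50)  # 50 bytes, above the limit
    below, total = count_below_limit(tmp, limit=10)
    print(f"{below} of {total} files are below the limit")
```

Pointing count_below_limit at the real work directory of the MULTIQC process (with the default limit) should give a number comparable to the 88% mentioned above.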
But it should be easy to fix, just like you said: https://multiqc.info/docs/#choosing-where-to-scan
This is the content of the work directory for the multiqc process for one of my test runs:
(base) [boulund@c1hitachi10 e092a9b00c8550f42ab15d59a71ab1]$ ls
10_prokka 14_prokka 1_prokka 5_prokka 9_prokka 12.json 16.json 3.json 7.json
11_prokka 15_prokka 2_prokka 6_prokka multiqc_report_data 13.json 17.json 4.json 8.json
12_prokka 16_prokka 3_prokka 7_prokka 10.json 14.json 1.json 5.json 9.json
13_prokka 17_prokka 4_prokka 8_prokka 11.json 15.json 2.json 6.json multiqc_report.html
(base) [boulund@c1hitachi10 e092a9b00c8550f42ab15d59a71ab1]$ ls 10_prokka
it162.err it162.ffn it162.fsa it162.gff it162.sqn it162.tsv
it162.faa it162.fna it162.gbk it162.log it162.tbl it162.txt
Thus:
multiqc *.json */*.txt
should minimize the risk of timeouts (and perhaps we should raise the requested time a bit)
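To illustrate how much the two glob patterns narrow things down, here is a small Python sketch that rebuilds a mock version of the work directory layout listed above and compares the full file tree with what *.json and */*.txt select. It only mimics the shell globbing with pathlib and does not run MultiQC; the helper names and the sampleN file names are hypothetical.

```python
import tempfile
from pathlib import Path

# File extensions seen in the prokka output listing above.
PROKKA_EXTENSIONS = ["err", "faa", "ffn", "fna", "fsa", "gbk",
                     "gff", "log", "sqn", "tbl", "tsv", "txt"]


def build_mock_workdir(root, n_samples):
    """Create empty files mimicking the layout of the MultiQC work directory."""
    root = Path(root)
    for i in range(1, n_samples + 1):
        (root / f"{i}.json").touch()          # stands in for a fastp report
        prokka_dir = root / f"{i}_prokka"
        prokka_dir.mkdir()
        for ext in PROKKA_EXTENSIONS:
            (prokka_dir / f"sample{i}.{ext}").touch()


def files_selected(root, patterns=("*.json", "*/*.txt")):
    """Return the sorted relative paths of files matched by the glob patterns."""
    root = Path(root)
    matched = {
        p.relative_to(root).as_posix()
        for pattern in patterns
        for p in root.glob(pattern)
        if p.is_file()
    }
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    build_mock_workdir(tmp, n_samples=3)
    everything = [p for p in Path(tmp).rglob("*") if p.is_file()]
    selected = files_selected(tmp)
    print(f"files below '.': {len(everything)}")
    print(f"files matched by *.json */*.txt: {len(selected)}")
```

With twelve prokka output files per sample, only two files per sample (the JSON report and the prokka .txt summary) are handed over, so the number of files to look through grows much more slowly with the number of samples.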
I'm testing these changes right now. Will make a PR to develop as soon as I've verified that it works.
This issue is resolved by https://github.com/ctmrbio/BACTpipe/pull/158
We had an incident where the job executing the MultiQC process timed out and I therefore have a question. MultiQC is currently executed like this:
multiqc . --filename multiqc_report.html
This should mean that MultiQC searches the whole file tree below where the script is executed, but maybe this is overridden by the fact that the module input is only the path('*.json') and path('*_prokka') files? If this is the case, maybe we should limit it even further to path('*_prokka/*txt') files to make it run as fast as possible? Please let me know if I am correct in my interpretation of how it's executed.