MultiQC / MultiQC

Aggregate results from bioinformatics analyses across many samples into a single report.
http://multiqc.info
GNU General Public License v3.0

MultiQC missing paired end reads in FastQC module #1809

Closed oligomyeggo closed 1 year ago

oligomyeggo commented 1 year ago

Description of bug

Hello! I am new to using MultiQC, but it's great and I am incorporating it into a Snakemake pipeline I am working on for paired-end bulk RNA-seq data. It is working well for modules where the R1 and R2 files are combined into a single sample file (such as with STAR outputs). However, for the FastQC module, I cannot figure out how to get the R1 and R2 results for each sample to be displayed separately; as of now, MultiQC is only displaying the R1 files in the FastQC section of the report. I have tried messing around with different parameters in a multiqc_config.yaml file, but nothing has worked so far. I am sure this is very trivial, but I just can't figure it out.

From the log, it looks like MultiQC is only finding 3 FastQC reports, but there are six. My MultiQC rule looks like this:

rule multiqc:
    input:
        expand(
            "results/qc/fastqc/{sample}_{read}_fastqc.zip",
            sample=SAMPLES,
            read=["R1", "R2"],
        ),
        expand(
            "logs/star/{sample}.Log.final.out",
            sample=SAMPLES,
        ),
    output:
        html="results/qc/multiqc/multiqc_report.html",
    params:
        extra="-f --config 'config/multiqc_config.yaml'",
    log:
        "logs/multiqc/multiqc.log",
    message:
        "MultiQC analysis..."
    threads: 1
    wrapper:
        "v1.19.2/bio/multiqc"

And my multiqc_config.yaml file looks like this:

title: "RNA-Seq Pipeline QC Reports"

top_modules:
  - "fastqc"
  - "star"

I would expect MultiQC to find the following:

FastQC:

Star:

However, in the MultiQC report I am only getting the following:

FastQC:

Star:

How can I get the R2 files to also show up in the MultiQC report?

File that triggers the error

No response

MultiQC Error log

/// MultiQC 🔍 | v1.13

|           multiqc | Search path : /io2-2/shared/rna-seq_pipeline_test/logs/star
|           multiqc | Search path : /io2-2/shared/rna-seq_pipeline_test/results/qc/fastqc
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 94/94  
|            fastqc | Found 3 reports
|              star | Found 3 reports
|           multiqc | Compressing plot data
|           multiqc | Deleting    : results/qc/multiqc/multiqc_report_data   (-f was specified)
|           multiqc | Report      : results/qc/multiqc/multiqc_report.html
|           multiqc | Data        : results/qc/multiqc/multiqc_report_data
|           multiqc | MultiQC complete

ewels commented 1 year ago

Please check https://multiqc.info/docs/#not-enough-samples-found and also attach the verbose MultiQC log (multiqc_data/multiqc.log).

We need to get down to a minimal example to figure out what's wrong. So the next step is to remove Snakemake from the equation and try running MultiQC manually on as small a set of files as possible (2 FastQC reports that should give 2 samples but give only 1). If that's reproducible, please attach those files to this issue so that I can see it myself and have a play.

oligomyeggo commented 1 year ago

@ewels, thank you so much for your quick response! I will work on putting together a minimal example outside of Snakemake. In the meantime, I checked the multiqc_data/multiqc.log file and it looks like the issue is with duplicate sample names:

[2022-12-01 16:28:14,827] multiqc.modules.fastqc.fastqc                      [DEBUG  ]  Duplicate sample name found! Overwriting: Sample1_S1_L001_R1_001
[2022-12-01 16:28:14,855] multiqc.modules.fastqc.fastqc                      [DEBUG  ]  Duplicate sample name found! Overwriting: Sample2_S1_L001_R1_001

So I guess the R1s are somehow overwriting the R2s, which would relate to the section of the MultiQC documentation you shared?
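For anyone checking their own run, here is a minimal sketch that pulls the overwritten names out of a verbose log. It assumes the DEBUG message looks exactly like the two lines quoted above (MultiQC v1.13); `find_overwritten_samples` is a hypothetical helper name, not part of MultiQC.

```python
import re
from collections import Counter

# Pattern copied from the DEBUG lines quoted above; other MultiQC versions
# may word this message differently (assumption).
OVERWRITE_RE = re.compile(r"Duplicate sample name found! Overwriting: (.+?)\s*$")


def find_overwritten_samples(log_text):
    """Count how often each sample name was reported as overwritten."""
    counts = Counter()
    for line in log_text.splitlines():
        match = OVERWRITE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Reading multiqc_data/multiqc.log into a string and passing it to the function gives one entry per clashing name, which makes it easy to see whether the number of overwrites matches the number of missing reports.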

ewels commented 1 year ago

Possible, but FastQC is a bit weird in that MultiQC can parse both the zip files and the raw data, and I think it does both by default. So that could be a red herring.

I imagine it comes down to something in the way that Snakemake is presenting the input files to MultiQC. I'm a @nextflow-io person rather than a Snakemake one, so I'm not much help on that front, sorry.

oligomyeggo commented 1 year ago

I got the same error when just grabbing two of my samples and running MultiQC outside of my Snakemake pipeline. I've included a zipped directory with: 1) the FastQC data (the .html and .zip files for R1 and R2 for two samples from a toy data set; the same toy data set that caused the initial issue in the Snakemake pipeline), 2) the MultiQC output.

The command used was:

multiqc /home/cwinkler/multi_qc

The resulting log is:

[2022-12-01 19:22:05,345] multiqc                                            [DEBUG  ]  This is MultiQC v1.13
[2022-12-01 19:22:05,345] multiqc                                            [DEBUG  ]  Command used: /opt/anaconda3/envs/RNAseq/bin/multiqc /home/cwinkler/multi_qc -d -s
[2022-12-01 19:22:05,811] multiqc                                            [DEBUG  ]  Latest MultiQC version is v1.13
[2022-12-01 19:22:05,811] multiqc                                            [INFO   ]  Not cleaning sample names
[2022-12-01 19:22:05,811] multiqc                                            [DEBUG  ]  Working dir : /home/cwinkler/multi_qc
[2022-12-01 19:22:05,811] multiqc                                            [DEBUG  ]  Template    : default
[2022-12-01 19:22:05,811] multiqc                                            [DEBUG  ]  Running Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
[2022-12-01 19:22:05,811] multiqc                                            [INFO   ]  Prepending directory to sample names
[2022-12-01 19:22:05,811] multiqc                                            [DEBUG  ]  Analysing modules: custom_content, ccs, ngsderive, purple, conpair, lima, peddy, somalier, methylQA, mosdepth, phantompeakqualtools, qualimap, preseq, quast, qorts, rna_seqc, rockhopper, rsem, rseqc, busco, bustools, goleft_indexcov, gffcompare, disambiguate, supernova, deeptools, sargasso, verifybamid, mirtrace, happy, mirtop, sambamba, homer, hops, macs2, theta2, snpeff, gatk, htseq, bcftools, featureCounts, fgbio, dragen, dedup, pbmarkdup, damageprofiler, biobambam2, jcvi, mtnucratio, picard, vep, sentieon, prokka, qc3C, nanostat, samblaster, samtools, sexdeterrmine, eigenstratdatabasetools, bamtools, jellyfish, vcftools, longranger, stacks, varscan2, snippy, bbmap, bismark, biscuit, hicexplorer, hicup, hicpro, salmon, kallisto, slamdunk, star, hisat2, tophat, bowtie2, bowtie1, snpsplit, odgi, pangolin, kat, leehom, adapterRemoval, clipandmerge, cutadapt, flexbar, kaiju, kraken, malt, trimmomatic, sickle, skewer, sortmerna, biobloomtools, fastq_screen, afterqc, fastp, fastqc, pychopper, pycoqc, minionqc, multivcfanalyzer, clusterflow, checkqc, bcl2fastq, bclconvert, interop, ivar, flash, seqyclean, optitype, whatshap
[2022-12-01 19:22:05,812] multiqc                                            [DEBUG  ]  Using temporary directory for creating report: /tmp/tmp11o0szlq
[2022-12-01 19:22:05,916] multiqc                                            [INFO   ]  Search path : /home/cwinkler/multi_qc
[2022-12-01 19:22:06,153] multiqc                                            [DEBUG  ]  Summary of files that were skipped by the search: [skipped_module_specific_max_filesize: 44] // [skipped_no_match: 4]
[2022-12-01 19:22:06,389] multiqc.plots.bargraph                             [DEBUG  ]  Using matplotlib version 3.6.2
[2022-12-01 19:22:06,390] multiqc.plots.linegraph                            [DEBUG  ]  Using matplotlib version 3.6.2
[2022-12-01 19:22:06,390] multiqc                                            [DEBUG  ]  No samples found: custom_content
[2022-12-01 19:22:06,445] multiqc.modules.fastqc.fastqc                      [DEBUG  ]  Duplicate sample name found! Overwriting: home | cwinkler | multi_qc | fastqc_data | SRR17984708_S1_L001_R1_001.fastq.gz
[2022-12-01 19:22:06,469] multiqc.modules.fastqc.fastqc                      [DEBUG  ]  Duplicate sample name found! Overwriting: home | cwinkler | multi_qc | fastqc_data | SRR17984709_S1_L001_R1_001.fastq.gz
[2022-12-01 19:22:06,492] multiqc.modules.fastqc.fastqc                      [INFO   ]  Found 2 reports
[2022-12-01 19:22:06,521] multiqc                                            [INFO   ]  Compressing plot data
[2022-12-01 19:22:06,541] multiqc                                            [INFO   ]  Report      : multiqc_report.html
[2022-12-01 19:22:06,541] multiqc                                            [INFO   ]  Data        : multiqc_data
[2022-12-01 19:22:06,542] multiqc                                            [DEBUG  ]  Moving data file from '/tmp/tmp11o0szlq/multiqc_data' to '/home/cwinkler/multi_qc/multiqc_data'
[2022-12-01 19:22:06,594] multiqc                                            [INFO   ]  MultiQC complete

I tried running the same basic MultiQC command with the -s and -d flags (that run is the one logged above), but it still only found 2 reports (versus 4).

multi_qc.zip

ewels commented 1 year ago

Thanks @oligomyeggo - I just took a look at your example data. It looks like your _R1 and _R2 FastQC reports were both generated from the same input FastQ file:

SRR17984708_R1_fastqc.html

image

SRR17984708_R2_fastqc.html

image

(comparable for the other sample).

MultiQC is taking the sample name from this input filename, rather than the FastQC report filename, which is why it's being overwritten. I guess that you took the single end toy dataset and copied + renamed the reports to make it look like paired end or something? If you copy the FastQ files and run FastQC 4 times instead of just copying the reports, then run MultiQC, it should work 😄
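A quick way to check this on any set of reports is to read the recorded input filename out of each zip and see which reports share one. This is only a sketch: it assumes each `*_fastqc.zip` contains a `fastqc_data.txt` with a tab-separated `Filename` row in its Basic Statistics section, and the function names are made up for illustration.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def embedded_filename(report_zip):
    """Return the 'Filename' value from fastqc_data.txt inside a FastQC zip."""
    with zipfile.ZipFile(report_zip) as archive:
        for member in archive.namelist():
            if member.endswith("fastqc_data.txt"):
                with archive.open(member) as handle:
                    for raw in handle:
                        fields = raw.decode("utf-8", "replace").rstrip("\r\n").split("\t")
                        if fields[0] == "Filename" and len(fields) > 1:
                            return fields[1]
    return None  # no fastqc_data.txt or no Filename row found


def find_collisions(report_dir):
    """Map each embedded FastQ name to the report zips that share it."""
    by_input = defaultdict(list)
    for report_zip in sorted(Path(report_dir).rglob("*_fastqc.zip")):
        by_input[embedded_filename(report_zip)].append(report_zip.name)
    # Keep only the input names claimed by more than one report.
    return {name: zips for name, zips in by_input.items() if len(zips) > 1}
```

Any entry returned by `find_collisions` is a pair (or more) of reports that would end up under one sample name by default, which usually points at a problem upstream of MultiQC.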

Regarding why your flags didn't work:

If you put the reports into different subdirectories, then you get a report with 4 samples:

$ tree fastqc_data
fastqc_data
├── R1
│   ├── SRR17984708_R1_fastqc.html
│   ├── SRR17984708_R1_fastqc.zip
│   ├── SRR17984709_R1_fastqc.html
│   └── SRR17984709_R1_fastqc.zip
└── R2
    ├── SRR17984708_R2_fastqc.html
    ├── SRR17984708_R2_fastqc.zip
    ├── SRR17984709_R2_fastqc.html
    └── SRR17984709_R2_fastqc.zip

$ multiqc fastqc_data -d

  /// MultiQC 🔍 | v1.14.dev0 (45c5408)

|           multiqc | Prepending directory to sample names
|           multiqc | Search path : /Users/ewels/Downloads/multi_qc/fastqc_data
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 8/8
|            fastqc | Found 4 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete

ewels commented 1 year ago

Ah and I just remembered - there is an option to use the filename as the sample name, not the input filename found within the file contents: --fn_as_s_name (see docs).

I just checked, and this confirms that it works as you'd expect:

$ multiqc fastqc_data --fn_as_s_name

  /// MultiQC 🔍 | v1.14.dev0 (45c5408)

|           multiqc | Using log filenames for sample names
|           multiqc | Search path : /Users/ewels/Downloads/multi_qc/fastqc_data
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 8/8
|            fastqc | Found 4 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete
image
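To make the difference between these options concrete, here is a toy model of a store keyed by sample name. It is not MultiQC's actual code: the strategy labels are made up, and the real name cleaning and full-path prefixing are left out. The embedded FastQ names are the ones shown in the debug log earlier in this thread.

```python
from pathlib import PurePosixPath

# (report path relative to the search dir, FastQ name recorded inside the report)
REPORTS = [
    ("R1/SRR17984708_R1_fastqc.zip", "SRR17984708_S1_L001_R1_001.fastq.gz"),
    ("R2/SRR17984708_R2_fastqc.zip", "SRR17984708_S1_L001_R1_001.fastq.gz"),
    ("R1/SRR17984709_R1_fastqc.zip", "SRR17984709_S1_L001_R1_001.fastq.gz"),
    ("R2/SRR17984709_R2_fastqc.zip", "SRR17984709_S1_L001_R1_001.fastq.gz"),
]


def sample_names(reports, strategy):
    """Return the sample names that survive in a store keyed by name."""
    store = {}
    for path, embedded in reports:
        p = PurePosixPath(path)
        if strategy == "embedded":    # default: name taken from the report contents
            key = embedded
        elif strategy == "dirs":      # roughly what -d does: prepend directories
            key = " | ".join(list(p.parent.parts) + [embedded])
        elif strategy == "filename":  # roughly what --fn_as_s_name does
            key = p.name
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        store[key] = path  # a later report with the same key replaces the earlier one
    return sorted(store)


# Same reports, but all in one directory, as in the original upload.
FLAT = [(PurePosixPath(path).name, embedded) for path, embedded in REPORTS]
```

In this model `sample_names(FLAT, "dirs")` still collapses to two names because every report gets the same directory prefix, while `sample_names(REPORTS, "dirs")` and `sample_names(FLAT, "filename")` each keep all four.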

ewels commented 1 year ago

Ok, I think I'll close this issue now as I'm pretty sure that we figured out what was going wrong and we have some good solutions. Let me know how you get on, if you're still having problems we can always reopen it.

oligomyeggo commented 1 year ago

Hi @ewels, thank you so much! That was an embarrassing oversight on my part; I do in fact have actual paired-end data. After some digging, it turns out that I had my FastQC Snakemake rule set up incorrectly: it was running the R1 files twice but saving the output as R1 and R2. I have fixed that, and now MultiQC is working as expected. Thank you so much for your time and your thorough explanations of some of the flag options - I really appreciate it!

ewels commented 1 year ago

Hooray! Glad you got it working 😀