MultiQC / MultiQC

Aggregate results from bioinformatics analyses across many samples into a single report.
http://multiqc.info
GNU General Public License v3.0
1.21k stars 599 forks source link

AdapterRemoval output confusing vs FLASH #1005

Closed olgabot closed 2 years ago

olgabot commented 5 years ago

Description of bug:

We tested using AdapterRemoval and FLASH to merge reads for a CRISPR analysis pipeline and generated MultiQC outputs for both. It was confusing to see them side-by-side as AdapterRemoval looked like it had 2x the reads, but really I think some of the numbers weren't getting subtracted from one another.

AdapterRemoval output

Screen Shot 2019-08-15 at 4 54 17 AM

It seems that AdapterRemoval is showing total reads when it should be showing:

FLASH output (on same samples)

Screen Shot 2019-08-15 at 4 54 12 AM

File that triggers the error: Please drag and drop (and upload to the GitHub issue) an input file that I can use to replicate the error.

Here are some log files:

adapterremoval_logs.zip flash_logs.zip

MultiQC run details (please complete the following):

ewels commented 5 years ago

Ok, so if I understand you correctly, we should just not plot Retained read pairs? As this is already a sum of the other categories?

olgabot commented 5 years ago

It could be, however the plot is indicating number of "Full-length collapsed pairs" and "uncollapsed pairs," and "singleton R{1,2}" are all separate. Here's some log files run on the same sample for reference.

FLASH output

Here's the critical output:

[FLASH] Read combination statistics:
[FLASH]     Total pairs:      67277
[FLASH]     Combined pairs:   66658
[FLASH]         Innie pairs:   65235 (97.87% of combined)
[FLASH]         Outie pairs:   1423 (2.13% of combined)
[FLASH]     Uncombined pairs: 619
[FLASH]     Percent combined: 99.08%

Total Pairs > Combined pairs

67 277 > 66 658

Full output:

``` Sun 18 Aug - 12:07  ~   grep FLASH /Users/olgabot/Downloads/mNGplate3/crispresso2_vanilla/cripresso/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001/CRISPResso_RUNNING_LOG.txt Running FLASH command: flash mNGplate3_sorted_A1_POLR1A-C_S97_R1_001.fastq.gz mNGplate3_sorted_A1_POLR1A-C_S97_R2_001.fastq.gz --min-overlap 10 --max-overlap 100 --allow-outies -z -d /tmp/nxf.BJPjRHsyZR/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001 -o out >>/tmp/nxf.BJPjRHsyZR/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001/CRISPResso_RUNNING_LOG.txt 2>&1 [FLASH] Starting FLASH v1.2.11 [FLASH] Fast Length Adjustment of SHort reads [FLASH] [FLASH] Input files: [FLASH] mNGplate3_sorted_A1_POLR1A-C_S97_R1_001.fastq.gz [FLASH] mNGplate3_sorted_A1_POLR1A-C_S97_R2_001.fastq.gz [FLASH] [FLASH] Output files: [FLASH] /tmp/nxf.BJPjRHsyZR/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001/out.extendedFrags.fastq.gz [FLASH] /tmp/nxf.BJPjRHsyZR/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001/out.notCombined_1.fastq.gz [FLASH] /tmp/nxf.BJPjRHsyZR/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001/out.notCombined_2.fastq.gz [FLASH] /tmp/nxf.BJPjRHsyZR/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001/out.hist [FLASH] /tmp/nxf.BJPjRHsyZR/mNGplate3_sorted_A1_POLR1A-C/CRISPResso_on_mNGplate3_sorted_A1_POLR1A-C_S97_R1_001_mNGplate3_sorted_A1_POLR1A-C_S97_R2_001/out.histogram [FLASH] [FLASH] Parameters: [FLASH] Min overlap: 10 [FLASH] Max overlap: 100 [FLASH] Max mismatch density: 0.250000 [FLASH] Allow "outie" pairs: true [FLASH] Cap mismatch quals: false [FLASH] Combiner threads: 64 [FLASH] Input format: FASTQ, phred_offset=33 [FLASH] Output format: FASTQ, phred_offset=33, gzip [FLASH] [FLASH] Starting reader and writer threads [FLASH] Starting 64 combiner threads [FLASH] Processed 25000 read pairs [FLASH] Processed 50000 read pairs [FLASH] Processed 67277 read pairs [FLASH] [FLASH] Read combination statistics: [FLASH] Total pairs: 67277 [FLASH] Combined pairs: 66658 [FLASH] Innie pairs: 65235 (97.87% of combined) [FLASH] Outie pairs: 1423 (2.13% of combined) [FLASH] Uncombined pairs: 619 [FLASH] Percent combined: 99.08% [FLASH] [FLASH] Writing histogram files. [FLASH] WARNING: An unexpectedly high proportion of combined pairs (99.70%) increasing this parameter. (As-is, FLASH is penalizing overlaps longer than [FLASH] [FLASH] FLASH v1.2.11 complete! [FLASH] 1.611 seconds elapsed [FLASH] Finished with 1 warning (see above) ```

AdapterRemoval output

[Trimming statistics]
Total number of read pairs: 67277
Number of unaligned read pairs: 385
Number of well aligned read pairs: 66892
Number of discarded mate 1 reads: 0
Number of singleton mate 1 reads: 11
Number of discarded mate 2 reads: 11
Number of singleton mate 2 reads: 0
Number of reads with adapters[1]: 2878
Number of full-length collapsed pairs: 66849
Number of truncated collapsed pairs: 0
Number of retained reads: 67694
Number of retained nucleotides: 22884082
Average length of retained reads: 338.052

Number of retained reads > Total number of read pairs > Number of well aligned read pairs > Number of full-length collapsed pairs

67 694 > 67 277 > 66 892 > 66 849

``` Sun 18 Aug - 12:07  ~   head -n 45 /Users/olgabot/Downloads/mNGplate3/adapterremoval/adapterremoval/mNGplate3_sorted_A1_POLR1A-C.settings AdapterRemoval ver. 2.3.0 Trimming of paired-end reads [Adapter sequences] Adapter1[1]: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG Adapter2[1]: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT [Adapter trimming] RNG seed: 4242927527 Alignment shift value: 2 Global mismatch threshold: 0.333333 Quality format (input): Phred+33 Quality score max (input): 41 Quality format (output): Phred+33 Quality score max (output): 41 Mate-number separator (input): '/' Trimming 5p: 0 Trimming 3p: 0 Trimming Ns: Yes Trimming Phred scores <= 2: Yes Trimming using sliding windows: No Minimum genomic length: 15 Maximum genomic length: 4294967295 Collapse overlapping reads: Yes Minimum overlap (in case of collapse): 11 [Trimming statistics] Total number of read pairs: 67277 Number of unaligned read pairs: 385 Number of well aligned read pairs: 66892 Number of discarded mate 1 reads: 0 Number of singleton mate 1 reads: 11 Number of discarded mate 2 reads: 11 Number of singleton mate 2 reads: 0 Number of reads with adapters[1]: 2878 Number of full-length collapsed pairs: 66849 Number of truncated collapsed pairs: 0 Number of retained reads: 67694 Number of retained nucleotides: 22884082 Average length of retained reads: 338.052 ```

Seems like they match in their "Total number of read pairs" (67 277) and 2 significant figures for "full-length collapsed pairs" (66 658 for FLASH, 66 849 for AdapterRemoval), so I'd be inclined to think those are comparable. It doesn't seem like for this report, AdapterRemoval is treating the separate R1 and R2 of paired-end reads separately.

Anyway, I can make the PR to change this if this is making any sense.

ewels commented 5 years ago

if this is making any sense.

Not really, but I have a crying baby in the background which is making it difficult to think 😭

If you think you know how to fix it then, yeah - please put in a PR! That would be fab 😀

ewels commented 4 years ago

@alexanderscholz - do you have any feedback on this issue? You wrote the AdapterRemoval MultiQC module. I'll look into resolving this now but would be good to have your thoughts.

Phil

apeltzer commented 4 years ago

@jfy133 - we're using AdapterRemoval v2 in https://github.com/nf-core/eager too - what are your thoughts on this one?

jfy133 commented 4 years ago

@alexhbnr and I have already noticed that but Alex hasn't had time to fix and I still haven't learnt python ;).

We had decided the following: Retained reads should definitely be a separate plot. This is the only metric that matches the actual number of reads in the output file.

All other metrics can go together as they represent the actual processing of the reads themselves.

alexhbnr commented 4 years ago

I submitted the pull request (#1204) with my initial changes and am happy if someone wants to think about the implemented changes and improve them.

It's a poor man's fix since it doesn't pull out the retained reads in its own plot but instead treats them as reads that haven't been changed, i.e. no singletons or collapsed reads. This get rid of the problem that the length of the bar plots are twice as long as they should be but more or less sacrifices the retained reads metric.

ewels commented 4 years ago

Nice, thanks @alexhbnr! The plot definitely looks a lot better, I'm sure that it is a step in the right direction. However the comment from @olgabot makes me think that we can't combine many of these categories together in a single stacked bar plot.

This is similar to work I was doing a couple of days in the Cutadapt module for e7ed1052b207c2396e69438b1134a39ed7358afa - I parsed a tonne of fields and then realised that I couldn't plot most of them as the categories didn't stack properly.

Here I'm not even sure that the % reads trimmed in the General Statistics is really trustworthy, given that @olgabot's example has more retained reads (67694) than were inputted (67277)..

I also don't think that the % Trimmed Reads in this sense is actually very interesting. What I would think that people actually want to know is what percentage of their data was lost. For Cutadapt I did this as the % of bases removed.

Phil

ewels commented 4 years ago

@MikkelSchubert - hello! As author of AdapterRemoval, I was wondering if you could help us?

We currently have a MultiQC module for AdapterRemoval, but we think that the bar plot it produces is wrong / misleading in the way that it stacks categories - see an example report: multiqc_report.html

Are you able to shed some light for us on how the different categories reported by AdapterRemoval are related, and which we would be able to use to create a stacked bar plot representing the total read count?

If you have any views on what would be best reported in MultiQC, that would also be very helpful.

Many thanks in advance,

Phil

jfy133 commented 4 years ago

Hey Phil - just some clarifications (but I'm sure Mikkel will be better here), the inputted reads you've written above is read pairs, whereas the retained reads represent each unique entry in the FASTQ file i.e. will include merged reads, but also unmerged reads as a form of 'singletons', which is why you can have more (because 1 un-merged read pair turns in 1).


The way I would personally plot this would be as follows (based off of Olga's stats table), if I understand the statistics correctly:

[Trimming statistics]

Input

Maybe just for general stats for validation e.g. compared to FASTQC

Total number of read pairs: 67277

Merging

stacked bar; could also have percentage option

Number of unaligned read pairs: 385 Number of well aligned read pairs: 66892

[This basically reports the number of input reads pairs that overlap, prior 'collapsing'; if the input is paired end and 'collapsing' is selected by the user - therefore unaligned read pairs are those that did not overlap so the insert size is bigger than the sequencing cycles]

Quality filtering

stacked bar; percentage option wouldn't make sense here - see comment

Number of full-length collapsed pairs: 66849 (merged reads, no adapter so untrimmed) Number of truncated collapsed pairs: 0 (merged reads, adapter trimmed) Number of singleton mate 1 reads: 11 (unmerged forward reads) Number of singleton mate 2 reads: 0 (unmerged reverse reads) Number of discarded mate 1 reads: 0 (low-quality removed reads, e.g. length filter) Number of discarded mate 2 reads: 11 (low-quality removed reads, e.g. length filter)

[This would represent the categories that could be included as single entries in the 'Retained Reads category; however, this totals is 66860, meaning we are still missing some reads, something I've never been able to work out - only thing I can think of is that these are single reads without one of a pair in the input FASTQ file itself; but I'm not sure if that should really exist coming off demultiplexing ]

Output reads

barplot

Number of retained reads: 67694

alexhbnr commented 4 years ago

I agree with @jfy133, separate plots would make more sense. However, for the suggested Quality filtering, the six categories included into it (full-length collapsed, truncated collapsed, singleton mate 1 and 2, and discarded mate 1 and 2) do not sum up to the number of retained reads.

For the example above, retained reads (67694) > collapsed reads (66849) + singletons (22).

So it would make sense to get feedback from @MikkelSchubert how he actually calculates these.

ewels commented 4 years ago

Would could always calculate the difference and make a new plot category called "Mystery reads" 😆

MikkelSchubert commented 4 years ago

I apologize for the confusing output from AdapterRemoval; it is something I've been wanting to rework, but never got around to. It is not clear to me exactly what you want to show, so I'll just try to explain the output you are seeing:

As has been noted, the numbers below show either pairs or individual reads. You need to keep this in mind, since a pair accounts for two individual reads.

Total number of read pairs: 67277

The total number of input read pairs.

Number of unaligned read pairs: 385
Number of well aligned read pairs: 66892

Statistics about the alignment pair-wise alignments; the sum of these numbers correspond to the number of (input) reads.

Number of discarded mate 1 reads: 0
Number of singleton mate 1 reads: 11
Number of discarded mate 2 reads: 11
Number of singleton mate 2 reads: 0

Statistics for discarded reads and "singleton" reads; this gets a bit complicated, especially when you have merged reads. Firstly, a singleton read is a read in which the mate (but not the read itself) was discarded, meaning that a pair that was discarded increments both "discarded mate 1" and "discarded mate 2", while a pair with one discarded mate increments "discarded mate A" and "singleton mate B", where A != B. In addition, merged reads that are discarded are counted as if the pair was discarded (increments both "discarded mate" values).

Number of reads with adapters[1]: 2878

This number is the number of reads with adapter sequence that got trimmed; for normal paired reads (where the read lengths are identical) this will always be an even number.

Number of full-length collapsed pairs: 66849
Number of truncated collapsed pairs: 0

The number of merged reads, with "truncated" pairs being those merged reads that were trimmed following merging (due to low quality bases or Ns). Note that these numbers do not include discarded merged reads.

Number of retained reads: 67694
Number of retained nucleotides: 22884082

The total number of output reads (and how many nucleotides that amounts to); a simple way to calculate this is 2 input - sum(discarded) - sum(merged), since each input pair accounts for 2 reads, the discarded reads are excluded, and merging transforms two input reads into one output read; 2 67277 - (11 + 0) - (66849 + 0) = 67694.

As has been noted, this matches the sum of reads in the output files (excluding discarded reads), where a merged pair has been transformed into a single read.

ewels commented 4 years ago

Many thanks @MikkelSchubert - this is brilliant! Any chance that this text could be put in the AdapterRemoval docs somewhere (https://adapterremoval.readthedocs.io/en/latest/)? I'm happy to open a PR pasting your issue text if you'd like.

Need to do some head scratching now and figure out is the most important information / what people actually want to know in the MultiQC report. @olgabot - what were you interested in? I guess the number of reads that were collapsed?

Phil

jfy133 commented 4 years ago

In addition - I have a request from @aidaanva that the percentage of reads could also be calculated and reported specifically in the general stats table (as gives you an idea of how well your PE data has merged - this is a good indicator for ancient DNA)

I guess this would be Number of well aligned read pairs / Total number of read pairs, if I understand Mikkel correctly. @apeltzer is that how you calculated in ReportTable.

jfy133 commented 3 years ago

Number of reads with adapters[1]: 2878 This number is the number of reads with adapter sequence that got trimmed; for normal paired reads (where the read lengths are > identical) this will always be an even number.

Hi @MikkelSchubert just to clarify this, if we wanted to know the fraction of 'output reads' that had adapters removed (presumably prior quality trimming etc.), what would the denominator be here? For example, would it be: Number of reads with adapters[1]: / (Number of well aligned read pairs / sum(Number of singleton mate * reads))?

jfy133 commented 3 years ago

I did a bit of looking through Mikkel's description above, and have the following suggestion for table/figures

@apeltzer and @alexhbnr both provisionlly agreed that this made sense. Note that the same should work also for single end data as you just drop the read collapsing section.


General Stats

For general stats table I think we could have:

AdapterRemoval Section

Input reads (dot plot)

total (Total number of read pairs)

Adapter trimming (stacked bar chart)

Total: sum of both

Post trimming cleanup of read pairs prior collapsing

discarded_m1 (Number of discarded mate 1 reads) discarded_m2 (Number of singleton mate 1 reads) singleton_m1 (Number of discarded mate 2 reads) singleton_m2 ( Number of singleton mate 2 reads)

This one I'm really sure how to display

Read collapsing (stacked bar chart)

Total: total input pairs

Output from AdapterRemoval (dot plot)

apeltzer commented 3 years ago

@MikkelSchubert do you think this is sensible too? If yes, we will implement it in the PR that @alexhbnr already opened (#1204 ) and make sure this ends up in the next MultiQC release 👍🏻

MikkelSchubert commented 3 years ago

Hi @MikkelSchubert just to clarify this, if we wanted to know the fraction of 'output reads' that had adapters removed (presumably prior quality trimming etc.), what would the denominator be here? For example, would it be: Number of reads with adapters[1]: / (Number of well aligned read pairs / sum(Number of singleton mate * reads))?

Unfortunately it is not possible to calculate the percentage of output reads that had adapter removed, not in all situations. This is because the read alignment/adapter trimming step is performed first, and then merging and quality trimming/filtering is performed without tracking whether or not the read originally contained adapter sequences.

Note that a "well aligned" read may not contain adapter sequence, since well aligned just means that it aligned and passed the quality criteria for alignments, not that the alignment included adapter sequence. Well aligned also does not mean able to be merged, since there is a separate option that dictates the minimum amount of overlapping bases required to merge two aligned reads.

So you can only calculate a percentage in the situation where you don't merge reads and where you do want to count both discarded and retained reads. In other words, it is only possible to calculate the percentage of input reads with adapter sequences:

N_se = Total number of reads
N_pe = Total number of read pairs
A = sum(Number of reads with adapters[*])

Fraction_se = A / N_se
Fraction_pe = A / (2 * N_pe)

Note the distinction between reads and pairs, which is due to AR not assuming that read lengths are identical in a pair, meaning that either or both reads in a pair can have adapter sequences.

I'm back to looking at cleaning all of this up (and generating both JSON and human readable reports while I'm at it), so do let me know what kind of statistics you'd like to have and if there is anything else that'd make your lives easier.

jfy133 commented 3 years ago

Hi @MikkelSchubert ,

Ok thanks for clarifying. I think I still need to get my head around exactly how AR is working (I guess I don't fully understand what 'aligned' is referring to now).

For your last point: have you got an issue open on the AR github where we could talk about requests rather than polluting this issue? JSON would be fantastic though, as JSON is the preferred input format for MulitQC (with a certain formatting, but we can also try and modify your JSON to fit MultiQC ourselfs with a custom script etc.).

Going back to the issue at hand - OK, then maybe the number of output reads with adapters removed isn't so bad. For me working with aDNA it's sort of a rough quality check because I expect most of my reads to have adapters as the reads are so short. So your calculation described above could equally work.

New proposal then:


General Stats

For general stats table I think we could have:

AdapterRemoval Section

Input reads (dot plot)

Adapter trimming (stacked bar chart) [PE & SE]

Post trimming cleanup of read pairs prior collapsing [PE & SE]

This one I'm really sure how to display, maybe: one category being discarded, and second category of 'singletons'?

Read collapsing (stacked bar chart) [PE ONLY]

Total: total input pairs

Output from AdapterRemoval (dot plot)

retained

MikkelSchubert commented 3 years ago

Post trimming cleanup of read pairs prior collapsing [PE & SE]

As I noted above (also see below), cleanup takes place after the alignment/merging.

[remainder] (assumed would be discarded pairs?)

I'm not sure I understand this correctly, but non-collapsed pairs are not discarded by AdapterRemoval and it is currently not possible to see how many collapsed reads were discarded, so the most meaningful remainder would probably be all non-collapsed input pairs: remainder = pairs - collapsed - truncated_collapsed.

With regards to your justifiable confusion, the TL;DR is that "well aligned" is a statistic that in its current form was probably mostly useful to the original creator of AR. I plan on removing it in the reworked statistics.

With the following definitions

F = mate 1 (forward, in input orientation)
R = mate 2 (reverse, in input orientation)
A = Adapter 1 (user supplied orientation)
B = Adapter 2 (user supplied orientation)

Lower-case f, r, a, and b signifies the reverse complement of the above

The simple case is SE alignments where AR just does a pairwise alignment of the read (F) and and the adapter (A)

1) No overlap: Unaligned
    FFFFFFFFFF
                AAAAA
2) Aligned, but poor alignment: Unaligned
    FFFFFFFFFF
            AAAAA
3) Aligned, good alignment: Well aligned, adapter trimmed
    FFFFFFFFFF
           AAAAA

Only in case (3) is a read trimmed for adapters and counted as having been trimmed.

The situation is a bit more complex for PE reads, where reads and adapters are merged prior to alignment:

1) No overlap: Unaligned
    bbbbbFFFFFFFFFF
                     rrrrrrrrrrAAAAA
2) Aligned, but poor alignment: Unaligned
    bbbbbFFFFFFFFFF
                 rrrrrrrrrrAAAAA
3) Aligned, good alignment, but insert only aligns a little: Well aligned
    bbbbbFFFFFFFFFF
                 rrrrrrrrrrAAAAA
4) Aligned, good alignment, longer but not complete insert alignment: Well aligned, merged if enabled
    bbbbbFFFFFFFFFF
               rrrrrrrrrrAAAAA
5) As 4 and alignment includes adapter: Well aligned, adapter trimmed, merged if enabled
    bbbbbFFFFFFFFFF
      rrrrrrrrrrAAAAA

Whether an alignment falls into case (3) or (4) depends on the --minalignmentlength option, which defaults to requiring an overlap of at least 11bp.

Reads are trimmed of adapter sequence, merged if possible, and then trimming of low quality bases and filtering of reads based on quality criteria is performed.

jfy133 commented 3 years ago

@MikkelSchubert Aha, thank you for all your patience! I think I'm starting to understand it now.

[remainder] (assumed would be discarded pairs?)

What you suggested above is actually what I meant, so that's good to know.

And for the alignment, I see - so aligned doesn't have anything to do actually with the trimming itself, it's just a 'precurser' summary of the data prior processing (so to say?)


I think then for MultiQC, the following 'simplified' version would be maybe sufficient, where we will always just compare with input reads

So for General stats:

Category SE PE
Input reads Total number of reads Total number of read pairs * 2
% Trimmed input reads sum(Number of reads with adapters[*]) / Total number of reads sum(Number of reads with adapters[]) / (Total number of read pairs 2)
% Merged good-quality input read pairs NA sum(Number of * collapsed pairs) / Total number of read pairs
Number of singletons retained NA sum(Number of singleton mate * reads)
Retained reads Number of retained reads Number of retained reads

And then for the plots

image

alexhbnr commented 3 years ago

I agree with your suggestion, @jfy133 , but I would propose to drop plot 1 because the total number of in put reads is already contained in plot 2. Otherwise, it looks great!

fgvieira commented 3 years ago

I've recently also found this issue and made a PR (#1423), but then found out there was already a long discussion about it. :smile: Not sure if I can help much, but here goes my two cents.

I agree with the General Stats info that has been proposed so far. As for the plots, I'd keep all in number of reads (as it been proposed), and I've run an example sample:

Stats number of reads
Total number of read pairs: 83,894,072
Number of unaligned read pairs: 60,035,513
Number of well aligned read pairs: 23,858,559
Number of discarded mate 1 reads: 1,716,701
Number of singleton mate 1 reads: 55,102
Number of discarded mate 2 reads: 1,770,227
Number of singleton mate 2 reads: 1,576
Number of reads with adapters[1]: 5,612,302
Number of full-length collapsed pairs: 8,093,391
Number of truncated collapsed pairs: 1,457,443
Number of retained reads: 154,750,382
Number of retained nucleotides: 15,775,443,835
Average length of retained reads: 101.941
output files number of reads
reads.collapsed.gz 8,093,391
reads.collapsed.truncated.gz 1,457,443
reads.discarded.gz 1,779,706
reads.pair1.truncated.gz 72,571,435
reads.pair1.truncated.gz 72,571,435
reads.singleton.truncated.gz 56,678

According to these numbers, we get:

If so then I'd suggest:

There is however, one thing that does not quite add up. I'd expect that 2*well_aligned_read_pairs = retained_reads + discarded_reads, but that is not the case. Maybe @MikkelSchubert can shed some light here. If we figure this out, we might be able to also drop plot 1, and add the remaining read fraction to plot 4.

apeltzer commented 2 years ago

Guess this will change everything here: https://github.com/MikkelSchubert/adapterremoval/blob/master/CHANGES.md

Mikkel was so nice to add quite a number of things there which should make the overall process of updating the MultiQC module for AR quite easy and much clearer than before 👍🏻

MikkelSchubert commented 2 years ago

There's an example the JSON output here: example.json.gz

The format is not final, so do let me know if something is missing, makes your work harder, or is just plain stupid.

For now read counts are always given in terms of input reads (so the number of merged sequenced in the output is 1/2 the number of reads written under the "merged" section), since that means that the numbers in the file always add up, but that might be changed. It's gonna need some explanation in either case.

Quality/base composition curves are currently based on a configurable 10% of reads since the overhead from collecting those is not insignificant. The number of reads sampled is listed. Length distributions are based on all data. I'd prefer to use 100% of data in all cases, but I also don't want to increase the runtime to do so and so it'll depend on how much I can optimize it.

ewels commented 2 years ago

For now, we have a fix in https://github.com/ewels/MultiQC/pull/1647

Can revisit this when the new version is released 👍🏻