Closed olgabot closed 2 years ago
Ok, so if I understand you correctly, we should just not plot Retained read pairs
? As this is already a sum of the other categories?
It could be, however the plot is indicating number of "Full-length collapsed pairs" and "uncollapsed pairs," and "singleton R{1,2}" are all separate. Here's some log files run on the same sample for reference.
Here's the critical output:
[FLASH] Read combination statistics:
[FLASH] Total pairs: 67277
[FLASH] Combined pairs: 66658
[FLASH] Innie pairs: 65235 (97.87% of combined)
[FLASH] Outie pairs: 1423 (2.13% of combined)
[FLASH] Uncombined pairs: 619
[FLASH] Percent combined: 99.08%
Total Pairs > Combined pairs
67 277 > 66 658
Full output:
[Trimming statistics]
Total number of read pairs: 67277
Number of unaligned read pairs: 385
Number of well aligned read pairs: 66892
Number of discarded mate 1 reads: 0
Number of singleton mate 1 reads: 11
Number of discarded mate 2 reads: 11
Number of singleton mate 2 reads: 0
Number of reads with adapters[1]: 2878
Number of full-length collapsed pairs: 66849
Number of truncated collapsed pairs: 0
Number of retained reads: 67694
Number of retained nucleotides: 22884082
Average length of retained reads: 338.052
Number of retained reads > Total number of read pairs > Number of well aligned read pairs > Number of full-length collapsed pairs
67 694 > 67 277 > 66 892 > 66 849
Seems like they match in their "Total number of read pairs" (67 277) and 2 significant figures for "full-length collapsed pairs" (66 658 for FLASH, 66 849 for AdapterRemoval), so I'd be inclined to think those are comparable. It doesn't seem like for this report, AdapterRemoval is treating the separate R1 and R2 of paired-end reads separately.
Anyway, I can make the PR to change this if this is making any sense.
if this is making any sense.
Not really, but I have a crying baby in the background which is making it difficult to think 😭
If you think you know how to fix it then, yeah - please put in a PR! That would be fab 😀
@alexanderscholz - do you have any feedback on this issue? You wrote the AdapterRemoval MultiQC module. I'll look into resolving this now but would be good to have your thoughts.
Phil
@jfy133 - we're using AdapterRemoval v2 in https://github.com/nf-core/eager too - what are your thoughts on this one?
@alexhbnr and I have already noticed that but Alex hasn't had time to fix and I still haven't learnt python ;).
We had decided the following: Retained reads should definitely be a separate plot. This is the only metric that matches the actual number of reads in the output file.
All other metrics can go together as they represent the actual processing of the reads themselves.
I submitted the pull request (#1204) with my initial changes and am happy if someone wants to think about the implemented changes and improve them.
It's a poor man's fix since it doesn't pull out the retained reads in its own plot but instead treats them as reads that haven't been changed, i.e. no singletons or collapsed reads. This get rid of the problem that the length of the bar plots are twice as long as they should be but more or less sacrifices the retained reads metric.
Nice, thanks @alexhbnr! The plot definitely looks a lot better, I'm sure that it is a step in the right direction. However the comment from @olgabot makes me think that we can't combine many of these categories together in a single stacked bar plot.
This is similar to work I was doing a couple of days in the Cutadapt module for e7ed1052b207c2396e69438b1134a39ed7358afa - I parsed a tonne of fields and then realised that I couldn't plot most of them as the categories didn't stack properly.
Here I'm not even sure that the % reads trimmed in the General Statistics is really trustworthy, given that @olgabot's example has more retained reads
(67694) than were inputted (67277)..
I also don't think that the % Trimmed Reads
in this sense is actually very interesting. What I would think that people actually want to know is what percentage of their data was lost. For Cutadapt I did this as the % of bases removed.
Phil
@MikkelSchubert - hello! As author of AdapterRemoval, I was wondering if you could help us?
We currently have a MultiQC module for AdapterRemoval, but we think that the bar plot it produces is wrong / misleading in the way that it stacks categories - see an example report: multiqc_report.html
Are you able to shed some light for us on how the different categories reported by AdapterRemoval are related, and which we would be able to use to create a stacked bar plot representing the total read count?
If you have any views on what would be best reported in MultiQC, that would also be very helpful.
Many thanks in advance,
Phil
Hey Phil - just some clarifications (but I'm sure Mikkel will be better here), the inputted reads you've written above is read pairs, whereas the retained reads represent each unique entry in the FASTQ file i.e. will include merged reads, but also unmerged reads as a form of 'singletons', which is why you can have more (because 1 un-merged read pair turns in 1).
The way I would personally plot this would be as follows (based off of Olga's stats table), if I understand the statistics correctly:
[Trimming statistics]
Maybe just for general stats for validation e.g. compared to FASTQC
Total number of read pairs: 67277
stacked bar; could also have percentage option
Number of unaligned read pairs: 385 Number of well aligned read pairs: 66892
[This basically reports the number of input reads pairs that overlap, prior 'collapsing'; if the input is paired end and 'collapsing' is selected by the user - therefore unaligned read pairs are those that did not overlap so the insert size is bigger than the sequencing cycles]
stacked bar; percentage option wouldn't make sense here - see comment
Number of full-length collapsed pairs: 66849 (merged reads, no adapter so untrimmed) Number of truncated collapsed pairs: 0 (merged reads, adapter trimmed) Number of singleton mate 1 reads: 11 (unmerged forward reads) Number of singleton mate 2 reads: 0 (unmerged reverse reads) Number of discarded mate 1 reads: 0 (low-quality removed reads, e.g. length filter) Number of discarded mate 2 reads: 11 (low-quality removed reads, e.g. length filter)
[This would represent the categories that could be included as single entries in the 'Retained Reads category; however, this totals is 66860, meaning we are still missing some reads, something I've never been able to work out - only thing I can think of is that these are single reads without one of a pair in the input FASTQ file itself; but I'm not sure if that should really exist coming off demultiplexing ]
barplot
Number of retained reads: 67694
I agree with @jfy133, separate plots would make more sense. However, for the suggested Quality filtering, the six categories included into it (full-length collapsed, truncated collapsed, singleton mate 1 and 2, and discarded mate 1 and 2) do not sum up to the number of retained reads.
For the example above, retained reads (67694) > collapsed reads (66849) + singletons (22).
So it would make sense to get feedback from @MikkelSchubert how he actually calculates these.
Would could always calculate the difference and make a new plot category called "Mystery reads" 😆
I apologize for the confusing output from AdapterRemoval; it is something I've been wanting to rework, but never got around to. It is not clear to me exactly what you want to show, so I'll just try to explain the output you are seeing:
As has been noted, the numbers below show either pairs or individual reads. You need to keep this in mind, since a pair accounts for two individual reads.
Total number of read pairs: 67277
The total number of input read pairs.
Number of unaligned read pairs: 385
Number of well aligned read pairs: 66892
Statistics about the alignment pair-wise alignments; the sum of these numbers correspond to the number of (input) reads.
Number of discarded mate 1 reads: 0
Number of singleton mate 1 reads: 11
Number of discarded mate 2 reads: 11
Number of singleton mate 2 reads: 0
Statistics for discarded reads and "singleton" reads; this gets a bit complicated, especially when you have merged reads. Firstly, a singleton read is a read in which the mate (but not the read itself) was discarded, meaning that a pair that was discarded increments both "discarded mate 1" and "discarded mate 2", while a pair with one discarded mate increments "discarded mate A" and "singleton mate B", where A != B. In addition, merged reads that are discarded are counted as if the pair was discarded (increments both "discarded mate" values).
Number of reads with adapters[1]: 2878
This number is the number of reads with adapter sequence that got trimmed; for normal paired reads (where the read lengths are identical) this will always be an even number.
Number of full-length collapsed pairs: 66849
Number of truncated collapsed pairs: 0
The number of merged reads, with "truncated" pairs being those merged reads that were trimmed following merging (due to low quality bases or Ns). Note that these numbers do not include discarded merged reads.
Number of retained reads: 67694
Number of retained nucleotides: 22884082
The total number of output reads (and how many nucleotides that amounts to); a simple way to calculate this is 2 input - sum(discarded) - sum(merged), since each input pair accounts for 2 reads, the discarded reads are excluded, and merging transforms two input reads into one output read; 2 67277 - (11 + 0) - (66849 + 0) = 67694.
As has been noted, this matches the sum of reads in the output files (excluding discarded reads), where a merged pair has been transformed into a single read.
Many thanks @MikkelSchubert - this is brilliant! Any chance that this text could be put in the AdapterRemoval docs somewhere (https://adapterremoval.readthedocs.io/en/latest/)? I'm happy to open a PR pasting your issue text if you'd like.
Need to do some head scratching now and figure out is the most important information / what people actually want to know in the MultiQC report. @olgabot - what were you interested in? I guess the number of reads that were collapsed?
Phil
In addition - I have a request from @aidaanva that the percentage of reads could also be calculated and reported specifically in the general stats table (as gives you an idea of how well your PE data has merged - this is a good indicator for ancient DNA)
I guess this would be Number of well aligned read pairs
/ Total number of read pairs
, if I understand Mikkel correctly. @apeltzer is that how you calculated in ReportTable.
Number of reads with adapters[1]: 2878
This number is the number of reads with adapter sequence that got trimmed; for normal paired reads (where the read lengths are > identical) this will always be an even number.
Hi @MikkelSchubert just to clarify this, if we wanted to know the fraction of 'output reads' that had adapters removed (presumably prior quality trimming etc.), what would the denominator be here? For example, would it be: Number of reads with adapters[1]:
/ (Number of well aligned read pairs
/ sum(Number of singleton mate * reads
))?
I did a bit of looking through Mikkel's description above, and have the following suggestion for table/figures
@apeltzer and @alexhbnr both provisionlly agreed that this made sense. Note that the same should work also for single end data as you just drop the read collapsing section.
For general stats table I think we could have:
total (Total number of read pairs)
Total: sum of both
discarded_m1 (Number of discarded mate 1 reads) discarded_m2 (Number of singleton mate 1 reads) singleton_m1 (Number of discarded mate 2 reads) singleton_m2 ( Number of singleton mate 2 reads)
This one I'm really sure how to display
Total: total input pairs
@MikkelSchubert do you think this is sensible too? If yes, we will implement it in the PR that @alexhbnr already opened (#1204 ) and make sure this ends up in the next MultiQC release 👍🏻
Hi @MikkelSchubert just to clarify this, if we wanted to know the fraction of 'output reads' that had adapters removed (presumably prior quality trimming etc.), what would the denominator be here? For example, would it be: Number of reads with adapters[1]: / (Number of well aligned read pairs / sum(Number of singleton mate * reads))?
Unfortunately it is not possible to calculate the percentage of output reads that had adapter removed, not in all situations. This is because the read alignment/adapter trimming step is performed first, and then merging and quality trimming/filtering is performed without tracking whether or not the read originally contained adapter sequences.
Note that a "well aligned" read may not contain adapter sequence, since well aligned just means that it aligned and passed the quality criteria for alignments, not that the alignment included adapter sequence. Well aligned also does not mean able to be merged, since there is a separate option that dictates the minimum amount of overlapping bases required to merge two aligned reads.
So you can only calculate a percentage in the situation where you don't merge reads and where you do want to count both discarded and retained reads. In other words, it is only possible to calculate the percentage of input reads with adapter sequences:
N_se = Total number of reads
N_pe = Total number of read pairs
A = sum(Number of reads with adapters[*])
Fraction_se = A / N_se
Fraction_pe = A / (2 * N_pe)
Note the distinction between reads and pairs, which is due to AR not assuming that read lengths are identical in a pair, meaning that either or both reads in a pair can have adapter sequences.
I'm back to looking at cleaning all of this up (and generating both JSON and human readable reports while I'm at it), so do let me know what kind of statistics you'd like to have and if there is anything else that'd make your lives easier.
Hi @MikkelSchubert ,
Ok thanks for clarifying. I think I still need to get my head around exactly how AR is working (I guess I don't fully understand what 'aligned' is referring to now).
For your last point: have you got an issue open on the AR github where we could talk about requests rather than polluting this issue? JSON would be fantastic though, as JSON is the preferred input format for MulitQC (with a certain formatting, but we can also try and modify your JSON to fit MultiQC ourselfs with a custom script etc.).
Going back to the issue at hand - OK, then maybe the number of output reads with adapters removed isn't so bad. For me working with aDNA it's sort of a rough quality check because I expect most of my reads to have adapters as the reads are so short. So your calculation described above could equally work.
New proposal then:
For general stats table I think we could have:
Adapter trimming (stacked bar chart) [PE & SE]
below]Number of reads with adapters[*]
) with bar total being (Total number of read pairs
* 2)Number of reads with adapters[*]
) with bar total being Total number of reads
This one I'm really sure how to display, maybe: one category being discarded, and second category of 'singletons'?
Total: total input pairs
retained
Post trimming cleanup of read pairs prior collapsing [PE & SE]
As I noted above (also see below), cleanup takes place after the alignment/merging.
[remainder] (assumed would be discarded pairs?)
I'm not sure I understand this correctly, but non-collapsed pairs are not discarded by AdapterRemoval and it is currently not possible to see how many collapsed reads were discarded, so the most meaningful remainder would probably be all non-collapsed input pairs: remainder = pairs - collapsed - truncated_collapsed.
With regards to your justifiable confusion, the TL;DR is that "well aligned" is a statistic that in its current form was probably mostly useful to the original creator of AR. I plan on removing it in the reworked statistics.
With the following definitions
F = mate 1 (forward, in input orientation)
R = mate 2 (reverse, in input orientation)
A = Adapter 1 (user supplied orientation)
B = Adapter 2 (user supplied orientation)
Lower-case f, r, a, and b signifies the reverse complement of the above
The simple case is SE alignments where AR just does a pairwise alignment of the read (F) and and the adapter (A)
1) No overlap: Unaligned
FFFFFFFFFF
AAAAA
2) Aligned, but poor alignment: Unaligned
FFFFFFFFFF
AAAAA
3) Aligned, good alignment: Well aligned, adapter trimmed
FFFFFFFFFF
AAAAA
Only in case (3) is a read trimmed for adapters and counted as having been trimmed.
The situation is a bit more complex for PE reads, where reads and adapters are merged prior to alignment:
1) No overlap: Unaligned
bbbbbFFFFFFFFFF
rrrrrrrrrrAAAAA
2) Aligned, but poor alignment: Unaligned
bbbbbFFFFFFFFFF
rrrrrrrrrrAAAAA
3) Aligned, good alignment, but insert only aligns a little: Well aligned
bbbbbFFFFFFFFFF
rrrrrrrrrrAAAAA
4) Aligned, good alignment, longer but not complete insert alignment: Well aligned, merged if enabled
bbbbbFFFFFFFFFF
rrrrrrrrrrAAAAA
5) As 4 and alignment includes adapter: Well aligned, adapter trimmed, merged if enabled
bbbbbFFFFFFFFFF
rrrrrrrrrrAAAAA
Whether an alignment falls into case (3) or (4) depends on the --minalignmentlength option, which defaults to requiring an overlap of at least 11bp.
Reads are trimmed of adapter sequence, merged if possible, and then trimming of low quality bases and filtering of reads based on quality criteria is performed.
@MikkelSchubert Aha, thank you for all your patience! I think I'm starting to understand it now.
[remainder] (assumed would be discarded pairs?)
What you suggested above is actually what I meant, so that's good to know.
And for the alignment, I see - so aligned
doesn't have anything to do actually with the trimming itself, it's just a 'precurser' summary of the data prior processing (so to say?)
I think then for MultiQC, the following 'simplified' version would be maybe sufficient, where we will always just compare with input reads
So for General stats:
Category | SE | PE |
---|---|---|
Input reads | Total number of reads | Total number of read pairs * 2 |
% Trimmed input reads | sum(Number of reads with adapters[*]) / Total number of reads | sum(Number of reads with adapters[]) / (Total number of read pairs 2) |
% Merged good-quality input read pairs | NA | sum(Number of * collapsed pairs) / Total number of read pairs |
Number of singletons retained | NA | sum(Number of singleton mate * reads) |
Retained reads | Number of retained reads | Number of retained reads |
And then for the plots
I agree with your suggestion, @jfy133 , but I would propose to drop plot 1 because the total number of in put reads is already contained in plot 2. Otherwise, it looks great!
I've recently also found this issue and made a PR (#1423), but then found out there was already a long discussion about it. :smile: Not sure if I can help much, but here goes my two cents.
I agree with the General Stats info that has been proposed so far. As for the plots, I'd keep all in number of reads (as it been proposed), and I've run an example sample:
Stats | number of reads |
---|---|
Total number of read pairs: | 83,894,072 |
Number of unaligned read pairs: | 60,035,513 |
Number of well aligned read pairs: | 23,858,559 |
Number of discarded mate 1 reads: | 1,716,701 |
Number of singleton mate 1 reads: | 55,102 |
Number of discarded mate 2 reads: | 1,770,227 |
Number of singleton mate 2 reads: | 1,576 |
Number of reads with adapters[1]: | 5,612,302 |
Number of full-length collapsed pairs: | 8,093,391 |
Number of truncated collapsed pairs: | 1,457,443 |
Number of retained reads: | 154,750,382 |
Number of retained nucleotides: | 15,775,443,835 |
Average length of retained reads: | 101.941 |
output files | number of reads |
---|---|
reads.collapsed.gz | 8,093,391 |
reads.collapsed.truncated.gz | 1,457,443 |
reads.discarded.gz | 1,779,706 |
reads.pair1.truncated.gz | 72,571,435 |
reads.pair1.truncated.gz | 72,571,435 |
reads.singleton.truncated.gz | 56,678 |
According to these numbers, we get:
total_read_pairs = aligned_read_pairs + unaligned_read_pairs
retained_reads = R1 + R2 + Singleton R1 + Singleton R2 + Full length Collapsed + Truncated Collapsed
discarded_reads = Discarded R1 + Discarded R2
If so then I'd suggest:
retained_reads
+ discarded_reads
(kind of as it is now)There is however, one thing that does not quite add up. I'd expect that 2*well_aligned_read_pairs = retained_reads + discarded_reads
, but that is not the case. Maybe @MikkelSchubert can shed some light here. If we figure this out, we might be able to also drop plot 1, and add the remaining read fraction to plot 4.
Guess this will change everything here: https://github.com/MikkelSchubert/adapterremoval/blob/master/CHANGES.md
Mikkel was so nice to add quite a number of things there which should make the overall process of updating the MultiQC module for AR quite easy and much clearer than before 👍🏻
There's an example the JSON output here: example.json.gz
The format is not final, so do let me know if something is missing, makes your work harder, or is just plain stupid.
For now read counts are always given in terms of input reads (so the number of merged sequenced in the output is 1/2 the number of reads written under the "merged" section), since that means that the numbers in the file always add up, but that might be changed. It's gonna need some explanation in either case.
Quality/base composition curves are currently based on a configurable 10% of reads since the overhead from collecting those is not insignificant. The number of reads sampled is listed. Length distributions are based on all data. I'd prefer to use 100% of data in all cases, but I also don't want to increase the runtime to do so and so it'll depend on how much I can optimize it.
For now, we have a fix in https://github.com/ewels/MultiQC/pull/1647
Can revisit this when the new version is released 👍🏻
Description of bug:
We tested using AdapterRemoval and FLASH to merge reads for a CRISPR analysis pipeline and generated MultiQC outputs for both. It was confusing to see them side-by-side as AdapterRemoval looked like it had 2x the reads, but really I think some of the numbers weren't getting subtracted from one another.
AdapterRemoval output
It seems that AdapterRemoval is showing total reads when it should be showing:
FLASH output (on same samples)
File that triggers the error: Please drag and drop (and upload to the GitHub issue) an input file that I can use to replicate the error.
Here are some log files:
adapterremoval_logs.zip flash_logs.zip
MultiQC run details (please complete the following):
Command used to run MultiQC:
multiqc -f $rtitle $rfilename --config $multiqc_config . -m fastqc -m adapterRemoval -m flash
MultiQC Version: MultiQC v1.7
Operating System: Ubuntu
Python Version:
Method of MultiQC installation: conda, this is the environment.yml