Closed kalavattam closed 1 year ago
Examining the flags in bams (bulk RNA-seq, paired-end sequencing) composed of only (a) "proper" primary alignments, (b) both "proper" primary and secondary alignments, and (c) only "proper" secondary alignments, it looks like the multimappers are deduplicated along with the primary alignments, but pair status seems to not be maintained for primary and secondary alignments in the bam outfiles for (b) and (c). Why is this? Is there a way to call umi_tools dedup
so that, when secondary alignments are present, the pair status is maintained?
(column 1: counts; column 2: bam flag)
# (a) primary alignments ----------------------------------
# infile (not deduplicated)
test_umi-tools-dedup/5781_Q_Is_UT.primary.list-tally-flags.txt
1033332 83
1033332 163
1011879 99
1011879 147
# UMI deduplication (method: directional)
test_umi-tools-dedup/5781_Q_Is_UT.primary.UMI-dedu.list-tally-flags.txt
1024479 83
1024479 163
1003082 99
1003082 147
# "positional" deduplication (--ignore-umi)
test_umi-tools-dedup/5781_Q_Is_UT.primary.pos-dedu.list-tally-flags.txt
979507 83
979507 163
974327 99
974327 147
# (b) primary and secondary alignments --------------------
# infile (not deduplicated)
test_umi-tools-dedup/5781_Q_Is_UT.primary-secondary.list-tally-flags.txt
1033332 83
1033332 163
1011879 99
1011879 147
289272 419
289272 339
52918 403
52918 355
# UMI deduplication (method: directional)
test_umi-tools-dedup/5781_Q_Is_UT.primary-secondary.UMI-dedu.list-tally-flags.txt
1023714 163
1023516 83
1002890 147
1002695 99
286338 339
277696 419
52292 355
43349 403
# "positional" deduplication (--ignore-umi)
test_umi-tools-dedup/5781_Q_Is_UT.primary-secondary.pos-dedu.list-tally-flags.txt
973699 147
973501 99
962868 163
962657 83
177502 339
169071 419
49472 355
40771 403
# (c) secondary alignments --------------------------------
# infile (not deduplicated)
test_umi-tools-dedup/5781_Q_Is_UT.secondary.list-tally-flags.txt
289272 419
289272 339
52918 403
52918 355
# UMI deduplication (method: directional)
test_umi-tools-dedup/5781_Q_Is_UT.secondary.UMI-dedu.list-tally-flags.txt
287206 339
282219 419
52414 355
47638 403
# "positional" deduplication (--ignore-umi)
test_umi-tools-dedup/5781_Q_Is_UT.secondary.pos-dedu.list-tally-flags.txt
190976 339
186087 419
50084 355
45395 403
How I did UMI deduplication (directional)
umi_tools dedup \
--paired \
--spliced-is-unique \
--unmapped-reads=discard \
--stdin={b_infile} \
--stdout={b_umi_dedu} \
--temp-dir={scratch} \
--output-stats={b_umi_stats} \
--log={b_umi_log} \
--error={b_umi_error} \
--timeit={b_umi_timeit} \
--timeit-header
How I did positional deduplication
umi_tools dedup \
--ignore-umi \
--paired \
--spliced-is-unique \
--unmapped-reads=discard \
--stdin={b_infile} \
--stdout={b_pos_dedu} \
--temp-dir={scratch} \
--log={b_pos_log} \
--error={b_pos_error} \
--timeit={b_pos_timeit} \
--timeit-header
Hi Kris,
As you will have seen, UMI-tools treats secondary alignments just like it treats any other reads. UMI-tools makes the decisions on which reads to output on the basis of information contained in read1 (including the value of the "mate position" field). When a decision is made to output read1, the read2 listed in the MNEXT and MPOS fields for that read are added to a buffer of reads that need outputting when they are encountered.
Different aligners handle which values go into the MNEXT and MPOS fields differently. The format specification specifies that the values in the MNEXT and MPOS fields of read1 must always point to the primary alignment of read2. So while the reads might be mapped as a pair, they may not point to each other as mates. Unfortunately, as the information about the "real" mate is not encoded anywhere in the BAM file, there is not much that UMI-tools can do about this. However, some aligners ignore the specification, and the MNEXT and MPOS fields point to the actaul mate alignment rather than the primary alignment. In which case UMI-tools will output the expected pair.
Thank you for the careful explanation—it all makes sense. I did not realize that the MNEXT and MPOS fields of a secondary read1 point to the primary alignment of read2. Thanks for the great tool too.
I have bams that contain multimappers (secondary alignments). All alignments (primary and secondary) are properly paired. Are these bams suitable input for
umi_tools dedup
—i.e., willumi_tools dedup
handle the secondary alignments and keep them paired?