CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
493 stars 190 forks source link

Using umi_tools dedup with bams containing multimappers #574

Closed kalavattam closed 1 year ago

kalavattam commented 1 year ago

I have bams that contain multimappers (secondary alignments). All alignments (primary and secondary) are properly paired. Are these bams suitable input for umi_tools dedup—i.e., will umi_tools dedup handle the secondary alignments and keep them paired?

kalavattam commented 1 year ago

Examining the flags in bams (bulk RNA-seq, paired-end sequencing) composed of only (a) "proper" primary alignments, (b) both "proper" primary and secondary alignments, and (c) only "proper" secondary alignments, it looks like the multimappers are deduplicated along with the primary alignments, but pair status seems to not be maintained for primary and secondary alignments in the bam outfiles for (b) and (c). Why is this? Is there a way to call umi_tools dedup so that, when secondary alignments are present, the pair status is maintained?

(column 1: counts; column 2: bam flag)

# (a) primary alignments ----------------------------------
# infile (not deduplicated)
test_umi-tools-dedup/5781_Q_Is_UT.primary.list-tally-flags.txt 
1033332 83
1033332 163
1011879 99
1011879 147

# UMI deduplication (method: directional)
test_umi-tools-dedup/5781_Q_Is_UT.primary.UMI-dedu.list-tally-flags.txt 
1024479 83
1024479 163
1003082 99
1003082 147

# "positional" deduplication (--ignore-umi)
test_umi-tools-dedup/5781_Q_Is_UT.primary.pos-dedu.list-tally-flags.txt 
 979507 83
 979507 163
 974327 99
 974327 147

# (b) primary and secondary alignments --------------------
# infile (not deduplicated)
test_umi-tools-dedup/5781_Q_Is_UT.primary-secondary.list-tally-flags.txt
1033332 83
1033332 163
1011879 99
1011879 147
 289272 419
 289272 339
  52918 403
  52918 355

# UMI deduplication (method: directional)
test_umi-tools-dedup/5781_Q_Is_UT.primary-secondary.UMI-dedu.list-tally-flags.txt
1023714 163
1023516 83
1002890 147
1002695 99
 286338 339
 277696 419
  52292 355
  43349 403

# "positional" deduplication (--ignore-umi)
test_umi-tools-dedup/5781_Q_Is_UT.primary-secondary.pos-dedu.list-tally-flags.txt
 973699 147
 973501 99
 962868 163
 962657 83
 177502 339
 169071 419
  49472 355
  40771 403

# (c) secondary alignments --------------------------------
# infile (not deduplicated)
test_umi-tools-dedup/5781_Q_Is_UT.secondary.list-tally-flags.txt
 289272 419
 289272 339
  52918 403
  52918 355

# UMI deduplication (method: directional)
test_umi-tools-dedup/5781_Q_Is_UT.secondary.UMI-dedu.list-tally-flags.txt
 287206 339
 282219 419
  52414 355
  47638 403

# "positional" deduplication (--ignore-umi)
test_umi-tools-dedup/5781_Q_Is_UT.secondary.pos-dedu.list-tally-flags.txt
 190976 339
 186087 419
  50084 355
  45395 403

How I did UMI deduplication (directional)

umi_tools dedup \
    --paired \
    --spliced-is-unique \
    --unmapped-reads=discard \
    --stdin={b_infile} \
    --stdout={b_umi_dedu} \
    --temp-dir={scratch} \
    --output-stats={b_umi_stats} \
    --log={b_umi_log} \
    --error={b_umi_error} \
    --timeit={b_umi_timeit} \
    --timeit-header

How I did positional deduplication

umi_tools dedup \
    --ignore-umi \
    --paired \
    --spliced-is-unique \
    --unmapped-reads=discard \
    --stdin={b_infile} \
    --stdout={b_pos_dedu} \
    --temp-dir={scratch} \
    --log={b_pos_log} \
    --error={b_pos_error} \
    --timeit={b_pos_timeit} \
    --timeit-header
IanSudbery commented 1 year ago

Hi Kris,

As you will have seen, UMI-tools treats secondary alignments just like it treats any other reads. UMI-tools makes the decisions on which reads to output on the basis of information contained in read1 (including the value of the "mate position" field). When a decision is made to output read1, the read2 listed in the MNEXT and MPOS fields for that read are added to a buffer of reads that need outputting when they are encountered.

Different aligners handle which values go into the MNEXT and MPOS fields differently. The format specification specifies that the values in the MNEXT and MPOS fields of read1 must always point to the primary alignment of read2. So while the reads might be mapped as a pair, they may not point to each other as mates. Unfortunately, as the information about the "real" mate is not encoded anywhere in the BAM file, there is not much that UMI-tools can do about this. However, some aligners ignore the specification, and the MNEXT and MPOS fields point to the actaul mate alignment rather than the primary alignment. In which case UMI-tools will output the expected pair.

kalavattam commented 1 year ago

Thank you for the careful explanation—it all makes sense. I did not realize that the MNEXT and MPOS fields of a secondary read1 point to the primary alignment of read2. Thanks for the great tool too.