DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

Primary and secondary alignments are reported at the very same locus, adding redundant clutter to the secondary matches #274

Open mmokrejs opened 3 years ago

mmokrejs commented 3 years ago

Hi, is it possible to make hisat2 omit secondary alignments to the very same locus? Maybe the current behavior explains your findings:

HISAT 2.2.0 release 2/6/2020

This major version update includes a new feature to handle “repeat” reads. Based on sets of 100-bp simulated and 101-bp real reads that we tested, we found that 2.6-3.4% and 1.4-1.8% of the reads were mapped to >5 locations and >100 locations, respectively. Attempting to report all alignments would likely consume a prohibitive amount of disk space. In order to address this issue, our repeat indexing and alignment approach directly aligns reads to repeat sequences, resulting in one repeat alignment per read.

Here is the same read pair reported as both a primary and a secondary alignment. The start positions of one mate differ by only 2 nt (75881224 vs. 75881222); the other mate is placed identically at 75881401.

$ samtools view -F 0x100 suspFAP117-UEMpanelrun01_S1.trimmomatic.pairs.hisat2.removed_duplicates.bam |  grep M05378:215:000000000-J5BH5:1:2107:15473:12690
M05378:215:000000000-J5BH5:1:2107:15473:12690   163 chr1    75881224    60  12M2I62M    =   75881401    253 GTCTTACCAAACATGTGTTTTCAGGATAGTGTGCAAACTGCTTAGTGAGATTTATGAACATATTCATTGCTTATAT    CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGD    MD:Z:74 PG:Z:MarkDuplicates RG:Z:J5BH5.1.suspFAP117 XG:i:2  NH:i:2  NM:i:2  XM:i:0  XN:i:0  XO:i:1  AS:i:-11    YS:i:0  ZS:i:-50    YT:Z:CP
M05378:215:000000000-J5BH5:1:2107:15473:12690   83  chr1    75881401    60  76M =   75881224    -253    GCATAGTTAAATTTGTATTCGATTCAAACAATTTGATATAATAGTTATGACATTTAAAATTTTTAACTTGAAATAG    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC    MD:Z:76 PG:Z:MarkDuplicates RG:Z:J5BH5.1.suspFAP117 XG:i:0  NH:i:2  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:-11    YT:Z:CP
$

$ samtools view -f 0x100 suspFAP117-UEMpanelrun01_S1.trimmomatic.pairs.hisat2.removed_duplicates.bam |  grep M05378:215:000000000-J5BH5:1:2107:15473:12690
M05378:215:000000000-J5BH5:1:2107:15473:12690   419 chr1    75881222    60  76M =   75881401    255 GTCTTACCAAACATGTGTTTTCAGGATAGTGTGCAAACTGCTTAGTGAGATTTATGAACATATTCATTGCTTATAT    CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGD    MD:Z:0A1G1C0T0T0A0C0C1A1C62 PG:Z:MarkDuplicates RG:Z:J5BH5.1.suspFAP117 XG:i:0  NH:i:2  NM:i:10 XM:i:10 XN:i:0  XO:i:0  AS:i:-50    YS:i:0  ZS:i:-50    YT:Z:CP
M05378:215:000000000-J5BH5:1:2107:15473:12690   339 chr1    75881401    60  76M =   75881222    -255    GCATAGTTAAATTTGTATTCGATTCAAACAATTTGATATAATAGTTATGACATTTAAAATTTTTAACTTGAAATAG    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC    MD:Z:76 PG:Z:MarkDuplicates RG:Z:J5BH5.1.suspFAP117 XG:i:0  NH:i:2  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:-50    YT:Z:CP
$
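To make the two listings above easier to compare, here is a minimal sketch that decodes the FLAG column using the bit definitions from the SAM specification. The helper name `decode_flag` is my own and is not part of samtools or hisat2. It shows that 419 and 339 are simply 163 and 83 with the secondary bit (0x100) added, which is exactly what `-F 0x100` and `-f 0x100` select on.

```python
# Bit values and meanings as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# The four FLAG values seen in the records above.
for flag in (163, 83, 419, 339):
    print(flag, decode_flag(flag))
```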

I used hisat2-2.2.1/hisat2-align-s --bowtie2-dp 2 --score-min L,0,-1 --no-softclip --fr --all for the alignment. I would like to collect only non-overlapping primary and secondary alignments, i.e. alignments whose aligned regions do not overlap one another.
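In case it helps to pin down what I mean, below is a rough sketch of the post-filter I have in mind. It is not anything hisat2 provides, and all function names are hypothetical. It works on plain SAM text lines and computes each alignment's reference interval from POS and CIGAR. It then keeps a secondary record only if that interval does not overlap an already kept alignment of the same mate of the same read. Assumptions: all records fit in memory, primaries always win, and secondaries are considered in file order. The demo records are abbreviated copies of the ones above, with SEQ, QUAL and the tags replaced by placeholders.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = frozenset("MDN=X")  # CIGAR operations that advance along the reference
UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800


def reference_span(pos, cigar):
    """1-based inclusive (start, end) reference interval of an alignment."""
    length = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1


def mate_key(fields):
    """Read name plus the first/second-in-pair bits, so mates are tracked separately."""
    return fields[0], int(fields[1]) & 0xC0


def overlaps(chrom, start, end, intervals):
    return any(c == chrom and start <= e and s <= end for c, s, e in intervals)


def drop_overlapping_secondaries(sam_lines):
    """Return SAM lines without secondaries that overlap a kept alignment of the same mate."""
    lines = [line.rstrip("\n") for line in sam_lines if line.strip()]
    kept_intervals = {}
    # Pass 1: register the intervals of all mapped primary alignments.
    for line in lines:
        if line.startswith("@"):
            continue
        f = line.split("\t")
        if int(f[1]) & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        span = reference_span(int(f[3]), f[5])
        kept_intervals.setdefault(mate_key(f), []).append((f[2], *span))
    # Pass 2: emit everything, skipping secondaries that overlap a kept interval.
    kept = []
    for line in lines:
        if not line.startswith("@"):
            f = line.split("\t")
            flag = int(f[1])
            if flag & SECONDARY and not flag & UNMAPPED:
                start, end = reference_span(int(f[3]), f[5])
                seen = kept_intervals.setdefault(mate_key(f), [])
                if overlaps(f[2], start, end, seen):
                    continue
                seen.append((f[2], start, end))
        kept.append(line)
    return kept


def record(name, flag, chrom, pos, cigar):
    """Build an abbreviated SAM line; the trailing fields are placeholders."""
    return "\t".join([name, str(flag), chrom, str(pos), "60", cigar, "=", "0", "0", "*", "*"])


read = "M05378:215:000000000-J5BH5:1:2107:15473:12690"
demo = [
    record(read, 163, "chr1", 75881224, "12M2I62M"),
    record(read, 83, "chr1", 75881401, "76M"),
    record(read, 419, "chr1", 75881222, "76M"),
    record(read, 339, "chr1", 75881401, "76M"),
]
for line in drop_overlapping_secondaries(demo):
    print(line)
```

For the pair above, the primary `12M2I62M` at 75881224 and the secondary `76M` at 75881222 both end at 75881297, so the sketch drops both secondary records and prints only the two primary ones. Note that it does not rewrite NH or the mate fields of the remaining records, so those tags would be stale after filtering. The same function could in principle be fed the text lines produced by `samtools view -h`.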