alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

secondary alignment bit in flag not set for unmapped mate of a pair with the other mate mapped #2190

Open ankebusch opened 3 months ago

ankebusch commented 3 months ago

Hi Alex,

I'm running STAR 2.7.10b or 2.7.11b (both show the same result). When mapping a public paired-end RNAseq sample (https://www.ncbi.nlm.nih.gov/sra/SRX18292746), two read pairs seem to give somewhat strange results. The pairs in question are SRR22319547.14732961 and SRR22319547.9907254

Before removing secondary alignments, alignments of SRR22319547.14732961 are specified as:

SRR22319547.14732961    163 chr16   4421893 3   34M =   4527108 105287  TATACTCTTTTTTTTTTTTTTTTTTTTTTTTTTT  F,,F,FFFFF,FFFF:FFF:FFFFFFFFFFFF:F  NH:i:2  HI:i:1  AS:i:92 nM:i:5  MD:Z:34 NM:i:0
SRR22319547.14732961    83  chr16   4527108 3   72M28S  =   4421893 -105287 GACGGAGTCTCGCACTATCGCCCAGGCTGGAATGCAGTGGAGCGATCTCGGCTCACTGCAAGCTCCGCCTCCCGGGTTCAGGCCATTCTCCTTCCTCAGC    F:FFFFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFF,F:FFFF:FFF:F::FF:F:F:FFF,F:FFF:FFFF:FFFF,:FF:FFF    NH:i:2  HI:i:1  AS:i:92 nM:i:5  MD:Z:16G14G8C1G5T23 NM:i:5
SRR22319547.14732961    345 chr9    12779295    3   100M    *   0   0   GACGGAGTCTCGCACTATCGCCCAGGCTGGAATGCAGTGGAGCGATCTCGGCTCACTGCAAGCTCCGCCTCCCGGGTTCAGGCCATTCTCCTTCCTCAGC    F:FFFFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFF,F:FFFF:FFF:F::FF:F:F:FFF,F:FFF:FFFF:FFFF,:FF:FFF    NH:i:2  HI:i:2  AS:i:92 nM:i:3  MD:Z:13C59A18G7 NM:i:3
SRR22319547.14732961    165 *   0   0   *   chr9    12779295    0   TATACTCTTTTTTTTTTTTTTTTTTTTTTTTTTT  F,,F,FFFFF,FFFF:FFF:FFFFFFFFFFFF:F  NH:i:0  HI:i:0  AS:i:92 nM:i:3  uT:A:4

The primary alignment is a proper alignment of both mates, while the secondary alignment has only one mate mapped. The unmapped mate has a binary flag set as being a paired read, which is unmapped and whose mate is mapped to the reverse strand. However, the bit indicating it as secondary is not set. When secondary alignments are removed (either by running samtools view -F256 or by directly using --outSAMmultNmax 1 when running STAR), this leads to the following alignments remaining:

SRR22319547.14732961    163 chr16   4421893 3   34M =   4527108 105287  TATACTCTTTTTTTTTTTTTTTTTTTTTTTTTTT  F,,F,FFFFF,FFFF:FFF:FFFFFFFFFFFF:F  NH:i:2HI:i:1    AS:i:92 nM:i:5  MD:Z:34 NM:i:0
SRR22319547.14732961    83  chr16   4527108 3   72M28S  =   4421893 -105287 GACGGAGTCTCGCACTATCGCCCAGGCTGGAATGCAGTGGAGCGATCTCGGCTCACTGCAAGCTCCGCCTCCCGGGTTCAGGCCATTCTCCTTCCTCAGC    F:FFFFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFF,F:FFFF:FFF:F::FF:F:F:FFF,F:FFF:FFFF:FFFF,:FF:FFF    NH:i:2  HI:i:1  AS:i:92 nM:i:5  MD:Z:16G14G8C1G5T23 NM:i:5
SRR22319547.14732961    165 *   0   0   *   chr9    12779295    0   TATACTCTTTTTTTTTTTTTTTTTTTTTTTTTTT  F,,F,FFFFF,FFFF:FFF:FFFFFFFFFFFF:F  NH:i:0  HI:i:0  AS:i:92 nM:i:3  uT:A:4

i.e. read pair SRR22319547.14732961 still occurs with two mappings, one as proper pair and one with only one mate mapped (which was removed) and one mate unmapped (which is kept). This behavior seems a bit surprising to me. Is it possible that this is a bug?

I ran STAR with the following parameters:

STAR --runMode alignReads --outStd SAM --outMultimapperOrder Random --outSAMattributes NH HI AS nM MD NM --outSJfilterReads Unique --readFilesCommand zcat --outSAMunmapped Within --genomeDir STARindex_GENCODE_release31 --outFilterMismatchNoverReadLmax 0.04 --outFilterMismatchNmax 999 --sjdbOverhang 100 --sjdbGTFfile gencode.v31.primary_assembly.annotation.gtf --readFilesIn SRR22319547_1.fastq.gz SRR22319547_2.fastq.gz

Mapping pair SRR22319547.14732961 separately (without the other reads of the sample) however gives a different mapping result. There, the proper pair is assigned as secondary alignment and removed correctly. Since both possible alignments have the same alignment score (AS:i:92), I guess it is chosen randomly, which one is picked as primary or secondary. The issue described above, only occurs in case the alignment with only mate mapped is defined as secondary.

Best and thanks a lot for looking into this, Anke.