RNAseq variant calling: STAR align promotes mismatched overhanging mappings onto introns instead of calling splicing junctions

ftucos commented 3 years ago

Hi, I'm doing variant calling from my RNAseq data on a cell line (Paired End 2X100b, Illumina) I'm following GATK workflows and everything is working almost fine since I can find all the expected mutations already characterized in my cell line. The problem is that I see some additional mutations that are clearly the result of an alignment error (e.g. NM_001364837:exon8:c.727+1G>A, rs781065280) I'm using the latest release of STAR aligner 2.7.6a to perform the alignment with the following options

STAR --runThreadN 8 --genomeDir $GENOME_DIR \
  --readFilesIn ...   --readFilesCommand gunzip -c \
  --outSAMattrRGline ... \
  --outSAMtype BAM SortedByCoordinate \
  --twopassMode Basic \
  --outSAMmapqUnique 60 \
  --outFilterType BySJout --alignIntronMin 20 \
  --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
  --alignSJoverhangMin 8 --alignSJDBoverhangMin 3 --scoreGenomicLengthLog2scale 0 \
  --outFileNamePrefix $OUTPUT_DIR/...

The problem resides in STAR mapping some reads with a mismatching overhang of 4 bp at beginning of the intron instead of picking a perfect match at the beginning of the next exon:

As you can see, --alignSJDBoverhangMin is set to 3 and the length of the overhang is 4bp. Also the splice junction should be canonical so a spliced mapping should carry no penalty. Am I doing something wrong?

alexdobin commented 3 years ago

Hi @ftucos

is this junction annotated in the GTF you used to generate the genome? This is necessary to detect junctions with short overhangs. It should be the actual junction in a transcript, not just the exons. If this is the case, then you would need to look further to the left if there are any mismatches/indels close by - this may prevent the detection of the junction. Other reads seem to be spliced properly to this donor/acceptor sites.

Cheers Alex

ftucos commented 3 years ago

Thank you for your reply @alexdobin, what do you mean by "it's annotated in the GFT I used"? I used the genome annotation file provided by Ensembl ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens. The only annotated features for each gene are: Genes, whole transcripts, UTRs, start and stop codons and exons. No junctions are explicitly specified. From my understanding of the STAR manual, STAR would compute the splice junctions at the Genome Indexing step

--sjdbGTFfile specifies the path to the file with annotated transcripts in the standard GTF format. STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping.

De facto, the junction is defined in the sjdbList.fromGTF.out.tab file as expected:

14 64419926 64424803 + 43079

This is the file included in the genomeDir so I don't specify elsewhere the junctions list.

Also, I checked the presence of mutations near the splice junction as you suggested; there are others erroneous overhangs that carry some mismatches in the last part of the exon, but the one I'm showing you in the example is perfectly matched. Also, it's not completely clear to me why some mismatches should promote further mismatches rather than a perfect and canonical spliced match.

alexdobin commented 3 years ago

Hi @ftucos,

De facto, the junction is defined in the sjdbList.fromGTF.out.tab file as expected: 14 64419926 64424803 + 43079 this is exactly what I meant. You are right that the GTF annotate the exons - STAR extract junctions as connections between consecutive exons that belong to the same transcript (i.e. same transcript_id).

Mismatches that are close to the junction boundaries make the junction harder to detect - so, instead, it takes the easier path with unspliced alignments. If you can extract this single read, and send me the sequence, I can try to explain why it's not mapping it "properly".

Cheers Alex

ftucos commented 3 years ago

Here you have some of the sequences I'm struggling with CCTGAAATGGTTAAAGGGGAGTGGATCAAACCTGGGGCAATAGTCATCGACTGTGGGATCAATTATGTTCCCGATG CTGAAATGGTTAAAGGGGAGTGGATCAAACCTGGGGCAATAGTCATCGACTGTGGAATCAATTATGTCGCAGATGA CCTGAAATGGTTAAAGGGGAGTGGATCGAACCTGGGGCAGTAGTCATCGGCTGTGGAATCAATGATGTCCCAAATG

alexdobin commented 3 years ago

Hi Alex,

I have mapped these reads, and they all appear to have one or more mismatches in the close vicinity of the splice junctions. You can see it in the MD tag in SAM or in the BLAT alignment below. Such cases cannot be properly mapped to the splice junctions with short overhang, with the present algorithm.

Chers Alex

1       0       14      64419853        255     76M     *       0       0       CCTGAAATGGTTAAAGGGGAGTGGATCAAACCTGGGGCAATAGTCATCGACTGTGGGATCAATTATGTTCCCGATG    *       NH:i:1  HI:i:1  AS:i:66 nM:i:4  NM:i:4  MD:Z:56A11C2A1G2
2       0       14      64419854        255     76M     *       0       0       CTGAAATGGTTAAAGGGGAGTGGATCAAACCTGGGGCAATAGTCATCGACTGTGGAATCAATTATGTCGCAGATGA    *       NH:i:1  HI:i:1  AS:i:70 nM:i:2  NM:i:2  MD:Z:68C3G3
3       0       14      64419853        255     72M4S   *       0       0       CCTGAAATGGTTAAAGGGGAGTGGATCGAACCTGGGGCAGTAGTCATCGGCTGTGGAATCAATGATGTCCCAAATG    *       NH:i:1  HI:i:1  AS:i:62 nM:i:4  NM:i:4  MD:Z:27A11A9A13T8

alexdobin / STAR

RNAseq variant calling: STAR align promotes mismatched overhanging mappings onto introns instead of calling splicing junctions #1056