alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 502 forks source link

overlapping PE reads missing splice junctions? #538

Open ChristofferFlensburg opened 5 years ago

ChristofferFlensburg commented 5 years ago

Hi Alex, and thanks for building and maintaining the best RNA aligner around! :)

I have problems with overlapping paired end reads not being properly aligned across splice junctions, such as in this screen shot:

screen shot 2018-12-18 at 3 16 56 pm

Most reads align fine, properly splices to the next and previous exon, but some reads don't find the junction and softclips into the right intron and worse: aligns into left intron with mismatches before softclipping. These reads perfectly match the neighbouring exons. I found that the reads that don't find the junction are all read pairs with a lot of overlap, with small insert size.

This observation led me to suspect the --peOverlapNbasesMin options, where STAR can pre-merge overlapping paired end reads if they overlap. I tried setting it to 0 which should switch it off (right?) and to 1000 which should effectively make it never trigger (right?), but both produce the same output. I've played around a lot with different settings otherwise, and keep getting this behaviour. Last call was with these settings:

STAR --runThreadN 5 --genomeDir myGenomeDir --readFilesIn mysample_R1.fastq.gz mySample_R2.fastq.gz --twopassMode Basic --sjdbOverhang 150 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --chimSegmentReadGapMax 6 --scoreGapNoncan 0 --scoreGapGCAG 0 --scoreGapATAC 0 --alignSJstitchMismatchNmax 5 -1 5 5 --chimOutType WithinBAM --chimJunctionOverhangMin 2 --limitSjdbInsertNsj 2273673 --alignSJDBoverhangMin 1 --peOverlapNbasesMin 0 --outFileNamePrefix myOutput/ --readFilesCommand gunzip -c --outSAMtype BAM SortedByCoordinate

I'm not sure where to go next. There are workarounds like pre-merging read pairs, or aligning as single end reads, but I'd like something that doesn't require special treatments of some samples. I am doing variant calling from these alignments, so it's a pretty serious issue, as this sample gives hundreds of splice site variants in this way. :/ Do you have any ideas? What can I give you to help you investigate or reproduce? I can give you a read pair from the above screen shot, if that helps:

read 1 (aligns across left junction, softclips right junction): TTGCGCTAAGGACCATCACACTGTCCTTAATGGACAATGCATAGAGATGGTTCAGCATAACATGGTTGGGCTCAGGGAGTAAGGCTGGGTCACAAGAAATATTAGTGTCTTTGTTAAGAATAACTTGAAGTAGATGAGG read 2 (aligns across right junction, aligns 7 bases into left intron with 2 mismatches, then soft clips): GGGTTGCGCTAAGGACCATCACACTGTCCTTAATGGACAATGCATAGAGATGGTTCAGCATAACATGGTTGGGCTCAGGGAGTAAGGCTGGGTCACAAGAAATATTAGTGTCTTTGTTAAGAATAACTTGAAGTAGATGAG

Thanks again, and all the best! /Christoffer

alexdobin commented 5 years ago

Hi Christoffer,

you are right that the problems here are caused by overlapping mates. If you map the mates separately, the junctions are recovered correctly in each read, I believe. For overlapping mates, you need to use --peOverlapNbasesMin with min_overlap >0. I used --peOverlapNbasesMin 5, and got the fully spliced alignments (see below) for the PE read. Note that I had to reverse-complement the sequence of the 2nd mate (I guess you took the sequences from the SAM output where they are always reported on +strand).

Cheers Alex

1 97 chr1 147159618 255 25M2069N69M659N45M = 147159615 2866 TTGCGCTAAGGACCATCACACTGTCCTTAATGGACAATGCATAGAGATGGTTCAGCATAACATGGTTGGGCTCAGGGAGTAAGGCTGGGTCACAAGAAATATTAGTGTCTTTGTTAAGAATAACTTGAAGTAGATGAGG NH:i:1 HI:i:1 AS:i:285 nM:i:0 1 145 chr1 147159615 255 28M2069N69M659N44M = 147159618 -2866 GGGTTGCGCTAAGGACCATCACACTGTCCTTAATGGACAATGCATAGAGATGGTTCAGCATAACATGGTTGGGCTCAGGGAGTAAGGCTGGGTCACAAGAAATATTAGTGTCTTTGTTAAGAATAACTTGAAGTAGATGAG NH:i:1 HI:i:1 AS:i:285 nM:i:0

ChristofferFlensburg commented 5 years ago

Thanks!

So I thought there were issues with the read pair merging and tried to switch it off, while I needed to switch it on!! :o

I realligned with --peOverlapNbasesMin 5 and it's working better! While I previously had around 10-15 reads with that mismatch, I now only have two! Which makes it unlikely to be called by a variant caller, but still not ideal, so I looked closer at those two:

screen shot 2018-12-20 at 3 22 35 pm

They both have reasonable insert sizes (when you subtract the intron size), so no overlap, but they still align the 7 bases with two mismatches to the intron rather than a 7/7 match (and more with the softclipped part) to the previous exon. The both have mismatched in the last few base pairs before the junction (at low base quality), I guess that makes STAR not recognise the junction? Can you suggest any settings that will make STAR more prone to align across the junction to a matching exon rather than into intron with mismatches? Essentially I never want to align to the intron with mismatches if there is a better match in an exon, even if it's just a single base stretching into the intron.

I realise this follow up question deviates from the title of the issue, and that the end of the year holidays are getting close. This is a lot better than before, so don't worry if you don't have time to look into this at this point, thanks!

If you do have time, the settings were the ones above, with only --peOverlapNbasesMin changed to 5 as you suggested, and the blue read in the screenshot is: CCCACCCATCCTTCCTCCTCATCTACTTCAAGTTATTCTTAACAAAGACACTAATATTTCTTGTGACCCAGCCTTACTCCCTGAGCCCAACCATGTTATGCTGAACCATCTCTATGCATTGTCCATTACGGACAGTGTGAT (you're right, I just "copy read sequence" from IGV, which gave the alignment rather than the read, so reverse complimented! I sent this through an online reverse-compliment converted, so should match the original read. :D)

with the mate (aligning well to the UTR): GAACCCATGCAAGAAAAACTGCCCTTGCAATGAGGCAGCTGTTAAGTAGGGCAGCGGTCTAGCACAAGCCCATTCCTTGGAGAGCCAAGGTAGCAGTGATAGAATCTCCACCCAACAAGGATCCTAAGAGTTTGCTGTTCT

Thanks for your time!

alexdobin commented 5 years ago

Hi Christoffer,

your examples are actually illustrating the largest (I believe) source of errors that remain in STAR alignments, namely the spliced reads with short overhangs (<10-15b) with mismatches close to the splice junction, on the longer overhang side. There is presently no simple way to fix this problem. I have ideas how to fix these misalignments, however, it will require a substantial reworking of the algorithm. You could, in principle, require end-to-end alignments, which will eliminate these alignments. However, it will create a significant number of misalignments to pseudogene, which may generate false SNV calls. I think the best way to deal with such alignments is to filter them after mapping. You could look for alignments that extend past the annotated splice site by <15b and have a mismatch within 15b before the splice site (i.e. on the longer overhang), and remove them or clip them at the splice site. In general, for SNV detection in RNA-seq, I would recommend not trusting mismatches that occur in the last few bases, as they may be caused by the inability of the aligner to find a spliced alignment with short overhang.

The "red" read in your example is a bit worrisome to me, as the overhang seems to be quite large. If you could send me the sequence, I will check why it was not mapped.

Cheers Alex

ChristofferFlensburg commented 5 years ago

Thanks!

I get that these are difficult reads, and trying to find splice variants from RNA-Seq is kindof asking for trouble. Still, sometimes you get these cases of true positives and you can see them clearly in the aligned reads due to intron retention (due in turn to the splice variant). So here I am asking you to not produce any false positives in the 100k+ splice sites genomewide. Sorry about that! :P

Do you happen to have a script around that does the post-alignment read filtering? Or know anyone that has one that works? Just to not reinvent the wheel. Otherwise I can do it myself, no problem. I'm developing a software that includes finding somatic SNVs, so maybe I should just add a check at variant level for RNA-Seq and flag false splice site variants caused by this... If I do the check after variant effect prediction I can restrict to splice site variants and it shouldn't affect runtime much. 🤔

The red read (aligned as in the screenshot above): CTACTTCAAGTTATTCTTAACAAAGACACTAATATTTCTTGTGACCCAGCCTTACTCCCTGAGCCCAACCATGTTATGCTGAACCATCTCTATGCATTTTCCTTTAAGGACAGTGTGTTGGTCCTTAGAATAACCCATCAC The mate (aligned well to the UTR): GGTCAGGTATGGGATCTGAAGGAGGCTATTAAAGGAAGAATGTGAAATGAGAGGAATAAGGGTTCTTGTGGGGCATCTGGTATCATATAAACAACACAGATTGTGCCACATACTCTGGGAATATAGAGAACCCATGCA

Thanks again, good luck and happy new year! :)