alexdobin / STAR

RNA-seq aligner
MIT License

Tandem Duplication is Undesirably Soft-Clipped #875

Open DarioS opened 4 years ago

DarioS commented 4 years ago

I identified an important frameshift variant in TP53 from whole-genome DNA sequencing, using Bowtie2 for mapping and Strelka2 for variant calling, and wanted to see whether it is also present in matched RNA sequencing data. However, STAR soft-clips the duplicated sequence. In hg38 coordinates, the variant is called in the DNA at position chr17:7673751:

REF C ALT CGGAGATTCTCTTCCTCTGT

Note that the inserted sequence is identical to the adjacent reference genome sequence; this is a tandem duplication. The start of the variant is marked by the red box in IGV (region marker). STAR, on the other hand, aligns the RNA-seq reads shown below to the reference genome across this region and soft-clips the last 13 bases of the reads, where all the coloured mismatches are shown.

[image: IGV screenshot of the STAR-aligned RNA-seq reads at this locus]

Actually, all of the last 14 bases should be part of the tandem duplication (there is a coincidental G match to the reference sequence immediately before the soft-clipping, indicated by an underline). So STAR should probably report an insertion at chr17:7673770, such that:

REF T ALT TGGAGATTCTCTTCC

I realise that even if STAR reported this indel, I still could not directly compare the Bowtie2 and STAR alignments, because Bowtie2 seems to report the left-most repeated sequence as the variant, whereas STAR reports the right-most. When viewing matched DNA and RNA data in IGV, I often see this difference for insertions, but not for deletions or SNVs.
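The left-most versus right-most placement described above is the usual ambiguity of an insertion inside repeated sequence: the same alternate sequence can be written with the insertion at several positions. A minimal Python sketch on a made-up toy reference (not the TP53 locus) illustrates this. The function names are hypothetical, and nothing here is taken from the Bowtie2, Strelka2 or STAR code.

```python
def apply_insertion(ref, pos, ins):
    """Insert `ins` after 0-based index `pos` of `ref` (pos = -1 means before the first base)."""
    return ref[:pos + 1] + ins + ref[pos + 1:]


def shift_left(ref, pos, ins):
    """Move the insertion as far left as possible without changing the resulting sequence."""
    while pos >= 0 and ref[pos] == ins[-1]:
        ins = ins[-1] + ins[:-1]  # rotate the inserted bases by one
        pos -= 1
    return pos, ins


def shift_right(ref, pos, ins):
    """Move the insertion as far right as possible without changing the resulting sequence."""
    while pos + 1 < len(ref) and ref[pos + 1] == ins[0]:
        ins = ins[1:] + ins[0]  # rotate the inserted bases by one
        pos += 1
    return pos, ins


# Toy reference containing two copies of ATG; a third copy is inserted.
ref = "CCATGATGTT"
left = shift_left(ref, 4, "ATG")
right = shift_right(ref, 4, "ATG")
print("left-most :", left, apply_insertion(ref, *left))
print("right-most:", right, apply_insertion(ref, *right))
```

Both placements produce the same mutated sequence, so comparing calls from two tools requires normalising them to the same convention first.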

alexdobin commented 4 years ago

Hi Dario,

even if STAR could detect an insertion so close to the read end, it would still score the soft-clipped alignment higher than the one with the big insertion.
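The scoring argument can be made concrete with a rough Python sketch. It assumes a deliberately simplified model: +1 per matched base, no mismatches, no penalty for soft-clipping, and an insertion cost of -2 to open plus -2 per inserted base. The two penalties are assumed to correspond to the defaults of `--scoreInsOpen` and `--scoreInsBase` and should be checked against the manual of the STAR version in use. The read layout numbers are illustrative only.

```python
INS_OPEN = -2  # assumed default of --scoreInsOpen
INS_BASE = -2  # assumed default of --scoreInsBase


def score_soft_clipped(matched_before):
    """Only the bases before the clip contribute; clipped bases score nothing."""
    return matched_before


def score_with_insertion(matched_before, ins_len, matched_after):
    """Matched bases on both sides of the insertion, minus the insertion penalty."""
    return matched_before + matched_after + INS_OPEN + INS_BASE * ins_len


def min_flank_to_prefer_insertion(ins_len):
    """Smallest number of matched bases after the insertion that beats soft-clipping."""
    return -(INS_OPEN + INS_BASE * ins_len) + 1


# Illustrative 100-base read: 86 bases match, the last 14 are duplicated sequence.
print(score_soft_clipped(86))               # soft-clipped alignment
print(score_with_insertion(86, 14, 0))      # same read with a 14-base insertion
print(min_flank_to_prefer_insertion(14))    # matched bases needed past the insertion
```

Under these assumptions a 14-base insertion costs 30 points, so the read would need more than 30 matching bases beyond the insertion before the gapped alignment outscores the soft-clipped one.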

If you want to catch such reads in the RNA-seq data, I would recommend creating an artificial "chromosome" containing a few hundred bases around this insertion (or maybe the entire gene region) and including the duplication in the sequence.
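A minimal Python sketch of building such an artificial sequence. The duplicated segment is the inserted sequence from the variant call earlier in this thread; the flanks are made-up placeholders rather than real TP53 sequence, and the contig name and helper name are hypothetical. In practice the flanks would be extracted from the hg38 reference.

```python
def make_duplication_contig(name, left_flank, duplicated, right_flank, width=60):
    """Return a FASTA record: left flank, two tandem copies of the segment, right flank."""
    seq = left_flank + duplicated + duplicated + right_flank
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Placeholder flanks (NOT real genomic sequence); replace with reference sequence.
left_flank = "ACGT" * 25
right_flank = "TGCA" * 25
duplicated = "GGAGATTCTCTTCCTCTGT"  # inserted bases from the ALT allele in this thread

record = make_duplication_contig("TP53_dup_example", left_flank, duplicated, right_flank)
print(record, end="")
```

The record could then be written to its own FASTA file and supplied together with the reference when generating the genome index, since `--genomeFastaFiles` accepts more than one file.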

Cheers Alex

suhrig commented 4 years ago

Hi Alex,

Is there a general solution to this problem? Aligning against an artificial chromosome is only an option if you know what you are looking for. Is there a way to tweak the alignment parameters so that a chimeric alignment is produced? What confuses me is that STAR sometimes reports a chimeric alignment for the clipped part, but at an unrelated locus and with a poorer alignment score than the correct locus (the internal tandem duplication) would produce. Why does the chimeric detection favor a distant poor-quality alignment over a local tandem alignment?

Here are five examples of oncogenic ITDs where STAR does not report a chimeric alignment: 3 different FLT3 ITDs, a BCOR ITD, and a CDKN2A ITD. With some tweaking of the parameters a chimeric alignment is reported for some of them, but not the correct one. Can the parameters be tweaked in a way that a local ITD alignment is preferred?

@BCOR/1
GCTCCTTACTTTCAGGGTTGAAGGCTTCCAAAGATACAGAGGAGCCCAGCAGAGTCTGAATTTCGTTCGTGAATTCCACCAGATCTAACAG
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@BCOR/2
ATTGTCACCATTGCAGAGGCAGAATTTTATCGGCAGGTTTCTGCAAGTCTCTTGTTCTCTTGCTCCAAAGACCTGGAAGCCTTCAACCCTG
+
AAFFFJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJFJJFFFFJJJJJJJJJJJAJJJJJJJJJJJJJJ

@CDKN2A/1
CAGGAAGCCCTCCCGGGCAGCGTCGTGCACGGGTCGGGTGAGAGTGGCGGGGTCGGGTGAGAGTGGCGGGGTCGGCGCAGTTGGGCTCCGCGCCGTGGAGCAGCAGCAGCTCCGCCACTCGGGCGCTGCCCATCATCATGACCTGGTCTTC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJJAJJJJJFJJJJJJJFFJJJJJJJJFJJJJJJJJJJJJJAJJJJJJJJJJFJJJJJJJJJJFJJJJJJJJJJJJJJFJJJ7
@CDKN2A/2
TCCTAGAAGACCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCT
+
AAFFFJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJJJJ<JAJJJJJJJJJJJJJJJJAJFJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJFJJJJFJAJAJJJJJ<-F-A<<-

@FLT3/1
CTGAAATCAACGTGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTGCTTTTTGTACAGATCGGAAGAGCACACGTCTGAACTCCAG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEEEEE/AE/EEEEEEEEEEEEEEEEEEEEEEEE<EAEEEEEEEAEEEEEEEAEEEEA66<AEEEAAAEEE<<<AA
@FLT3/2
GTACAAAAAGCAATTTAGGTATGAAAGCCAGCTACAGATGGTACAGGTGACCGGCTCCTCAGATAATGAGTACTTCTACGTTGATTTCAGAGAATATGAATATGATCACGTTGATTTCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
+
AAAAAEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAE/EEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEAEE<A/AAAAEEAA<EAEE/EE

@FLT3/1
CCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTTCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTAGCATCTGTAGCTGGCTTTCATACCT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA/6//E//6<A/EEE//EE//EA<6A//<E/6</E///6///E//E/A/A///////////<AA////
@FLT3/2
CTCCTCTTCATTGTCGTTTTAACCCTGCTAATTTGTCACAAGTACAAAAAGCAATTTAGGTATGAAAGCCAGCTACAGATGGTACAGGTGACCGGCTCCTCAGATAATGAGTACTTCTACGTTGAAAAGATGGAACATGTTAAAGAATAC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEAEEEEEEEAEEEEEEEEEAEEA<EEEEEAEEEEEEEEEEEEEEEEEEAAEEAA////AE/</</A/A//A//<////E

@FLT3/1
CCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAAATTCATATTCTCTGAAATCAACGTAGAAGTACTCATT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F
@FLT3/2
TGTTTGTCTCCTCTTCATTGTCGTTTTAACCCTGCTAATTTGTCACAAGTACAAAAAGCAATTTAGGTATGAAAGCCAGCTACAGATGGTACAGGTGACCG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFF:FF:FFF:FFFFFFFFFFFFFFF:FFFFF,:FF
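For reference, the duplication is visible in a read without any alignment. The following Python sketch looks for the longest exactly repeated adjacent unit in a read, applied here to the last FLT3/1 read above. It is a naive illustration under stated assumptions (the function name is hypothetical, only perfect copies are found, and a sequencing error inside either copy would hide the repeat), not a replacement for an aligner or an ITD caller.

```python
def longest_tandem_duplication(seq, min_len=10):
    """Return (start, unit_len) of the longest exact adjacent repeat, or None."""
    best = None
    for unit in range(min_len, len(seq) // 2 + 1):
        run = 0  # consecutive positions i where seq[i] == seq[i + unit]
        for i in range(len(seq) - unit):
            run = run + 1 if seq[i] == seq[i + unit] else 0
            if run >= unit:
                # seq[start:start+unit] equals seq[start+unit:start+2*unit]
                best = (i - unit + 1, unit)
                break
    return best


read = ("CCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAA"
        "ATTCATATTCTCTGAAATCAACGTAGAAGTACTCATT")
hit = longest_tandem_duplication(read)
if hit is not None:
    start, unit = hit
    print("duplicated unit:", read[start:start + unit])
```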

Many thanks, Sebastian