alexdobin / STAR

RNA-seq aligner
MIT License

Use STAR to detect gene fusion #780

Open gongyuTang123 opened 4 years ago

gongyuTang123 commented 4 years ago

Hi Alex,

I am trying to detect gene fusions from chimeric alignments. I set --chimSegmentMin to 15, but the chimeric.out.junction file still contains alignments like this:

chr14 92853734 + chrX 110546838 + 2 5 1 A00585:98:HKWH7DSXX:1:1113:31150:16924 92853689 45M155S 110546839 45S155M 2 GRPundef

It seems that only 2 bases map to chrX. I also tried the parameters recommended on the STAR-Fusion page and got a similar result: some lines show only 2 bases mapped in one chimeric alignment. How can we solve this?

Thanks, Gongyu

alexdobin commented 4 years ago

Hi Gongyu,

It maps 45 bases to chr14 and 155 bases to chrX, according to the CIGARs 45M155S and 45S155M. 110546838 is the last base of the fusion intron, and 110546839 is the start of the 155M alignment.

Cheers Alex
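The CIGAR reading described above can be sketched in a few lines of Python. This is only an illustration, not part of STAR: the helper name `aligned_read_bases` is hypothetical, and the field positions (CIGARs in the 12th and 14th whitespace-separated fields) are taken from the example line in this thread rather than from a format specification.

```python
import re

# One CIGAR operation: an (optionally negative) length followed by an op letter.
# "p" is included because paired-end chimeric CIGARs in this thread contain it.
CIGAR_OP = re.compile(r"(-?\d+)([A-Za-z=])")

def aligned_read_bases(cigar):
    """Sum the lengths of operations that align read bases to the reference (M, =, X)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")

# Example junction line quoted earlier in this thread.
line = ("chr14 92853734 + chrX 110546838 + 2 5 1 "
        "A00585:98:HKWH7DSXX:1:1113:31150:16924 "
        "92853689 45M155S 110546839 45S155M 2 GRPundef")
fields = line.split()
first_cigar, second_cigar = fields[11], fields[13]  # assumed positions, see above

print(aligned_read_bases(first_cigar))   # 45 bases on chr14
print(aligned_read_bases(second_cigar))  # 155 bases on chrX
```

Soft-clipped bases (S) are deliberately excluded, so each segment reports only the bases that actually align.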

gongyuTang123 commented 4 years ago

Hi Alex,

Thanks for your reply. I used the CIGAR strings to count how many bases are mapped, and that seems to work for every line I got. For a few lines, however, the CIGAR string looks like this: 18446744073709551615S151M1S. These lines appear when I use the adapter clipping function in STAR.

How could I avoid these lines?

Thanks, Gongyu

alexdobin commented 4 years ago

Hi Gongyu

This looks like a bug. What parameters are you using? Please try to map just one read where you see this bad CIGAR and send me the SAM line.

Cheers Alex

gongyuTang123 commented 4 years ago

These are the parameters I used for STAR mapping:

STAR --readFilesIn $readFileIn \
  --alignIntronMax 100000 \
  --alignIntronMin 20 \
  --alignMatesGapMax 1000000 \
  --alignSJDBoverhangMin 10 \
  --alignSJoverhangMin 8 \
  --alignSoftClipAtReferenceEnds Yes \
  --chimJunctionOverhangMin 12 \
  --chimMainSegmentMultNmax 1 \
  --chimOutType Junctions \
  --chimSegmentMin 12 \
  --chimOutJunctionFormat 1 \
  --genomeDir $update_genome \
  --genomeLoad LoadAndKeep \
  --limitSjdbInsertNsj 1200000 \
  --outFileNamePrefix $out_directory \
  --outFilterIntronMotifs None \
  --outFilterMatchNminOverLread 0.33 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.1 \
  --outFilterMultimapNmax 20 \
  --outFilterScoreMinOverLread 0.33 \
  --outFilterType BySJout \
  --outSAMattributes NH HI AS nM NM ch \
  --outSAMstrandField intronMotif \
  --outSAMtype SAM \
  --outReadsUnmapped Fastx \
  --runThreadN $run_thread \
  --alignSJstitchMismatchNmax 5 -1 5 5 \
  --outSAMattrRGline ID:GRPundef \
  --chimMultimapScoreRange 3 \
  --chimScoreJunctionNonGTAG -4 \
  --chimMultimapNmax 20 \
  --chimNonchimScoreDropMin 10 \
  --peOverlapNbasesMin 12 \
  --peOverlapMMp 0.1 \
  --clip3pAdapterSeq AGATCGGAAGAGCACACGTC AGATCGGAAGAGCGTCGTGT \
  --clip3pAdapterMMp 0.1

The reads are:

read1: TGTTAATGGGCACACTAGGAATTGTGTGCCCCATCTGTTCTCAGAAACCATAATCTACCATGGCTGATCCTGCAGCAAACTGAGAAAACTACGGGAGGCCCTCTTTATTCAGGATTCAGCTTAATTTCTGTTGAGCCTGGAAAAGGGCTTG

read2: CGGGCTGAGACAATGGGGTTTTCTAGATGTACAATCATGCCATCTGCAAACAGGGACAATTTGACTTCCTCTTTTCCTAATTGAATACCCTTTATTTCCTTCTCCTGCCTAATTGCCCTGGCCAGAACACAAATTACACTTCTTAACATGA

The SAM line is:

gi|333031|lcl|HPV16REF.1| 881 + chr21 15263667 - 1 3 0 A00585:123:HW2J7DSXX:1:2205:14181:21355 806 75M76S 15263326 1S149M1S116p18446744073709551615S76M76S 1 GRPundef

We are trying to use STAR to detect viral integration, so part of each read should align to a human chromosome and the rest to the viral genome.
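Until the cause is identified, the affected lines can at least be located automatically. The length 18446744073709551615 equals 2^64 - 1, which is the value a -1 takes when stored in an unsigned 64-bit integer; that fits Alex's suspicion of a bug but does not prove where it arises. The sketch below is an assumption-laden illustration, not a STAR feature: the function names are hypothetical, the field positions are read off the line quoted above, and the 2^63 cutoff is an arbitrary choice for "no real read is this long".

```python
import re

CIGAR_OP = re.compile(r"(-?\d+)([A-Za-z=])")
UINT64_MAX = 2**64 - 1  # 18446744073709551615, the length seen in the bad CIGAR

def has_wrapped_length(cigar, limit=2**63):
    """True if any CIGAR operation length is implausibly large (assumed cutoff)."""
    return any(int(n) >= limit for n, _ in CIGAR_OP.findall(cigar))

def suspicious_reads(lines):
    """Return read names (10th field) whose CIGARs (12th/14th fields) look wrapped."""
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) < 14:
            continue  # header or malformed line; skip it
        if has_wrapped_length(fields[11]) or has_wrapped_length(fields[13]):
            names.append(fields[9])
    return names

# The junction line quoted above.
bad_line = ("gi|333031|lcl|HPV16REF.1| 881 + chr21 15263667 - 1 3 0 "
            "A00585:123:HW2J7DSXX:1:2205:14181:21355 806 75M76S 15263326 "
            "1S149M1S116p18446744073709551615S76M76S 1 GRPundef")

print(suspicious_reads([bad_line]))
```

The returned read names can then be used to pull the matching reads out of the FASTQ files, so that a single read can be mapped on its own as Alex requested.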