alexdobin / STAR

RNA-seq aligner
MIT License

Most relaxed parameters for fusion finding #133

Open Magdoll opened 8 years ago

Magdoll commented 8 years ago

Hi,

I am trying to get STAR to work for finding fusion transcripts in PacBio (Iso-Seq) data. Previously I have been using GMAP coupled with a downstream python script I wrote to find fusion transcripts.

Taking the fusion transcript sequences I found through GMAP (the file IsoSeq_MCF7_polished.fusion.fasta from this dataset) and tweaking STAR parameters as much as I could, the best I could get STAR to output in Chimeric.out.sam is 30 chimeric hits out of the 93 that GMAP identified.

While I have no doubt that some of the GMAP fusions are alignment errors (in particular, I have found that GMAP tends to report a read as chimeric when closer inspection shows it is just two very distant loci on the same chromosome), there definitely appear to be some that STAR should find but does not.

The best STAR parameter set I ended up with is:

STARlong --runMode alignReads --outSAMattributes NH HI NM MD \
--readNameSeparator space --outFilterMultimapScoreRange 1 \
--outFilterMismatchNmax 2000 --scoreGapNoncan -1 --twopassMode None \
--scoreGapGCAG -4 --scoreGapATAC -8 --scoreDelOpen -1 \
--scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 \
--alignEndsType Local --seedSearchStartLmax 20 --seedPerReadNmax 100000 \
--seedPerWindowNmax 1000 --alignTranscriptsPerReadNmax 100000 \
--alignTranscriptsPerWindowNmax 10000 \
--chimScoreMin 1 --chimScoreDropMax 2000 --chimScoreSeparation 1 \
--chimScoreJunctionNonGTAG -1 --chimSegmentMin 5 \
--chimJunctionOverhangMin 1 --chimSegmentReadGapMax 0 \
--chimFilter None  \
--genomeDir ~/share/star_db/hg19 --genomeLoad LoadAndKeep --readFilesIn test.fa 

(The non-chimera-related parameters were previously determined through parameter sweeping and are documented here.)

This gives me 30 chimeric hits.
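A count like this can be reproduced by collecting the distinct read names in Chimeric.out.sam, since one chimeric read usually occupies more than one SAM record. The snippet below is a minimal sketch under that assumption; the helper name `chimeric_read_names` and the example records are made up for illustration and are not part of STAR.

```python
from typing import Iterable, Set


def chimeric_read_names(sam_lines: Iterable[str]) -> Set[str]:
    """Collect the distinct read names (SAM column 1) from alignment records."""
    names: Set[str] = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        names.add(line.split("\t", 1)[0])
    return names


# Made-up records: readA has two chimeric segments, readB has one record.
example = [
    "@HD\tVN:1.4",
    "readA\t0\tchr20\t100\t255\t74M947S\t*\t0\t0\t*\t*",
    "readA\t2048\tchr17\t500\t255\t74S947M\t*\t0\t0\t*\t*",
    "readB\t0\tchr1\t1000\t255\t50M\t*\t0\t0\t*\t*",
]
print(len(chimeric_read_names(example)))  # 2 distinct reads
```

For a real run, the same function can be fed an open file handle, e.g. `with open("Chimeric.out.sam") as fh: print(len(chimeric_read_names(fh)))`.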

Some of the hits that I noticed were reported in Aligned.out.sam but not in Chimeric.out.sam are ones where one locus is spliced but the other locus is unspliced and also very short.

Ex: one of the GMAP-found fusions was:

chr20   NA  exon    49411637    49411710    .   +   .   BCAS4+BCAS3_2
###
chr17   NA  exon    59445688    59445855    .   +   .   BCAS4+BCAS3_1
chr17   NA  exon    59469338    59470190    .   +   .   BCAS4+BCAS3_1

Because the chr20 locus is very short (< 150 bp), the read is reported in Aligned.out.sam with the first 150 bp soft-clipped. Is there any way to increase sensitivity in mapping the soft-clipped part?

I would also give STAR-Fusion a try, but I would like to understand STAR parameters better and at least rescue some of the hits that I believe should be reported.
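One way to list reads like this one is to scan Aligned.out.sam for records whose CIGAR string starts or ends with a long soft clip. The following is a rough sketch, not something STAR provides: the function names, the 100-base threshold, and the example CIGAR are all assumptions made for illustration.

```python
import re
from typing import Tuple

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar: str) -> Tuple[int, int]:
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    ops = [o for o in ops if o[1] != "H"]  # hard clips can flank soft clips
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail


def has_long_clip(cigar: str, min_clip: int = 100) -> bool:
    """True if either end of the alignment is soft-clipped by >= min_clip bases."""
    return max(soft_clips(cigar)) >= min_clip


# Illustrative CIGAR only: 150 clipped bases followed by two spliced exons.
print(soft_clips("150S168M23482N853M"))     # (150, 0)
print(has_long_clip("150S168M23482N853M"))  # True
```

Reads flagged this way are candidates for a missed short chimeric segment; whether the clipped part really maps elsewhere still has to be checked separately.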

alexdobin commented 8 years ago

Hi @Magdoll

I have not yet tried finding fusions in PacBio data with STARlong. Brian Haas (STAR-Fusion author) and I are planning to work on it in the coming weeks. Would it be possible for you to send me a few of these examples where GMAP finds a reasonable fusion and STAR does not? At the very least I would need the sequences of the reads; having the GMAP alignments would be nice as well.

Cheers Alex

Magdoll commented 8 years ago

Hi Alex,

Yes, the test dataset I'm using is a GMAP-based fusion dataset.

Here is the fasta

Here is the GFF

So, if STARlong were to 100% replicate GMAP's findings (which I don't expect it to, because I know some of GMAP's alignments are faulty), then every single sequence in the fasta should be mapped chimerically.
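To see which reads fall short of that expectation, the read names in the FASTA can be compared against the read names that appear in Chimeric.out.sam. This is a small sketch with hypothetical helper names and made-up read names; it assumes the SAM read name equals the FASTA header up to the first whitespace, which is what `--readNameSeparator space` in the command above is meant to give.

```python
from typing import Iterable, List


def fasta_read_names(fasta_lines: Iterable[str]) -> List[str]:
    """Return read names from FASTA headers, truncated at the first whitespace."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def missing_reads(fasta_names: Iterable[str], chimeric_names: Iterable[str]) -> List[str]:
    """Reads present in the FASTA but absent from the chimeric output."""
    found = set(chimeric_names)
    return [name for name in fasta_names if name not in found]


# Made-up input: three reads in the FASTA, two of them reported as chimeric.
fasta = [">read1 extra info", "ACGT", ">read2", "GGCC", ">read3 more", "TTAA"]
chimeric = ["read1", "read3", "read3"]
print(missing_reads(fasta_read_names(fasta), chimeric))  # ['read2']
```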

Hope this helps. Let me know if you need further clarification.

--Liz

alexdobin commented 8 years ago

Hi Liz,

thanks a lot - got the files, exactly what I need. Will update you on the progress in a few days.

Cheers Alex

brianjohnhaas commented 8 years ago

I'll do what I can to help here and integrate this into our STAR-Fusion suite, once we've got the STARlong params optimized.

best,

~b

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

Magdoll commented 8 years ago

Thanks @brianjohnhaas and @alexdobin looking forward to the results!

alexdobin commented 8 years ago

Hi Liz, Brian,

Please find below a progress update on fusion mapping for long reads. I am working on algorithm modifications to resolve most of the remaining problems; this will take 1-2 weeks.

Cheers Alex

The mapping information for each read is added to the fasta file: http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/Fusion_LongReads/results.txt

There were 93 reads in the fasta file supplied by Liz for which GMAP found fusion junctions. STAR only found 14 fusions originally.

After some fixes in the algorithm and parameters:

STAR finds fusions for 53 reads. Most of those agree with GMAP (I only checked chromosomes, and they agree for 50).

For 40 reads STAR does not find a fusion, for the following reasons:

  1. 4 reads: contain a non-canonical junction in the main chimeric segment; this is presently filtered out by STAR. I will add a parameter to allow it; however, it may lead to an increase in false fusions.
  2. 19 reads: fusions of "circular" type, i.e. the acceptor is upstream of the donor on the same strand within a short distance (<1Mb). This is a fundamental problem with the STARlong algorithm: presently it does not look for sub-par alignments in the same genomic window. I am now modifying the algorithm; this should take 1-2 weeks.
  3. 3 reads: the 2nd chimeric segment alignment is not the best in its window. This is related to the previous case, and should get fixed with the algorithm extension.
  4. 2 reads: short (<=10b) junction overhangs at the ends of chimeric alignments. Could be salvaged by requiring larger --alignSJoverhangMin and --alignSJDBoverhangMin. I will create another parameter to control the min junction overhang at the ends.
  5. 1 read: short chimeric overhang (43b) with an indel in the middle; can be salvaged with --seedSearchStartLmax 10.
  6. 8 reads: STAR finds good non-chimeric alignments, which also check out with BLAT. These are probably false fusions from GMAP.
  7. 2 reads: very short chimeric segments by GMAP (<20b); these do not seem reliable to me.
  8. 1 read: very high error rate for one of the exons.
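The "circular" type in case 2 can be expressed as a small predicate on the two breakpoints. This is only a sketch of the definition as worded above, not STAR's internal logic. The `Breakpoint` type and `is_circular_type` function are hypothetical, and the sketch assumes that "upstream" is measured in the direction of transcription, so on the minus strand an upstream acceptor has the larger coordinate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakpoint:
    chrom: str
    pos: int     # 1-based genomic coordinate
    strand: str  # "+" or "-"


def is_circular_type(donor: Breakpoint, acceptor: Breakpoint,
                     max_dist: int = 1_000_000) -> bool:
    """True if the acceptor lies upstream of the donor, same strand, within max_dist."""
    if donor.chrom != acceptor.chrom or donor.strand != acceptor.strand:
        return False
    if abs(donor.pos - acceptor.pos) >= max_dist:
        return False
    if donor.strand == "+":
        return acceptor.pos < donor.pos
    # Assumption: on the minus strand, upstream means a larger coordinate.
    return acceptor.pos > donor.pos


# Made-up coordinates for illustration.
print(is_circular_type(Breakpoint("chr1", 500_000, "+"),
                       Breakpoint("chr1", 200_000, "+")))  # True
print(is_circular_type(Breakpoint("chr1", 500_000, "+"),
                       Breakpoint("chr2", 200_000, "+")))  # False
```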

In summary, after modifications STAR will be able to detect fusions for cases 1-5 (29 reads). Cases 6-7 (10 reads) are likely to be false fusions from GMAP. Case 8 (1 read) is the only one STAR cannot deal with.
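The counts in this update can be cross-checked with a short tally. The numbers are taken directly from the list above. The projected total assumes every read in cases 1-5 is actually recovered, which is a best case rather than a reported result.

```python
total_reads = 93
found_now = 53
# Reads per failure case, keyed by case number from the list above.
not_found = {1: 4, 2: 19, 3: 3, 4: 2, 5: 1, 6: 8, 7: 2, 8: 1}

missing = sum(not_found.values())
recoverable = sum(not_found[case] for case in range(1, 6))  # cases 1-5
likely_false = not_found[6] + not_found[7]                  # cases 6-7
unresolved = not_found[8]                                   # case 8

# The found and not-found reads should account for the whole test set.
assert found_now + missing == total_reads

print(missing, recoverable, likely_false, unresolved)  # 40 29 10 1
print(found_now + recoverable)  # 82 reads if all of cases 1-5 are recovered
```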