Open Magdoll opened 8 years ago
Hi @Magdoll
I have not yet tried finding fusions in PacBio data with STARlong. Brian Haas (STAR-Fusion author) and I are planning to work on it in the coming weeks. Would it be possible for you to send me a few of these examples, where GMAP finds a reasonable fusion and STAR does not? At the very least I would need the sequences of the reads; having the GMAP alignments would be nice as well.
Cheers Alex
Hi Alex,
Yes, the test dataset I'm using is a GMAP-based fusion dataset.
Here is the fasta
Here is the GFF
So, if STARlong were to 100% replicate GMAP's findings (which I don't expect it to, because I know some of GMAP's alignments are faulty), then every single sequence in the fasta should be mapped chimerically.
Hope this helps. Let me know if you need further clarification.
--Liz
Hi Liz,
thanks a lot - got the files, exactly what I need. Will update you on the progress in a few days.
Cheers Alex
I'll do what I can to help here and integrate this into our STAR-Fusion suite, once we've got the STARlong params optimized.
best,
~b
Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas
Thanks @brianjohnhaas and @alexdobin, looking forward to the results!
Hi Liz, Brian,
Please find below a progress update on fusion mapping for long reads. I am working on algorithm modifications to resolve most of the remaining problems; this will take 1-2 weeks.
Cheers Alex
The mapping information for each read is added to the fasta file: http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/Fusion_LongReads/results.txt
There were 93 reads in the fasta file supplied by Liz for which GMAP found fusion junctions. STAR only found 14 fusions originally.
After some fixes in the algorithm and parameters:
STAR finds fusions for 53 reads. Most of those agree with GMAP (I only checked chromosomes, and they agree for 50).
For 40 reads STAR does not find fusion for the following reasons:
In summary, after modifications STAR will be able to detect fusions for cases 1-5 (29 reads). Cases 6-7 (10 reads) are likely to be false fusions from GMAP. Case 8 (1 read) is the only read STAR cannot deal with.
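The chromosome-level comparison mentioned in the update (checking only whether STAR and GMAP place a read's fusion on the same chromosomes) could be sketched roughly as follows. This is an illustration only, not the procedure actually used: the function name and the input layout (read ID mapped to the chromosomes of its fusion segments) are assumptions.

```python
def chromosome_agreement(star_calls, gmap_calls):
    """Split reads called by both tools into agreeing and disagreeing lists.

    Both arguments are hypothetical dicts: read ID -> iterable of chromosome
    names for the fusion segments. Segment order is ignored, so
    ("chr17", "chr20") matches ("chr20", "chr17").
    """
    agree, disagree = [], []
    # Only reads for which both tools report a fusion can be compared.
    for read_id in sorted(star_calls.keys() & gmap_calls.keys()):
        if frozenset(star_calls[read_id]) == frozenset(gmap_calls[read_id]):
            agree.append(read_id)
        else:
            disagree.append(read_id)
    return agree, disagree
```

A check like this says nothing about breakpoint positions, so two calls that "agree" here may still differ in the exact junction.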
Hi,
I am trying to get STAR to work for finding fusion transcripts in PacBio (Iso-Seq) data. Previously I have been using GMAP coupled with a downstream python script I wrote to find fusion transcripts.
Taking the fusion transcript sequences I found through GMAP (the file IsoSeq_MCF7_polished.fusion.fasta from this dataset) and tweaking STAR parameters as much as I could, the best I could get STAR to output in Chimeric.out.sam is 30 chimeric hits out of the 93 that GMAP identified. While I have no doubt that some of the GMAP fusions are alignment errors (in particular, I have found that GMAP tends to report something as chimeric when closer inspection shows it is just two loci very far apart on the same chromosome), there definitely appear to be some that STAR should find but does not.
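One way to flag the suspicious GMAP calls described above (two distant loci on the same chromosome rather than a true inter-chromosomal event) is a simple classification of each candidate's two loci. The sketch below is only an illustration: the function name, the (chrom, start, end) input layout, and the 100 kb distance threshold are all assumptions, not anything taken from GMAP or STAR.

```python
def classify_candidate(locus_a, locus_b, min_same_chrom_gap=100_000):
    """Classify a two-locus fusion candidate.

    Each locus is a hypothetical (chrom, start, end) tuple. The gap
    threshold is an arbitrary illustrative value, not a recommended setting.
    """
    chrom_a, start_a, end_a = locus_a
    chrom_b, start_b, end_b = locus_b
    if chrom_a != chrom_b:
        return "inter-chromosomal"
    # Order the loci by start coordinate, then measure the gap between them.
    (_, first_end), (second_start, _) = sorted([(start_a, end_a), (start_b, end_b)])
    gap = max(0, second_start - first_end)
    if gap >= min_same_chrom_gap:
        return "same-chromosome, distant"
    return "same-chromosome, nearby"
```

Candidates in the "same-chromosome, distant" class are the ones worth inspecting by hand before counting them as fusions that STAR missed.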
The best STAR parameter I ended up using is:
(the non-chimera related parameters were previously determined through parameter sweeping and documented here)
This gives me 30 chimeric hits.
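For anyone wanting to reproduce a count like this, here is a minimal sketch that tallies how many of the input FASTA sequences show up in Chimeric.out.sam. It assumes read names in the SAM file match the first whitespace-delimited token of each FASTA header; the function names are made up for illustration.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def sam_read_names(lines):
    """Return the set of query names from SAM lines, skipping '@' header lines."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def chimeric_summary(fasta_lines, chimeric_sam_lines):
    """Count how many FASTA sequences have at least one chimeric SAM record."""
    expected = fasta_ids(fasta_lines)
    found = sam_read_names(chimeric_sam_lines) & expected
    return {
        "expected": len(expected),
        "found": len(found),
        "missed": sorted(expected - found),
    }
```

Typical usage would be to pass two open file handles, for example `chimeric_summary(open("IsoSeq_MCF7_polished.fusion.fasta"), open("Chimeric.out.sam"))`. Counting distinct read names rather than SAM lines matters because a chimeric read spans several records.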
Some of the hits I noticed that were reported in Aligned.out.sam but not in Chimeric.out.sam are ones where one locus is spliced but the other locus is unspliced and also very short. Ex: one of the GMAP-found fusions was:
Because the chr20 locus is very short (< 150 bp), the read is reported in Aligned.out.sam with the first 150 bp soft-clipped. Is there any way to increase sensitivity in mapping the soft-clipped part? I would also give STAR-Fusion a try, but I would like to understand the STAR parameters better and at least rescue some of the hits that I believe should be reported.
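To list the reads affected by this, a rough sketch like the following scans Aligned.out.sam for records whose CIGAR starts or ends with a long soft clip. The function names and the 100 bp cutoff are assumptions chosen for illustration; the CIGAR parsing follows the standard SAM operator set.

```python
import re

# Standard SAM CIGAR operators: length followed by one operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clip lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips may sit outside soft clips, so drop them before checking ends.
    core = [item for item in ops if item[1] != "H"]
    if not core:
        return (0, 0)
    leading = core[0][0] if core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return (leading, trailing)


def reads_with_long_clips(sam_lines, min_clip=100):
    """Return (read name, reference, leading, trailing) for long soft clips.

    min_clip is an arbitrary illustrative cutoff. CIGAR order follows the
    reference-forward orientation, so for a reverse-strand alignment the
    start of the original read corresponds to the trailing clip.
    """
    hits = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == "*":
            continue
        leading, trailing = soft_clip_lengths(fields[5])
        if max(leading, trailing) >= min_clip:
            hits.append((fields[0], fields[2], leading, trailing))
    return hits
```

Reads reported here but absent from Chimeric.out.sam are the candidates where a short second locus may have been left unaligned.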