Closed dauss75 closed 5 years ago
Hi Segun,
Nice one, that looks pretty awesome. We are definitely up for taking pull requests but we could also work on adding it together. The first step would be running and making sure it actually runs and we can reproduce the results. Then we can take what we did and formalize it in bcbio. Have you run arriba yet? I think we already produce the STAR alignments with the proper flags so it would be nice to figure out if we could hook up what we are already doing to arriba.
Thanks Rory. We use bcbio pipeline and my colleague had been testing in parallel hoping to integrate with bcbio later if we observe decent performance. So far the results are promising and outperforming the two fusion tools available in bcbio for our target genes. Please let me know if you need any help from my end.
Thanks, Sebastian was super nice and updated the bioconda recipe so it will be compatible with bcbio, so that's one step down.
Thanks Rory for the follow-up. It's very exciting to see the progress. Please keep me posted. Regards, Segun
I have this functional for GRCh37 and am testing it now on https://www.ncbi.nlm.nih.gov/bioproject/PRJNA252360/ to make sure the results are reasonable on a toy dataset. @areddy965 has some nice real samples to test it on once we finish having it integrated.
Thanks so much for the hard work. That is really exciting news! Can’t wait to try this tool in bcbio.
I have implemented support for arriba with GRCh37 only. If you update your data, installation and tools with:
bcbio_nextgen.py upgrade -u development --tools --data --genomes GRCh37
it will pull in the fusion blacklist files from arriba and will run arriba if you use STAR as the aligner and add arriba
to the fusion_callers
list. I've run it and it is functional, but haven't looked at the output yet to see if it is doing a reasonable job. But this is a good starting point for us to figure out if it is doing a good job and how we can tweak it to doing better. I'm out tomorrow but will look at the results later on this week, and am now working on hg38 support, which is going to be more involved as arriba needs the STAR aligner and STAR doesn't support hg38 with the alts, so we have to add an alt-free STAR index.
That's awesome, thanks Rory. We'll have a look at it as well; I'm particularly interested to see if it can handle input from other aligners, particularly DRAGEN, which would allow us to support hg38 that way.
Awesome. on my vacation, but I am super excited to give a try when returning to work.
got updated and can see arriba is available. thanks. Should I simply add arriba under fusion_caller in yaml and it would work?
Hi Segun,
Yup that will do it-- when I'm done with the full release I'll update all the documentation.
awesome
Works well - it produces less noise than Pizzly, verdict is out whether it's also as sensitive. I can only test it on samples with single (driver) fusions; would love to get my hands on a public test set.
Alright. It works functionally and calls some events, but missed 5/5 driver events that I've now looked at, some with strong evidence and called by Pizzly. Even looking at the unfiltered output Pizzly has basic support for some of the fusions where Arriba sees 0/0 splitread/discordant support.
I'll try to run it outside of bcbio to see if I can get different results that way.
Running the same sample with the defaults outside of bcbio gives different results that match the expectations. This is with the same STAR install and FASTQs. If the blacklist is the same I would suspect that the GTF might be source (GENCODE19
vs bcbio's ref-transcripts.gtf
)?
Thanks Oliver, looping in our talk over Slack here:
Running Arriba with just a:
./arriba -x ../MDX190048_RNA004663_B_ALL_Case_6_T_RNA-ready.bam -o fusions.tsv -O fusions.discarded.tsv -a GRCh37.fa -g ref-transcripts.gtf -b database/blacklist_hg19_hs37d5_GRCh37_2018-11-04.tsv.gz -T -P
Finds the missed fusion and ~20 others that have strong support. That's the bcbio-generated STAR BAM, the bcbio gtf, and their blacklist.
I'm checking right now if the -i
parameter is bugged, and that's what is causing uninteresting_contigs
to get set. https://arriba.readthedocs.io/en/latest/internal-algorithm/#read-level-filters describes what uninteresting_contigs
means.
Ok! I think that is it:
with -i set
Loading annotation from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.gtf'
Loading assembly from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/seq/GRCh37.fa'
Reading chimeric alignments from '../align/SRR1659952/SRR1659952_star/SRR1659952.bam' (total=2882489)
Filtering multi-mappers and single mates (remaining=2882489)
Detecting strandedness (no)
Annotating alignments
Filtering duplicates (remaining=2142320)
Filtering mates which do not map to interesting contigs (1) (remaining=73956)
Estimating mate gap distributionWARNING: not enough chimeric reads to estimate mate gap distribution, using default values
Filtering read-through fragments with a distance <=10000bp (remaining=71667)
Filtering inconsistently clipped mates (remaining=71446)
Filtering breakpoints adjacent to homopolymers >=6nt (remaining=71172)
Filtering fragments with small insert size (remaining=13468)
Filtering alignments with long gaps (remaining=13468)
Filtering fragments with both mates in the same gene (remaining=13280)
Filtering fusions arising from hairpin structures (remaining=11775)
Filtering reads with a mismatch p-value <=0.01 (remaining=10967)
Filtering reads with low entropy (k-mer content >=60%) (remaining=10836)
Finding fusions and counting supporting readsWARNING: Some fusions were subsampled, because they have more than 300 supporting reads
(total=11361)
Merging adjacent fusion breakpoints (remaining=11354)
Estimating expected number of fusions by random chance (e-value)
Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=11026)
Filtering intragenic fusions with both breakpoints in exonic regions (remaining=10467)
Filtering fusions with <2 supporting reads (remaining=1682)
Filtering fusions with an e-value >=0.3 (remaining=814)
Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=784)
Filtering PCR fusions between genes with an expression above the 99.8% quantile (remaining=782)
Searching for fusions with spliced split reads (remaining=783)
Selecting best breakpoints from genes with multiple breakpoints (remaining=69)
Searching for fusions with >=4 spliced events (remaining=69)
Filtering blacklisted fusions in '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/fusion-blacklist/arriba-blacklist.tsv.gz' (remaining=12)
Filtering fusions with anchors <=23nt (remaining=12)
Filtering end-to-end fusions with low support (remaining=11)
Filtering fusions with no coverage around the breakpoints (remaining=11)
Indexing gene sequences
Filtering genes with >=30% identity (remaining=9)
Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=6)
Selecting best breakpoints from genes with multiple breakpoints (remaining=6)
Searching for additional isoforms (remaining=6)
Assigning confidence scores to events
Writing fusions to file 'SRR1659952-fusions.tsv'
Writing discarded fusions to file 'SRR1659952-fusions.discarded.tsv'
without -i set
Loading annotation from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.gtf'
Loading assembly from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/seq/GRCh37.fa'
Reading chimeric alignments from '../align/SRR1659952/SRR1659952_star/SRR1659952.bam'
(total=2882489)
Filtering multi-mappers and single mates (remaining=2882489)
Detecting strandedness (no)
Annotating alignments
Filtering duplicates (remaining=2142320)
Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y) (remaining=1543505)
Estimating mate gap distribution (mean=41.196, stddev=28.0094)
Filtering read-through fragments with a distance <=10000bp (remaining=1514872)
Filtering inconsistently clipped mates (remaining=1509037)
Filtering breakpoints adjacent to homopolymers >=6nt (remaining=1474665)
Filtering fragments with small insert size (remaining=917037)
Filtering alignments with long gaps (remaining=917037)
Filtering fragments with both mates in the same gene (remaining=915470)
Filtering fusions arising from hairpin structures (remaining=900624)
Filtering reads with a mismatch p-value <=0.01 (remaining=850576)
Filtering reads with low entropy (k-mer content >=60%) (remaining=836516)
Finding fusions and counting supporting readsWARNING: Some fusions were subsampled, because they have more than 300 supporting reads
(total=894138)
Merging adjacent fusion breakpoints (remaining=894078)
Estimating expected number of fusions by random chance (e-value)
Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=891730)
Filtering intragenic fusions with both breakpoints in exonic regions (remaining=886683)
Filtering fusions with <2 supporting reads (remaining=60344)
Filtering fusions with an e-value >=0.3 (remaining=16674)
Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=16232)
Filtering PCR fusions between genes with an expression above the 99.8% quantile (remaining=4999)
Searching for fusions with spliced split reads (remaining=5021)
Selecting best breakpoints from genes with multiple breakpoints (remaining=523)
Searching for fusions with >=4 spliced events (remaining=527)
Filtering blacklisted fusions in '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/fusion-blacklist/arriba-blacklist.tsv.gz' (remaining=147)
Filtering fusions with anchors <=23nt (remaining=91)
Filtering end-to-end fusions with low support (remaining=77)
Filtering fusions with no coverage around the breakpoints (remaining=69)
Indexing gene sequences
Filtering genes with >=30% identity (remaining=50)
Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=25)
Selecting best breakpoints from genes with multiple breakpoints (remaining=22)
Searching for additional isoforms (remaining=23)
Assigning confidence scores to events
Writing fusions to file 'SRR1659952-fusions.tsv'
Writing discarded fusions to file 'SRR1659952-fusions.discarded.tsv'
I didn't notice it, but with -i
set it is just looking at 1
. Let me try it with commas instead of spaces for -i
.
Yup, that was it-- the help says it can be space or comma separated, but only comma separated works. Will file a bug and push a fix.
Oh boy. Thanks for investigating!
One feature request: could you expose the -k
parameter through the config YAML? Either similar to variant_regions
or coverage
as a way to feed fusion information to Arriba or (since it's Arriba specific) via the resources/options path. I think this will go a long way to rescue known variants.
Yep, will do.
Okey doke, hg38 is all set for this. I haven't tested it yet but it should work.
wow. Thanks a lot. I have been waiting for this since we use hg38.
If you have an hg38 STAR index made, you'll have to nuke it and re-make it so it makes a stripped down one without the ALTs/decoys/HLA regions.
nuke it and run bcbio_nextgen.py upgrade --data --genomes hg38 --aligners star
Testing it vs the previous reference data (if I can get the index built...)
Thanks, I'm sure there will be issues so let me know if anything doesn't work. I was torn about whether or not to include the decoy regions and decided against it; we can revisit that decision if it turns out having them is a good idea. I was just going by what the STAR manual suggests to provide.
Got it working. hg38 seems to be having issues with the blacklist; I am seeing an increase in fusions called. Re-running the same version across GRCh37 to verify.
Thanks-- I had to reformat the blacklist to match our chromosome naming, maybe arriba doesn't like that, am testing it out now.
On the good news side it works well with GRCh37 even without the added known fusion support. I've not yet tested this for hg19; holding off until hg38 works. Thanks for all your work on this, Rory.
Thanks-- sorry I didn't include any of the blacklist stuff for hg19, are you all using that? I can add it if folks are using it.
I'll upgrade to the current dev version and re-run to confirm. hg19 should be the same blacklist as GRCh37, think Arriba changes the naming convention internally.
Great, added support for hg19 as well. I also included all of the other files that are needed to generate the arriba plots as part of the data install.
Thanks, this is all set I think. I haven't added the plots yet, but I've tested it on a spike in dataset and it looks like it works well, so I think we have something solid in place we can improve on. Thanks everyone for pushing to get this integrated and for all the help testing!
Hello,
I find arriba very attractive for rna fusion.
https://github.com/suhrig/arriba
Should I try to add this myself or it's better to request to bcbio? Many thanks!