bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
993 stars 354 forks source link

Add support for Arriba the DREAM fusion calling winner #2751

Closed dauss75 closed 5 years ago

dauss75 commented 5 years ago

Hello,

I find arriba very attractive for rna fusion.
https://github.com/suhrig/arriba

Should I try to add this myself or it's better to request to bcbio? Many thanks!

roryk commented 5 years ago

Hi Segun,

Nice one, that looks pretty awesome. We are definitely up for taking pull requests but we could also work on adding it together. The first step would be running and making sure it actually runs and we can reproduce the results. Then we can take what we did and formalize it in bcbio. Have you run arriba yet? I think we already produce the STAR alignments with the proper flags so it would be nice to figure out if we could hook up what we are already doing to arriba.

dauss75 commented 5 years ago

Thanks Rory. We use bcbio pipeline and my colleague had been testing in parallel hoping to integrate with bcbio later if we observe decent performance. So far the results are promising and outperforming the two fusion tools available in bcbio for our target genes. Please let me know if you need any help from my end.

roryk commented 5 years ago

Thanks, Sebastian was super nice and updated the bioconda recipe so it will be compatible with bcbio, so that's one step down.

dauss75 commented 5 years ago

Thanks Rory for the follow-up. It's very exciting to see the progress. Please keep me posted. Regards, Segun

roryk commented 5 years ago

I have this functional for GRCh37 and am testing it now on https://www.ncbi.nlm.nih.gov/bioproject/PRJNA252360/ to make sure the results are reasonable on a toy dataset. @areddy965 has some nice real samples to test it on once we finish having it integrated.

dauss75 commented 5 years ago

Thanks so much for the hard work. That is really exciting news! Can’t wait to try this tool in bcbio.

roryk commented 5 years ago

I have implemented support for arriba with GRCh37 only. If you update your data, installation and tools with:

bcbio_nextgen.py upgrade -u development --tools --data --genomes GRCh37

it will pull in the fusion blacklist files from arriba and will run arriba if you use STAR as the aligner and add arriba to the fusion_callers list. I've run it and it is functional, but haven't looked at the output yet to see if it is doing a reasonable job. But this is a good starting point for us to figure out if it is doing a good job and how we can tweak it to doing better. I'm out tomorrow but will look at the results later on this week, and am now working on hg38 support, which is going to be more involved as arriba needs the STAR aligner and STAR doesn't support hg38 with the alts, so we have to add an alt-free STAR index.

ohofmann commented 5 years ago

That's awesome, thanks Rory. We'll have a look at it as well; I'm particularly interested to see if it can handle input from other aligners, particularly DRAGEN, which would allow us to support hg38 that way.

dauss75 commented 5 years ago

Awesome. on my vacation, but I am super excited to give a try when returning to work.

dauss75 commented 5 years ago

got updated and can see arriba is available. thanks. Should I simply add arriba under fusion_caller in yaml and it would work?

roryk commented 5 years ago

Hi Segun,

Yup that will do it-- when I'm done with the full release I'll update all the documentation.

dauss75 commented 5 years ago

awesome

ohofmann commented 5 years ago

Works well - it produces less noise than Pizzly, verdict is out whether it's also as sensitive. I can only test it on samples with single (driver) fusions; would love to get my hands on a public test set.

ohofmann commented 5 years ago

Alright. It works functionally and calls some events, but missed 5/5 driver events that I've now looked at, some with strong evidence and called by Pizzly. Even looking at the unfiltered output Pizzly has basic support for some of the fusions where Arriba sees 0/0 splitread/discordant support.

I'll try to run it outside of bcbio to see if I can get different results that way.

ohofmann commented 5 years ago

Running the same sample with the defaults outside of bcbio gives different results that match the expectations. This is with the same STAR install and FASTQs. If the blacklist is the same I would suspect that the GTF might be source (GENCODE19 vs bcbio's ref-transcripts.gtf)?

roryk commented 5 years ago

Thanks Oliver, looping in our talk over Slack here:

Running Arriba with just a:
./arriba -x ../MDX190048_RNA004663_B_ALL_Case_6_T_RNA-ready.bam -o fusions.tsv -O fusions.discarded.tsv -a GRCh37.fa -g ref-transcripts.gtf -b database/blacklist_hg19_hs37d5_GRCh37_2018-11-04.tsv.gz -T -P
Finds the missed fusion and ~20 others that have strong support. That's the bcbio-generated STAR BAM, the bcbio gtf, and their blacklist.

I'm checking right now if the -i parameter is bugged, and that's what is causing uninteresting_contigs to get set. https://arriba.readthedocs.io/en/latest/internal-algorithm/#read-level-filters describes what uninteresting_contigs means.

roryk commented 5 years ago

Ok! I think that is it:

with -i set

Loading annotation from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.gtf'
Loading assembly from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/seq/GRCh37.fa'
Reading chimeric alignments from '../align/SRR1659952/SRR1659952_star/SRR1659952.bam' (total=2882489)
Filtering multi-mappers and single mates (remaining=2882489)
Detecting strandedness (no)
Annotating alignments
Filtering duplicates (remaining=2142320)
Filtering mates which do not map to interesting contigs (1) (remaining=73956)
Estimating mate gap distributionWARNING: not enough chimeric reads to estimate mate gap distribution, using default values
Filtering read-through fragments with a distance <=10000bp (remaining=71667)
Filtering inconsistently clipped mates (remaining=71446)
Filtering breakpoints adjacent to homopolymers >=6nt (remaining=71172)
Filtering fragments with small insert size (remaining=13468)
Filtering alignments with long gaps (remaining=13468)
Filtering fragments with both mates in the same gene (remaining=13280)
Filtering fusions arising from hairpin structures (remaining=11775)
Filtering reads with a mismatch p-value <=0.01 (remaining=10967)
Filtering reads with low entropy (k-mer content >=60%) (remaining=10836)
Finding fusions and counting supporting readsWARNING: Some fusions were subsampled, because they have more than 300 supporting reads
 (total=11361)
Merging adjacent fusion breakpoints (remaining=11354)
Estimating expected number of fusions by random chance (e-value)
Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=11026)
Filtering intragenic fusions with both breakpoints in exonic regions (remaining=10467)
Filtering fusions with <2 supporting reads (remaining=1682)
Filtering fusions with an e-value >=0.3 (remaining=814)
Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=784)
Filtering PCR fusions between genes with an expression above the 99.8% quantile (remaining=782)
Searching for fusions with spliced split reads (remaining=783)
Selecting best breakpoints from genes with multiple breakpoints (remaining=69)
Searching for fusions with >=4 spliced events (remaining=69)
Filtering blacklisted fusions in '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/fusion-blacklist/arriba-blacklist.tsv.gz' (remaining=12)
Filtering fusions with anchors <=23nt (remaining=12)
Filtering end-to-end fusions with low support (remaining=11)
Filtering fusions with no coverage around the breakpoints (remaining=11)
Indexing gene sequences
Filtering genes with >=30% identity (remaining=9)
Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=6)
Selecting best breakpoints from genes with multiple breakpoints (remaining=6)
Searching for additional isoforms (remaining=6)
Assigning confidence scores to events
Writing fusions to file 'SRR1659952-fusions.tsv'
Writing discarded fusions to file 'SRR1659952-fusions.discarded.tsv'

without -i set

Loading annotation from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.gtf'
Loading assembly from '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/seq/GRCh37.fa'
Reading chimeric alignments from '../align/SRR1659952/SRR1659952_star/SRR1659952.bam'
 (total=2882489)
Filtering multi-mappers and single mates (remaining=2882489)
Detecting strandedness (no)
Annotating alignments
Filtering duplicates (remaining=2142320)
Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y) (remaining=1543505)
Estimating mate gap distribution (mean=41.196, stddev=28.0094)
Filtering read-through fragments with a distance <=10000bp (remaining=1514872)
Filtering inconsistently clipped mates (remaining=1509037)
Filtering breakpoints adjacent to homopolymers >=6nt (remaining=1474665)
Filtering fragments with small insert size (remaining=917037)
Filtering alignments with long gaps (remaining=917037)
Filtering fragments with both mates in the same gene (remaining=915470)
Filtering fusions arising from hairpin structures (remaining=900624)
Filtering reads with a mismatch p-value <=0.01 (remaining=850576)
Filtering reads with low entropy (k-mer content >=60%) (remaining=836516)
Finding fusions and counting supporting readsWARNING: Some fusions were subsampled, because they have more than 300 supporting reads
 (total=894138)
Merging adjacent fusion breakpoints (remaining=894078)
Estimating expected number of fusions by random chance (e-value)
Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=891730)
Filtering intragenic fusions with both breakpoints in exonic regions (remaining=886683)
Filtering fusions with <2 supporting reads (remaining=60344)
Filtering fusions with an e-value >=0.3 (remaining=16674)
Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=16232)
Filtering PCR fusions between genes with an expression above the 99.8% quantile (remaining=4999)
Searching for fusions with spliced split reads (remaining=5021)
Selecting best breakpoints from genes with multiple breakpoints (remaining=523)
Searching for fusions with >=4 spliced events (remaining=527)
Filtering blacklisted fusions in '/n/app/bcbio/biodata/genomes/Hsapiens/GRCh37/rnaseq/fusion-blacklist/arriba-blacklist.tsv.gz' (remaining=147)
Filtering fusions with anchors <=23nt (remaining=91)
Filtering end-to-end fusions with low support (remaining=77)
Filtering fusions with no coverage around the breakpoints (remaining=69)
Indexing gene sequences
Filtering genes with >=30% identity (remaining=50)
Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=25)
Selecting best breakpoints from genes with multiple breakpoints (remaining=22)
Searching for additional isoforms (remaining=23)
Assigning confidence scores to events
Writing fusions to file 'SRR1659952-fusions.tsv'
Writing discarded fusions to file 'SRR1659952-fusions.discarded.tsv'

I didn't notice it, but with -i set it is just looking at 1. Let me try it with commas instead of spaces for -i.

roryk commented 5 years ago

Yup, that was it-- the help says it can be space or comma separated, but only comma separated works. Will file a bug and push a fix.

ohofmann commented 5 years ago

Oh boy. Thanks for investigating!

ohofmann commented 5 years ago

One feature request: could you expose the -k parameter through the config YAML? Either similar to variant_regions or coverage as a way to feed fusion information to Arriba or (since it's Arriba specific) via the resources/options path. I think this will go a long way to rescue known variants.

roryk commented 5 years ago

Yep, will do.

roryk commented 5 years ago

Okey doke, hg38 is all set for this. I haven't tested it yet but it should work.

dauss75 commented 5 years ago

wow. Thanks a lot. I have been waiting for this since we use hg38.

roryk commented 5 years ago

If you have an hg38 STAR index made, you'll have to nuke it and re-make it so it makes a stripped down one without the ALTs/decoys/HLA regions.

roryk commented 5 years ago

nuke it and run bcbio_nextgen.py upgrade --data --genomes hg38 --aligners star

ohofmann commented 5 years ago

Testing it vs the previous reference data (if I can get the index built...)

roryk commented 5 years ago

Thanks, I'm sure there will be issues so let me know if anything doesn't work. I was torn about whether or not to include the decoy regions and decided against it; we can revisit that decision if it turns out having them is a good idea. I was just going by what the STAR manual suggests to provide.

ohofmann commented 5 years ago

Got it working. hg38 seems to be having issues with the blacklist; I am seeing an increase in fusions called. Re-running the same version across GRCh37 to verify.

roryk commented 5 years ago

Thanks-- I had to reformat the blacklist to match our chromosome naming, maybe arriba doesn't like that, am testing it out now.

ohofmann commented 5 years ago

On the good news side it works well with GRCh37 even without the added known fusion support. I've not yet tested this for hg19; holding off until hg38 works. Thanks for all your work on this, Rory.

roryk commented 5 years ago

Thanks-- sorry I didn't include any of the blacklist stuff for hg19, are you all using that? I can add it if folks are using it.

ohofmann commented 5 years ago

I'll upgrade to the current dev version and re-run to confirm. hg19 should be the same blacklist as GRCh37, think Arriba changes the naming convention internally.

roryk commented 5 years ago

Great, added support for hg19 as well. I also included all of the other files that are needed to generate the arriba plots as part of the data install.

roryk commented 5 years ago

Thanks, this is all set I think. I haven't added the plots yet, but I've tested it on a spike in dataset and it looks like it works well, so I think we have something solid in place we can improve on. Thanks everyone for pushing to get this integrated and for all the help testing!