GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
190 stars 24 forks source link

Fusion gene analysis? #433

Open apsteinberg opened 5 months ago

apsteinberg commented 5 months ago

Hi there,

Thank you again for creating this wonderful tool! I had a question regarding your fusion gene analysis in your recent biorxiv paper: https://www.biorxiv.org/content/10.1101/2021.04.21.440736v1

In the methods, you describe aligning reads to a fusion gene annotation set from JAFFAL, then running bambu on these fusion alignments to discover fusion transcripts. I had a few questions about this:

1) Do you have a pipeline to do this, and if so, would you be willing to share the code? The results you show in Fig 7 of the paper are striking, and I would love to run such an analysis on the datasets I'm working with. 2) For the "fusion gene set" that you describe aligning to in the manuscript, is this just the cDNA fasta file output from JAFFAL? And did you somehow extract reads specific to these regions to perform this alignment or just align the entire fastq to these gene fusions? 3) When bambu was then run on the fusion alignments to discover fusion transcripts, was this done using the "fusion mode"? Further, was the reference gtf just ensembl or was bambu run in denovo mode here?

Thanks for your time and help.

Sincerely, Asher

andredsim commented 5 months ago

Hi Asher,

  1. We do have a draft of a tutorial on how to do this. Just give us a week to polish it up a bit and I will update you here once it is ready.
  2. Yes the fusion gene set uses the fasta file output from JAFFAL which is a scaffold of the two gene genomic regions concatenated next to each other (not cDNA). This means that splice junctions over the breakpoint can still be detected with mapping. Then we only map the reads that already mapped to these regions originally.
  3. Yes fusion mode was used, as when running bambu with the fusion inputs, the input data looks very different from a standard sequencing run so a lot of the assumptions bambu has are broken, and fusion mode circumvents this. We created a new gtf file custom to the breakpoint fasta file which combines the references for each fusion gene pair. The tutorial will cover how this is made.

Thanks for your continued interest in using Bambu.

Kind Regards, Andre Sim

apsteinberg commented 5 months ago

Hi Andre,

This sounds great, I look forward to checking out the tutorial! For questions 2-3, this makes sense, thanks for explaining.

Thanks again for your time and help.

Best wishes, Asher

andredsim commented 5 months ago

Hi Asher,

You can find the draft tutorial here. https://github.com/GoekeLab/sg-nex-data/blob/add_fusion_tutorial/docs/SG-NEx_IdentifyingFusionIsoform_tutorial.rmd

We still need to make some adjustments as when this was originally written the jaffal output was slightly different, however this should be enough to get you started if you are willing to make some adjustments yourself. I am happy to help with any small issues that may arise, otherwise we will be wanted to release a polished version in a couple of weeks, for which we can provide full support.

Kind Regards, Andre Sim

apsteinberg commented 5 months ago

Hi Andre,

This looks great, thanks for sending this! I'm going to start looking into this in the next few days, and I'll plan on reaching out if I have further questions.

Best, Asher

apsteinberg commented 4 months ago

Hi Andre,

I've gone through the tutorial which was super helpful! I do have a few follow-up questions though as I move onto applying this to my own data:

  1. Is it not a good idea to take the fasta file output from JAFFAL itself and use this to re-align the reads?
  2. My data has been aligned to Gencode v45. I'm wondering how you generated the gene and transcript annotation tsv files that are used to generate the fusion gene sequences? Or can these be found somewhere on the ensembl site?

Thanks for your help.

Asher

andredsim commented 4 months ago

Hi Asher,

  1. You can use the output from JAFFAL. The outputs should be the same, I believe this was included in the tutorial to show how it could be done if you didn't have the JAFFAL output. Just make sure you use that fasta for the alignment and subsequent code.

  2. Here is a helper function that can help you produce these files.

saveEnsemblTranscripts<-function(outputPrefix,genome, gtfFile='',gtf.dataSource='', host=NA,
                                 dataset = "mmusculus_gene_ensembl",
                                 biomart='ENSEMBL_MART_ENSEMBL')
{
  ## NOTE: ADD TRANSCRIPT ENSEMBL IDs WITH VERSION NUMBER (TX INDEX!)
  ## ensembl_transcript_id.version
  show('Please Check Ensembl Version: http://genome.ucsc.edu/goldenpath/releaseLog.html')
  library(biomaRt)
  library(GenomicAlignments)
  library(GenomicFeatures)
  if(is.na(host)){
    ensembl = useMart(biomart='ENSEMBL_MART_ENSEMBL',dataset=dataset) ## version 75
  } else {
    ensembl = useMart(host=host, biomart='ENSEMBL_MART_ENSEMBL',dataset=dataset) ## version 75
  }

  if(gtfFile!='')
  {
    txdbEnsGene= makeTxDbFromGFF(file=gtfFile, format="gtf",  dataSource=gtf.dataSource, organism="Homo sapiens")
  }
  else
  {
    if(is.na(host)){
      txdbEnsGene= makeTxDbFromBiomart(biomart=biomart,dataset=dataset,id_prefix="ensembl_")
    } else {
      txdbEnsGene= makeTxDbFromBiomart(biomart=biomart,dataset=dataset,id_prefix="ensembl_",host=host)
    }

  }
  genesTxDb=genes(txdbEnsGene)
  geneNamesTMP=names(genesTxDb)
  if(any(grepl('\\.',geneNamesTMP)))  # Gene_Id or gene_id plus version as in gencode
  {
    geneNamesTMP = gsub('(.*)\\..*','\\1',geneNamesTMP)
  }
  ensemblAnnotation=getBM(attributes=c('ensembl_gene_id',
                                       ifelse(genome == 'Grch38','hgnc_symbol','mgi_symbol'),
                                       'gene_biotype','description','chromosome_name','start_position','end_position','strand'), filters = 'ensembl_gene_id', values = geneNamesTMP, mart = ensembl)
  ensemblAnnotation=merge(ensemblAnnotation,geneNamesTMP,by.x='ensembl_gene_id',by.y=1,all=T) # include any gene names that were missing from biomart, should not happen
  ensemblAnnotation=ensemblAnnotation[!duplicated(ensemblAnnotation$ensembl_gene_id),]  # remove duplicated entries for genes
  rownames(ensemblAnnotation)=ensemblAnnotation$ensembl_gene_id
  ensemblAnnotation=ensemblAnnotation[geneNamesTMP,]  #reorder, select same as exon names
  ##########################ensembl### HERE ############################
  # ----- Get Annotation for Transcripts -----
  transcriptsTxDb = transcriptLengths(txdbEnsGene)
  transcriptNamesTMP=transcriptsTxDb$tx_name
  if(any(grepl('\\.',transcriptNamesTMP)))  # Gene_Id or gene_id plus version as in gencode
  {
    transcriptNamesTMP = gsub('(.*)\\..*','\\1',transcriptNamesTMP)
  }
  ensemblAnnotationTranscripts=getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id'), filters = 'ensembl_transcript_id', values = transcriptNamesTMP, mart = ensembl)
  ensemblAnnotationTranscripts=merge(ensemblAnnotationTranscripts,transcriptsTxDb,by.x='ensembl_transcript_id',by.y='tx_name',all=T) # include any gene names that were missing from biomart
  ensemblAnnotationTranscripts=merge(ensemblAnnotationTranscripts,ensemblAnnotation,by='ensembl_gene_id',all.x=T,all.y=F) # Merge with gene annotations
  ensemblAnnotationTranscripts=ensemblAnnotationTranscripts[!duplicated(ensemblAnnotationTranscripts$ensembl_transcript_id),]
  rownames(ensemblAnnotationTranscripts)=ensemblAnnotationTranscripts$ensembl_transcript_id
  ensemblAnnotationTranscripts=ensemblAnnotationTranscripts[transcriptNamesTMP,]  #reorder, select same as exon names
  write.table(ensemblAnnotation,file=paste(outputPrefix,'-genes.txt',sep=''),col.names=T,row.names=T,sep='\t',quote=F)
  write.table(ensemblAnnotationTranscripts,file=paste(outputPrefix,'-transcripts.txt',sep=''),col.names=T,row.names=T,sep='\t',quote=F)
  saveDb(txdbEnsGene,file=paste(outputPrefix,'-txdb.sqlite',sep=''))
}

Here is how you can run it for humans. Be sure to change the paths.

parameters <- list(genome = 'Grch38')
ensembl_version <- 91+c(0,3)[as.numeric(parameters[['genome']]=='Grch38')+1]
ensembl_CAP <- paste0(toupper(substr(parameters[['genome']],1,3)),substr(parameters[['genome']],4,6))
saveEnsemblTranscripts(outputPrefix = paste0('~/Desktop/temp/','/Homo_sapiens.',ensembl_CAP,'.',ensembl_version,'.annotations'),
                       genome = parameters[['genome']],
                       dataset = "hsapiens_gene_ensembl",
                       biomart='ENSEMBL_MART_ENSEMBL',
                       gtfFile = "~/Desktop/Homo_sapiens.GRCh38.91.sorted.gtf")

Hope this helps, Andre Sim