Open apsteinberg opened 5 months ago
Hi Asher,
Thanks for your continued interest in using Bambu.
Kind Regards, Andre Sim
Hi Andre,
This sounds great, I look forward to checking out the tutorial! For questions 2-3, this makes sense, thanks for explaining.
Thanks again for your time and help.
Best wishes, Asher
Hi Asher,
You can find the draft tutorial here. https://github.com/GoekeLab/sg-nex-data/blob/add_fusion_tutorial/docs/SG-NEx_IdentifyingFusionIsoform_tutorial.rmd
We still need to make some adjustments as when this was originally written the jaffal output was slightly different, however this should be enough to get you started if you are willing to make some adjustments yourself. I am happy to help with any small issues that may arise, otherwise we will be wanted to release a polished version in a couple of weeks, for which we can provide full support.
Kind Regards, Andre Sim
Hi Andre,
This looks great, thanks for sending this! I'm going to start looking into this in the next few days, and I'll plan on reaching out if I have further questions.
Best, Asher
Hi Andre,
I've gone through the tutorial which was super helpful! I do have a few follow-up questions though as I move onto applying this to my own data:
Thanks for your help.
Asher
Hi Asher,
You can use the output from JAFFAL. The outputs should be the same, I believe this was included in the tutorial to show how it could be done if you didn't have the JAFFAL output. Just make sure you use that fasta for the alignment and subsequent code.
Here is a helper function that can help you produce these files.
saveEnsemblTranscripts<-function(outputPrefix,genome, gtfFile='',gtf.dataSource='', host=NA,
dataset = "mmusculus_gene_ensembl",
biomart='ENSEMBL_MART_ENSEMBL')
{
## NOTE: ADD TRANSCRIPT ENSEMBL IDs WITH VERSION NUMBER (TX INDEX!)
## ensembl_transcript_id.version
show('Please Check Ensembl Version: http://genome.ucsc.edu/goldenpath/releaseLog.html')
library(biomaRt)
library(GenomicAlignments)
library(GenomicFeatures)
if(is.na(host)){
ensembl = useMart(biomart='ENSEMBL_MART_ENSEMBL',dataset=dataset) ## version 75
} else {
ensembl = useMart(host=host, biomart='ENSEMBL_MART_ENSEMBL',dataset=dataset) ## version 75
}
if(gtfFile!='')
{
txdbEnsGene= makeTxDbFromGFF(file=gtfFile, format="gtf", dataSource=gtf.dataSource, organism="Homo sapiens")
}
else
{
if(is.na(host)){
txdbEnsGene= makeTxDbFromBiomart(biomart=biomart,dataset=dataset,id_prefix="ensembl_")
} else {
txdbEnsGene= makeTxDbFromBiomart(biomart=biomart,dataset=dataset,id_prefix="ensembl_",host=host)
}
}
genesTxDb=genes(txdbEnsGene)
geneNamesTMP=names(genesTxDb)
if(any(grepl('\\.',geneNamesTMP))) # Gene_Id or gene_id plus version as in gencode
{
geneNamesTMP = gsub('(.*)\\..*','\\1',geneNamesTMP)
}
ensemblAnnotation=getBM(attributes=c('ensembl_gene_id',
ifelse(genome == 'Grch38','hgnc_symbol','mgi_symbol'),
'gene_biotype','description','chromosome_name','start_position','end_position','strand'), filters = 'ensembl_gene_id', values = geneNamesTMP, mart = ensembl)
ensemblAnnotation=merge(ensemblAnnotation,geneNamesTMP,by.x='ensembl_gene_id',by.y=1,all=T) # include any gene names that were missing from biomart, should not happen
ensemblAnnotation=ensemblAnnotation[!duplicated(ensemblAnnotation$ensembl_gene_id),] # remove duplicated entries for genes
rownames(ensemblAnnotation)=ensemblAnnotation$ensembl_gene_id
ensemblAnnotation=ensemblAnnotation[geneNamesTMP,] #reorder, select same as exon names
##########################ensembl### HERE ############################
# ----- Get Annotation for Transcripts -----
transcriptsTxDb = transcriptLengths(txdbEnsGene)
transcriptNamesTMP=transcriptsTxDb$tx_name
if(any(grepl('\\.',transcriptNamesTMP))) # Gene_Id or gene_id plus version as in gencode
{
transcriptNamesTMP = gsub('(.*)\\..*','\\1',transcriptNamesTMP)
}
ensemblAnnotationTranscripts=getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id'), filters = 'ensembl_transcript_id', values = transcriptNamesTMP, mart = ensembl)
ensemblAnnotationTranscripts=merge(ensemblAnnotationTranscripts,transcriptsTxDb,by.x='ensembl_transcript_id',by.y='tx_name',all=T) # include any gene names that were missing from biomart
ensemblAnnotationTranscripts=merge(ensemblAnnotationTranscripts,ensemblAnnotation,by='ensembl_gene_id',all.x=T,all.y=F) # Merge with gene annotations
ensemblAnnotationTranscripts=ensemblAnnotationTranscripts[!duplicated(ensemblAnnotationTranscripts$ensembl_transcript_id),]
rownames(ensemblAnnotationTranscripts)=ensemblAnnotationTranscripts$ensembl_transcript_id
ensemblAnnotationTranscripts=ensemblAnnotationTranscripts[transcriptNamesTMP,] #reorder, select same as exon names
write.table(ensemblAnnotation,file=paste(outputPrefix,'-genes.txt',sep=''),col.names=T,row.names=T,sep='\t',quote=F)
write.table(ensemblAnnotationTranscripts,file=paste(outputPrefix,'-transcripts.txt',sep=''),col.names=T,row.names=T,sep='\t',quote=F)
saveDb(txdbEnsGene,file=paste(outputPrefix,'-txdb.sqlite',sep=''))
}
Here is how you can run it for humans. Be sure to change the paths.
parameters <- list(genome = 'Grch38')
ensembl_version <- 91+c(0,3)[as.numeric(parameters[['genome']]=='Grch38')+1]
ensembl_CAP <- paste0(toupper(substr(parameters[['genome']],1,3)),substr(parameters[['genome']],4,6))
saveEnsemblTranscripts(outputPrefix = paste0('~/Desktop/temp/','/Homo_sapiens.',ensembl_CAP,'.',ensembl_version,'.annotations'),
genome = parameters[['genome']],
dataset = "hsapiens_gene_ensembl",
biomart='ENSEMBL_MART_ENSEMBL',
gtfFile = "~/Desktop/Homo_sapiens.GRCh38.91.sorted.gtf")
Hope this helps, Andre Sim
Hi there,
Thank you again for creating this wonderful tool! I had a question regarding your fusion gene analysis in your recent biorxiv paper: https://www.biorxiv.org/content/10.1101/2021.04.21.440736v1
In the methods, you describe aligning reads to a fusion gene annotation set from JAFFAL, then running bambu on these fusion alignments to discover fusion transcripts. I had a few questions about this:
1) Do you have a pipeline to do this, and if so, would you be willing to share the code? The results you show in Fig 7 of the paper are striking, and I would love to run such an analysis on the datasets I'm working with. 2) For the "fusion gene set" that you describe aligning to in the manuscript, is this just the cDNA fasta file output from JAFFAL? And did you somehow extract reads specific to these regions to perform this alignment or just align the entire fastq to these gene fusions? 3) When bambu was then run on the fusion alignments to discover fusion transcripts, was this done using the "fusion mode"? Further, was the reference gtf just ensembl or was bambu run in denovo mode here?
Thanks for your time and help.
Sincerely, Asher