GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
190 stars 24 forks source link

How to extract transcript assembled from the RNA-seq data? #438

Closed dudududu12138 closed 4 months ago

dudududu12138 commented 4 months ago

Hi, I noticed that your default output annotation is an extended annotation(reference+novel). While I want to get the annotation that assemble from the RNA-seq data. That is, I want to get the transcripts the sample expressed without quantification step. I can just discover transcript only with your codes below while the output gtf file is still an extended annotation. It seems that I can only distinguished the expressed transcripts from quantification. But you discover transcripts also with the raw RNA-seq file. I want to get this raw assemble output. Thanks!

se.discoveryOnly <- bambu(reads = sample, annotations = annotations, genome = fa.file, quant = FALSE)
writeToGTF(se.discoveryOnly, "./output.gtf")

By the way, I converted the GRanges file to data frame and but I don't know the meanings of different columns.Could you explain the meaning of readCount ,relReadCount and relSubsetCount?

z<-as.data.frame(mcols(se.discoveryOnly ))
head(z)

1720445601866

andredsim commented 4 months ago

Hi, thanks for using Bambu.

If you want to filter the output to only include transcripts where there was full-length read support detected by Bambu follow the steps in this part of the documentation which would require you to also do the quantification step. https://github.com/GoekeLab/bambu?tab=readme-ov-file#Output-Description

se.novel = se[assays(se)$fullLengthCounts >= 1,]
writeBambuOutput(se.novel, path = "./bambu/")

Alternatively you can filter by the readCount column that you highlighted >= 1. That should remove reference annotations that were not detected.

The description of the readCount, relReadCount and relSubsetCount can be found in our documentation. I do want to point out the last two columns are best ignored and are present mainly for the fusion mode function. https://github.com/GoekeLab/bambu?tab=readme-ov-file#Output-Description

Hope this helps. If you have any further questions/issues please feel free to reopen this ticket or create a new one.

Kind Regards, Andre Sim

dudududu12138 commented 4 months ago

Alternatively you can filter by the readCount column that you highlighted >= 1. That should remove reference annotations that were not detected.

Thanks for your reply. That is what I need.

dudududu12138 commented 4 months ago

Hi, thanks for using Bambu.

If you want to filter the output to only include transcripts where there was full-length read support detected by Bambu follow the steps in this part of the documentation which would require you to also do the quantification step. https://github.com/GoekeLab/bambu?tab=readme-ov-file#Output-Description

se.novel = se[assays(se)$fullLengthCounts >= 1,]
writeBambuOutput(se.novel, path = "./bambu/")

Alternatively you can filter by the readCount column that you highlighted >= 1. That should remove reference annotations that were not detected.

The description of the readCount, relReadCount and relSubsetCount can be found in our documentation. I do want to point out the last two columns are best ignored and are present mainly for the fusion mode function. https://github.com/GoekeLab/bambu?tab=readme-ov-file#Output-Description

Hope this helps. If you have any further questions/issues please feel free to reopen this ticket or create a new one.

Kind Regards, Andre Sim

Sorry, I met a new problem. I tested the difference between the assemble result of doing quantification and not doing quantification. The number of transcripts differs too much. Below is my code and the result.

1.extract transcripts with quantification step

bambuAnnotations <- prepareAnnotations(ref_anno)
se<-bambu(reads=bam, annotations=bambuAnnotations, genome=ref,quant=TRUE,discovery=TRUE,trackReads=TRUE,yieldSize=1e8)
writeBambuOutput(se, path=out)

Then I counts transcripts supported by full length reads, there are 14481 transcripts.

awk '$3>0' fullLengthCounts_transcript.txt | wc -l

2.extract transcripts without quantification step

bambuAnnotations <- prepareAnnotations(ref_anno)
se<-bambu(reads=bam, annotations=bambuAnnotations, genome=ref,quant=FALSE,discovery=TRUE,trackReads=TRUE,yieldSize=1e8)
result<-se[!is.na(mcols(se)$readCount) & mcols(se)$readCount >0,]
writeToGTF(result, out)

I also count the number of transcript and there are only 2180 transcripts.

awk '$3=="transcript"' noquant.gtf | wc -l

So which one should I choose? I just want to get the transcript assembled with this RNA-seq data. It seems that the discovery step and the quantification step differ so much.

cying111 commented 4 months ago

Hi @dudududu12138 ,

Thanks for using Bambu.

The reason that you are observing different number of transcripts is because the filtering seems to be the same but actually is very different on those two outputs.

The filtering with the quantification step filters out non-full-length read supported transcripts, including annotated transcripts While the filtering without the quantification step filters out annotated transcripts, because we provide read count for novel transcripts only at this step

The correct way of comparing results from both would be that in the filtering step with the quantification, you can also filter out the annotated transcripts by `se<-bambu(reads=bam, annotations=bambuAnnotations, genome=ref,quant=TRUE,discovery=TRUE,trackReads=TRUE,yieldSize=1e8) seFiltered <- se[grepl("BambuTx", rownames(se))] writeBambuOutput(se, path=out) ' This should be giving you similar number of transcripts.

But if you want the assembled transcripts for the RNA-seq data including both novel and annotated transcripts, you should probably use the one with quantification.

Thank you and hope this has clarified your question. Let us know if you have additional questions.

Thank you Warm regards, Ying

dudududu12138 commented 4 months ago

Hi @dudududu12138 ,

Thanks for using Bambu.

The reason that you are observing different number of transcripts is because the filtering seems to be the same but actually is very different on those two outputs.

The filtering with the quantification step filters out non-full-length read supported transcripts, including annotated transcripts While the filtering without the quantification step filters out annotated transcripts, because we provide read count for novel transcripts only at this step

The correct way of comparing results from both would be that in the filtering step with the quantification, you can also filter out the annotated transcripts by `se<-bambu(reads=bam, annotations=bambuAnnotations, genome=ref,quant=TRUE,discovery=TRUE,trackReads=TRUE,yieldSize=1e8) seFiltered <- se[grepl("BambuTx", rownames(se))] writeBambuOutput(se, path=out) ' This should be giving you similar number of transcripts.

But if you want the assembled transcripts for the RNA-seq data including both novel and annotated transcripts, you should probably use the one with quantification.

Thank you and hope this has clarified your question. Let us know if you have additional questions.

Thank you Warm regards, Ying

Thank you very much! I have no question now.