Hi Asher,
Mostly you are correct: both ways will use the pretrained model, and NDR cannot be calculated in de novo mode. However, there are a few nuanced differences. When you provide the annotations, bambu can use them to assist in junction correction, meaning the generated read classes are more likely to match the annotations. It also helps in assigning gene IDs, without which you may end up with amalgamated genes, which affects how the model interprets the read classes. If you do have annotations, we would always recommend running bambu with them, even if you do not train the model on them.
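For reference, here is a minimal sketch of that recommended setup (file paths are placeholders; prepareAnnotations is the bambu helper for reading a GTF into the annotations object):

```r
library(bambu)

# Placeholder paths: substitute your own files
annotations <- prepareAnnotations("reference_annotations.gtf")

# Provide annotations for junction correction and gene ID assignment,
# but keep the pretrained model rather than training on this data
se <- bambu(reads = "sample.bam", annotations = annotations,
            genome = "genome.fa",
            opt.discovery = list(fitReadClassModel = FALSE))
```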
Kind Regards, Andre Sim
Hi Andre,
Thank you, this is very helpful to know and answers my question. My one follow-up question is: does junction correction still happen in de novo mode, and if so, how is the probability of a true splicing junction predicted?
Cheers, Asher
Hi Asher,
Yes, junction correction still happens in de novo mode. Junctions are categorized as high or low confidence based on the number of reads that support them, and on the distance to, and support of, neighboring junctions. This categorization is done using a pretrained model similar to the one used in de novo mode. Low-confidence junctions are corrected to high-confidence junctions if they fall within 10 bp. This is an aspect of Bambu we do want to delve into more when we find the opportunity.
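As a toy illustration of that correction rule (this is not Bambu's internal code, and correct_junctions is just a hypothetical helper):

```r
# Snap low-confidence junction positions to the nearest
# high-confidence junction if it lies within 10 bp
correct_junctions <- function(pos, high_conf, max_dist = 10) {
  high_pos <- pos[high_conf]
  sapply(seq_along(pos), function(i) {
    if (high_conf[i] || length(high_pos) == 0) return(pos[i])
    d <- abs(high_pos - pos[i])
    if (min(d) <= max_dist) high_pos[which.min(d)] else pos[i]
  })
}

# The junction at 1007 snaps to 1000; 1058 is too far away and is kept
correct_junctions(pos = c(1000, 1007, 1058),
                  high_conf = c(TRUE, FALSE, FALSE))
#> [1] 1000 1000 1058
```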
Hope this helps. Kind Regards, Andre Sim
Hi Andre,
Thanks, this is super helpful! I'm realizing I have a couple more follow-up questions related to this (I hope it's okay to ask here; happy to correspond another way):
1) The other thing I have been considering is modulating the sensitivity of discovery post-analysis with NDR scores, as your team has shown here. Upon reading your manuscript, I'm still a bit fuzzy on what the NDR of an individual transcript means. For a transcript i, is it essentially NDR_i = 1 - TPS_i, or is this not the case?
2) With regard to my first question about de novo transcript discovery, I think it would help if I were more specific about the samples I'm working with. I am working with cancer genomes, and I was hoping to use the GTEx long-read annotation as a reference along with the pre-trained model (the first option I listed in my initial comment). I am wondering, however, whether this is a good idea, as this reference was built from many tissue types from across the body while my samples come from specific tissues. Basically, could using it as my reference be a problem, given that many of its annotated transcripts won't be expressed in my tissue of interest and so won't be found? How does bambu handle this type of situation? And could this artificially inflate the number of annotated transcripts found? Also, if there is a reference annotation you'd suggest using as a guide instead, please let me know. Hopefully this question is clear.
Thanks again for all your help.
Cheers, Asher
Hi Asher,
No problem replying here; these are good questions, and others might have the same ones and can then find the answers here too.
For example, say we have a transcript with an NDR of 0.2 and a TPS of x. This means that if we look at all transcripts with a TPS >= x (NDR <= 0.2), 80% will be annotated and 20% will be novel. Therefore the NDR for an individual transcript reflects the scores and reference annotations of the analysis around it (specifically precision). The lower the NDR, the more highly ranked it is compared to other transcripts. The NDR is only the inverse of the TPS if the NDR calculation cannot be done, such as when there are insufficient reference annotations; in that case the preceding sentence no longer holds true. Does this clear it up for you?
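A toy illustration of that relationship (again, not bambu's actual implementation): for each transcript, take the fraction of novel transcripts among all transcripts scoring at least as high.

```r
# For each transcript, NDR = fraction of unannotated transcripts
# among all transcripts with TPS >= its own TPS
empirical_ndr <- function(tps, is_annotated) {
  sapply(tps, function(x) mean(!is_annotated[tps >= x]))
}

tps          <- c(0.95, 0.90, 0.80, 0.70, 0.60)
is_annotated <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
empirical_ndr(tps, is_annotated)
#> [1] 0.0000000 0.0000000 0.3333333 0.2500000 0.4000000
```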
Using a reference annotation that contains unexpressed transcripts should not impact the analysis in any great way. The main use of reference annotations is labeling read classes as annotated or putative novel transcripts. It is very unlikely that an unexpressed transcript would appear in the sample as noise and result in a false-positive label. Reference annotations can also impact junction correction, but unexpressed transcripts can normally still inform the correct splice sites for novel transcripts that share a site with them.
I am not familiar with this particular reference and haven't used it myself, but I see that it was constructed using long-read data and FLAIR. I am moving into assumption territory here, as I haven't evaluated this myself, but I would be concerned about using reference annotations generated from long-read data, regardless of the tool used to generate them, as input into Bambu if the new transcripts haven't been lab validated. I believe this would make it much more likely that long-read artifact noise is labeled as genuine transcripts, which would impact model training. Essentially, this amplifies the errors that long-read data makes in transcript discovery.
I would recommend using Ensembl annotations or other well-validated annotations if possible. If you are concerned these annotations are missing transcripts expressed in this particular cancer sample, you can note the recommended NDR that bambu suggests, which may be higher than the usual 0.1, and which should hopefully result in these missing annotations being included. I do want to note that for other use cases we would always recommend using the most validated annotations you have access to. However, if you only have access to annotations generated from long-read data, or from gene prediction software, that will likely still be significantly better than using no annotations, so you do not need to worry.
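To make that post-hoc thresholding concrete, here is a rough sketch (I'm quoting the NDR column in rowData from memory, so treat that accessor as an assumption and check it against your bambu version):

```r
# Run discovery without filtering, then threshold afterwards
se <- bambu(reads = "sample.bam", annotations = annotations,
            genome = "genome.fa", NDR = 1)

# Keep annotated transcripts (no NDR score) plus novel transcripts
# below a more permissive threshold than the usual 0.1
ndr  <- rowData(se)$NDR
keep <- is.na(ndr) | ndr <= 0.2
se_filtered <- se[keep, ]
```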
That was a bit longer of an answer than I was expecting to write! Let me know if that came through clearly.
Kind Regards, Andre Sim
Hi Andre,
Great, glad to hear it! This is super helpful as usual. I do have a few follow-up questions:
Thanks again for all your help!
Cheers, Asher
Hi Asher,
Best of luck with your analysis, Andre
Hi Andre,
Great, thanks for explaining and thanks again for all of your advice!
Cheers, Asher
Hi there,
Thanks again for creating this great tool. I was wondering if you could clarify whether there is a difference in the way transcripts are assembled when running Bambu using a pretrained model with this set of commands:

```r
se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file,
            opt.discovery = list(fitReadClassModel = FALSE))
```

as compared to running Bambu in "de novo discovery mode" with this set of commands:

```r
novelAnnotations <- bambu(reads = test.bam, annotations = NULL, genome = fa.file,
                          NDR = 0.5, quant = FALSE)
```

From my reading of your documentation, it seems that both use the pre-trained model, but when running with a pre-trained model the NDR is computed by comparing how many novel transcripts are discovered relative to the reference annotation, whereas in de novo discovery mode we are not able to compute the NDR, so the TPS is used as a threshold instead. So the only difference between these two modes is how transcripts are filtered after assembly. Is this correct?
Thanks again for all your help!
Cheers, Asher
PS Sorry I still have yet to make much progress on the other open issue I have.