GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

Using pre-trained model vs de-novo transcript discovery #404

Closed · apsteinberg closed this issue 6 months ago

apsteinberg commented 6 months ago

Hi there,

Thanks again for creating this great tool. I was wondering if you could clarify whether there is a difference in the way transcripts are assembled when running Bambu with a pre-trained model, using this command:

se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, opt.discovery = list(fitReadClassModel = FALSE))

versus running Bambu in "de-novo discovery mode" with this command:

novelAnnotations <- bambu(reads = test.bam, annotations = NULL, genome = fa.file, NDR = 0.5, quant = FALSE)

From my reading of your documentation, it seems that both use the pre-trained model. In the pre-trained-model example, however, the NDR is computed by comparing how many novel transcripts are discovered relative to the reference annotation, whereas in "de-novo discovery mode" the NDR cannot be computed, so the TPS is used as a threshold instead. So the only difference between these two modes is how transcripts are filtered after assembly. Is this correct?

Thanks again for all your help!

Cheers, Asher

PS: Sorry, I still have yet to make much progress on the other open issue I have.

andredsim commented 6 months ago

Hi Asher,

You are mostly correct: both ways will use the pretrained model, and NDR cannot be calculated in de novo mode. However, there are a few nuanced differences. When you provide annotations, bambu can use them to assist in junction correction, meaning the generated read classes are more likely to match the annotations. Annotations also help in assigning gene ids, without which you may end up with amalgamated genes, which affects how the model interprets the read classes. If you do have annotations, we would always recommend running bambu with them, even if you do not train the model on your data.
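For reference, a minimal sketch of that recommended setup (hypothetical file paths; prepareAnnotations() converts a GTF into the annotation object bambu expects):

library(bambu)

# Hypothetical inputs; substitute your own files.
gtf.file <- "reference_annotations.gtf"
fa.file <- "genome.fa"
test.bam <- "sample.bam"

# Annotations guide junction correction and gene id assignment
# even when the pretrained model is used instead of retraining.
annotations <- prepareAnnotations(gtf.file)
se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file,
            opt.discovery = list(fitReadClassModel = FALSE))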

Kind Regards, Andre Sim

apsteinberg commented 6 months ago

Hi Andre,

Thank you, this is very helpful to know and answers my question. One follow-up question: does junction correction still happen in de novo mode, and if so, how is the probability of a true splice junction predicted?

Cheers, Asher

andredsim commented 6 months ago

Hi Asher,

Yes, junction correction still happens in de novo mode. Bambu categorizes junctions as high or low confidence based on the number of reads supporting them and on the distance to, and support of, neighboring junctions. This categorization is done with a pretrained model, similar to the one de novo mode uses for transcripts. Low-confidence junctions are corrected to high-confidence junctions if they fall within 10 bp. This is an aspect of Bambu we do want to delve into more when we find the opportunity.
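To make that rule concrete, here is a toy sketch in R (a simplification for illustration, not Bambu's internal code): each low-confidence junction is snapped to the nearest high-confidence junction if one lies within 10 bp.

# Toy illustration of the correction rule described above.
correctJunctions <- function(lowConf, highConf, maxDist = 10) {
  sapply(lowConf, function(pos) {
    d <- abs(highConf - pos)
    if (min(d) <= maxDist) highConf[which.min(d)] else pos
  })
}

highConf <- c(1000, 2500, 4800)      # well-supported junction positions
lowConf <- c(1004, 2600, 4793)       # weakly supported positions
correctJunctions(lowConf, highConf)  # returns 1000, 2600 (unchanged), 4800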

Hope this helps. Kind Regards, Andre Sim

apsteinberg commented 6 months ago

Hi Andre,

Thanks, this is super helpful! I'm realizing I have a couple more follow-up questions related to this (I hope it's okay to ask them here; happy to correspond another way):

1) The other thing I have been considering is modulating the sensitivity of discovery post-analysis with NDR scores, as your team has shown here. Having read your manuscript, I'm still a bit fuzzy on what the NDR of an individual transcript means. For a transcript i, is it essentially NDR_i = 1 - TPS_i, or is that not the case?

2) Regarding my first question about de novo transcript discovery, it would probably help if I were more specific about the samples I'm working with. I am working with cancer genomes, and I was hoping to use the GTEx long-read annotation as a reference along with the pre-trained model (the first option I listed in my initial comment). I was wondering, however, whether this is a bad idea, since this reference was collected from many tissue types across the body while my samples come from specific tissues. Basically, could there be an issue with using this as my reference, given that many annotated transcripts won't be expressed in my tissue of interest and so won't be found? How does bambu handle this type of situation? Could it artificially inflate the number of annotated transcripts found? Also, if there is a reference annotation you'd suggest using as a guide instead, please let me know. Hopefully this question is clear.

Thanks again for all your help.

Cheers, Asher

andredsim commented 6 months ago

Hi Asher,

No problem replying here; these are good questions, so others might have the same ones and can then find the answers here too.

  1. Say we have a transcript with an NDR of 0.2 and a TPS of x. This means that if we look at all transcripts with a TPS >= x (equivalently, NDR <= 0.2), 80% will be annotated and 20% will be novel. The NDR for an individual transcript therefore reflects the scores and reference annotations of the analysis around it (specifically, precision). The lower the NDR, the more highly ranked the transcript is compared to other transcripts. The NDR only falls back to the inverse of the TPS when the NDR calculation cannot be done, such as when there are insufficient reference annotations, and in that case the previous sentence no longer holds. Does this clear it up for you? (A worked toy example follows after point 2 below.)

  2. Using a reference annotation that contains unexpressed transcripts should not impact the analysis in any great way. The main use of reference annotations is labeling read classes as annotated or putative novel transcripts, and it is very unlikely that an unexpressed transcript would appear in the sample as noise and result in a false-positive label. Reference annotations can also impact junction correction, but unexpressed transcripts can normally still inform the correct splice sites for novel transcripts that share a site with them. I am not familiar with this particular reference and haven't used it myself, but I see that it was constructed from long-read data with FLAIR. I am moving into assumption territory here, as I haven't evaluated this myself, but I would be concerned about using reference annotations generated from long-read data, regardless of the tool that produced them, as input into Bambu if the new transcripts haven't been lab validated. I believe this would be much more likely to treat long-read artifact noise as true annotations and impact model training; essentially, it amplifies the errors that long-read data makes in transcript discovery. I would recommend using Ensembl annotations or other well-validated annotations if possible. If you are concerned those annotations are missing transcripts expressed in this particular cancer sample, you can note the recommended NDR that bambu suggests, which may be higher than the usual 0.1; using it should hopefully result in those missing annotations being included. For other use cases, I do want to note that we would always recommend using the most validated annotations you have access to. However, if you only have access to annotations generated from long-read data or from gene-prediction software, that will likely still be significantly better than using no annotations, so you do not need to worry.
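To make point 1 concrete, here is a toy sketch (my simplification, not Bambu's exact internals) of how per-transcript NDR values can be derived from TPS scores plus annotated/novel labels: the NDR at a transcript's TPS is the fraction of novel read classes among all read classes scoring at least as high.

tps <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40)
annotated <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

ord <- order(tps, decreasing = TRUE)                # rank by TPS
novelFrac <- cumsum(!annotated[ord]) / seq_along(ord)
novelFrac <- cummax(novelFrac)                      # my assumption: keep NDR monotone in rank
ndr <- novelFrac[match(seq_along(tps), ord)]        # map back to input order
round(ndr, 2)                                       # 0.00 0.00 0.33 0.33 0.40 0.50
# e.g. of the three read classes with TPS >= 0.80, one is novel,
# so the transcript with TPS 0.80 gets an NDR of 1/3.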

That was a longer answer than I was expecting to write! Let me know if it came through clearly.

Kind Regards, Andre Sim

apsteinberg commented 6 months ago

Hi Andre,

Great, glad to hear it! This is super helpful as usual. I do have a few follow-up questions:

  1. Thanks for elaborating on this point; however, it is still a bit fuzzy to me. So each individual transcript i has a TPS_i = x, and TPS_i is independent of the scores and reference annotations of the analysis around it. But the NDR score of an individual transcript i (NDR_i) is found by finding the optimal threshold p <= TPS_i such that NDR_p <= NDR_i, where NDR_p = V_p / R_p. Here, V_p is the number of valid novel transcripts at the given threshold p, and R_p is the total number of valid transcripts at that threshold. So this last calculation of NDR_p is what involves all the other transcripts in the analysis. Is this correct?
  2. Thanks for these details as well; they help a lot. It makes sense that using a reference annotation generated from long-read data that hasn't been lab validated could amplify the errors typical of long-read transcript discovery. You are correct that I had been concerned that Ensembl or other well-validated annotations would be missing transcripts expressed in the cancer samples I am working with. What you suggested about noting the recommended NDR makes a lot of sense; I can then use it as a guide as I modulate the NDR post-discovery. Since these are human cancer samples, it likely makes sense to use Ensembl annotations (or GENCODE annotations, which are essentially the same, from my understanding?) along with the pre-trained model, as we expect many annotated transcripts to be missing. Please let me know if you'd advise otherwise.

Thanks again for all your help!

Cheers, Asher

andredsim commented 6 months ago

Hi Asher,

  1. Yes, I believe that's almost correct. Just a few minor corrections to make sure we are on the same page: R_p is the total number of read classes (both valid and invalid) with a TPS < p. And yes, the NDR_p calculation is what involves the other transcripts in the analysis, but only those with a TPS < p, so adding a number of bottom-ranked invalid read classes would never impact the NDR calibration of any read class with a greater TPS. Hope that clears up the last of the fuzziness.
  2. Yes, either GENCODE or Ensembl is fine. By default I would probably not use the pretrained model as a first pass, as the benefit of training the model specifically on your data outweighs the mislabeling from the missing transcripts. In supplementary figure 1f you can see the pretrained model performed similarly to a situation where 25% of the annotations are missing, and I would assume that in your cancer case fewer than 25% of the annotations are missing. However, if the recommended NDR is > 0.25, then perhaps the pretrained model will perform better in your case. (See the sketch after this list for what that first pass might look like.)
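For point 2, the suggested first pass would then look roughly like this (hypothetical paths, reusing the annotations object prepared earlier; bambu() trains the read class model on your data by default):

se <- bambu(reads = "sample.bam", annotations = annotations, genome = "genome.fa")

# If the recommended NDR bambu reports for your data is high (> 0.25),
# rerun with the pretrained model instead, as in the earlier command:
# opt.discovery = list(fitReadClassModel = FALSE)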

Kind Regards, Andre Sim

apsteinberg commented 6 months ago

Hi Andre,

  1. Great, thanks. However, why is it not the case that R_p includes all read classes with TPS > p? Otherwise, wouldn't adding a number of bottom-ranked transcripts impact the NDR calibration?
  2. This sounds great, thank you for your suggestion!

Cheers, Asher

andredsim commented 6 months ago

Hi Asher,

  1. Sorry, I made a typo: R_p is all read classes with a TPS > p (the '<' was incorrect).

Best of luck with your analysis, Andre

apsteinberg commented 6 months ago

Hi Andre,

Great, thanks for explaining and thanks again for all of your advice!

Cheers, Asher