GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

Merging read classes across samples results in fewer novel transcripts #448

Open apsteinberg opened 1 week ago

apsteinberg commented 1 week ago

Hi Andre and the Bambu team,

I am encountering an issue related to issue #444. I merged read classes across 44 samples as you had suggested and found that about 1000 novel transcript isoforms were discovered. Originally, I had analyzed 9 of these samples using the same read classes, and I found 1300 novel transcript isoforms. In both cases, I fixed NDR = 0.1.

Why am I finding fewer novel transcript isoforms with the 44 samples vs the 9 samples? From my reading of your user manual on GitHub, I thought that in multi-sample mode the samples were still analyzed independently, and that the merge simply unified novel transcript IDs across samples. Is it possible that some of the putative novel transcripts ended up merged with canonical transcripts that were detected in the larger analysis? Also, how is this merge performed, and how does it differ from StringTie's merge method? I know you mentioned that the latter would increase my false positive rate, but it is unclear to me how.

Thanks for your time and help.

Best, Asher

andredsim commented 1 week ago

Hi Asher,

Yes, this might seem a bit unintuitive, but it is expected behavior.

For background: we found that during multi-sample analysis, as you add more and more reads you also greatly increase the noise, meaning more false-positive transcripts are reported as novel (resulting in less accurate transcript discovery with more data). Therefore we use the presence of known annotations in the ranking of transcript candidates to determine how many novel transcripts we should return, in order to control for this sample-size bias. An NDR of 0.1 means a prediction score threshold is chosen such that 10% of all transcripts passing that threshold are novel.

If you add more samples and the ratio of highly scored novel transcripts increases relative to highly scored known transcripts, this then leads to a reduction in the total number of novel transcripts returned, which is what I suspect is occurring in your case. This is more likely to happen if you are combining multiple sample types from different tissues, as highly scored known transcripts are usually present in multiple sample types, whereas novel transcripts are more likely to be niche.
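A toy sketch of this ratio effect (illustrative numbers only, plain Python, not bambu's actual scoring code): once novel candidates start appearing higher up the score ranking relative to known ones, the same NDR cutoff admits fewer novel transcripts overall.

```python
# Toy model of NDR thresholding (illustrative only, not bambu's code).
# ranked_is_novel: candidate transcripts in descending score order,
# True = novel, False = known/annotated.
def novel_passing(ranked_is_novel, target_ndr=0.1):
    """Count novel transcripts admitted at the given NDR target."""
    best = novel = 0
    for i, is_novel in enumerate(ranked_is_novel, start=1):
        novel += is_novel
        if novel / i <= target_ndr:
            best = novel  # deepest cutoff still within the target NDR
    return best

# One novel per nine knowns throughout the ranking: all 13 novels pass.
sparse = ([False] * 9 + [True]) * 13
# Two novels per nine knowns (more high-scoring noise): far fewer pass,
# even though the data contain more novel candidates in total.
dense = ([False] * 9 + [True, True]) * 13
print(novel_passing(sparse), novel_passing(dense))  # 13 2
```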

We found that for analysis of a single human tissue an NDR of 0.1 is sufficient; however, if you are combining multiple sample types, or you expect more novel transcripts, you may want to raise the NDR value. Alternatively, you can omit the NDR value and let Bambu suggest one, where it attempts to estimate how many missing annotations are present in your sample and recommends an appropriate threshold. You may also want to do transcript discovery with an NDR of 1, and then manually filter the novel transcripts down to the desired number yourself, depending on your research goals.
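The "NDR = 1, then filter" workflow amounts to keeping every candidate and subsetting afterwards by the reported NDR score. A minimal sketch in plain Python over a hypothetical exported transcript table (the column names here are illustrative, not bambu's R API):

```python
# Hypothetical exported rows: each novel transcript carries its NDR score.
transcripts = [
    {"tx_id": "tx.1", "novel": True,  "ndr": 0.04},
    {"tx_id": "tx.2", "novel": True,  "ndr": 0.35},
    {"tx_id": "tx.3", "novel": False, "ndr": 0.01},  # known annotation
]

# Keep all known transcripts, plus novel ones up to a chosen NDR cutoff.
cutoff = 0.2
kept = [t["tx_id"] for t in transcripts
        if not t["novel"] or t["ndr"] <= cutoff]
print(kept)  # ['tx.1', 'tx.3']
```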

I hope this answers your question and clarifies how Bambu's multiple-sample handling differs from the StringTie merge method, which is more similar to a union across samples.

Kind Regards, Andre Sim

apsteinberg commented 1 week ago

Hi Andre,

Thank you so much for this very thoughtful response. The background really helps me wrap my head around this.

I started to explore the distributions of NDRs for all the transcripts in my analysis, and there are a couple of points which I was hoping you could help me understand:

1) When I compare the NDR distributions for all canonical vs non-canonical transcripts (transcript count vs NDR), I find that for canonical transcripts the distribution is greatly skewed towards lower values, while for novel transcripts the counts appear to increase linearly with NDR. For the canonical transcripts, I guess there is weighting within your scoring algorithm which skews the distribution? This point was a bit fuzzy for me from the paper, but if the algorithm is trained on the features of the canonical transcripts, it would make sense that this bias is present. The non-canonical pattern seems to make sense as well: as you get more permissive you include more and more transcripts, particularly as there is no bias towards these transcripts.

2) The other thing I found is that when I look at all transcripts regardless of NDR, the NDR never exceeds 0.4 for either non-canonical or canonical transcripts. Is this typical? I had naively expected NDR scores to go up to 1. How do I interpret this? Does this mean that even if I included all transcripts in my analysis (I'm analyzing patient tumor samples), I would still have an FDR < 0.4?

Thanks again for your time and help.

Best wishes, Asher

andredsim commented 1 week ago

Hi Asher,

  1. Yes, as canonical transcripts are used to train the model and are then effectively also in the test set, they will tend to have lower NDRs. In addition, canonical transcripts are usually the transcripts that are easier to find, so they would naturally get lower NDR scores even without the training/testing bias. We see this when testing performance by removing canonical transcripts before running the analysis.

  2. The maximum NDR is determined by the ratio of canonical vs non-canonical transcripts in the whole sample. It is essentially an inverse precision score, so a maximum of 0.4 means that of all the transcripts Bambu could report, 40% are non-canonical (which is actually quite low; normally we see a maximum of 0.6-0.7). If you are familiar with precision-recall curves, it's equivalent to the point at the far right, representing the lowest possible precision that can be achieved (which usually will not reach 0). If the maximum NDR you see across all transcripts is 0.4, then increasing the NDR threshold past that point should not change your results, and you would have an FDR <= 0.4.
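The arithmetic behind the "inverse precision" reading can be sketched as follows (made-up candidate counts, plain Python):

```python
# Max NDR as inverse precision: the fraction of non-canonical
# transcripts among everything that could possibly be reported.
def max_ndr(n_canonical, n_non_canonical):
    return n_non_canonical / (n_canonical + n_non_canonical)

# 60 canonical + 40 non-canonical candidates -> max NDR of 0.4,
# i.e. a worst-case precision of 1 - 0.4 = 0.6.
print(max_ndr(60, 40))  # 0.4
```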

Kind Regards, Andre Sim

apsteinberg commented 6 days ago

Hi Andre,

Thank you again, this is really helpful.

For 1, this makes sense; thanks for explaining.

For 2, got it, thanks for explaining this point as well. This makes a lot of sense too. My last question related to this is: if a max NDR of 0.4 is pretty low, is there anything I may want to try to increase the sensitivity in my analysis? And when you say you typically see a max NDR of 0.6-0.7, is that for cell lines or for human tissue samples? I could imagine that perhaps cell lines are more divergent than human tissue, even if the human tissue samples are derived from tumors. Also, could this potentially be related to improved Nanopore chemistry if the samples were recently collected?

Best wishes, Asher

andredsim commented 6 days ago

Hi Asher,

Ah, sorry, when I said a max NDR of 0.4 was low I meant that it is abnormally good for a sample, since the lower the number, the fewer false positives are in the dataset. The maximum NDR does not have a bearing on sensitivity. Yes, the samples I normally see higher max NDRs for are cancer cell lines, so I think your explanation is right, but improved chemistry with lower error rates and more full-length transcripts would contribute to this too.

There are some other things you can do to increase sensitivity, although we do not typically recommend them: setting remove.subsetTx to FALSE and min.readFractionByGene to 0 using the opt.discovery argument. The first means Bambu will also score and output subset transcripts (incomplete splice matches), which will drastically increase the number of false positives. The second stops filtering out transcripts that represent <0.05 (default) of a gene's expression; this change won't be as bad for precision as the first one, so I would try it first to see if it increases sensitivity to acceptable levels for you.
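What the min.readFractionByGene filter does can be sketched conceptually like this (plain Python with invented counts; bambu's actual R implementation differs):

```python
# Drop transcripts contributing less than min_fraction of their
# gene's total reads (bambu's default fraction is 0.05).
def filter_by_gene_fraction(gene_tx_counts, min_fraction=0.05):
    """gene_tx_counts: dict of gene_id -> {tx_id: read count}."""
    kept = []
    for gene, txs in gene_tx_counts.items():
        total = sum(txs.values())
        for tx, n in txs.items():
            if total > 0 and n / total >= min_fraction:
                kept.append(tx)
    return kept

counts = {"geneA": {"tx1": 97, "tx2": 3}}     # tx2 is 3% of the gene
print(filter_by_gene_fraction(counts))        # ['tx1']  (tx2 filtered out)
print(filter_by_gene_fraction(counts, 0.0))   # ['tx1', 'tx2']  (more sensitive)
```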

Kind Regards, Andre Sim

apsteinberg commented 4 days ago

Hi Andre,

Re: the low NDR. That's great, and thanks for explaining; this really helps.

Re: increasing sensitivity. Okay sounds good -- I will give the latter, safer option a try first and see how it goes.

Thanks again for all your help.

Cheers, Asher

apsteinberg commented 4 days ago

Hi Andre,

Sorry for all the questions here, but upon examining these data more, I'm realizing that when I look at the readCount column of the merged summarized experiment object from the 42 samples (there are only 42; the 44 I wrote before was a typo), the minimum readCount across all transcripts is 84. Unless this is a coincidence, it seems to suggest that merging in the way we discussed in issue #444 requires any assembled transcript to have >= 2 full-length reads in each sample.

Is this some sort of artifact? I know that min.readCount = 2 by default, but I thought min.sampleNumber = 1, so I would expect assembled transcripts with read counts going down as low as 2. If this is actually some sort of error, it could explain why the NDR values are so low for this dataset. Also, if this is an issue, please let me know how you would modify what we had discussed doing in issue #444.

Thanks again, Asher

PS: Also, the readCounts here are full-length read counts (vs partial + full-length read counts), right?

andredsim commented 3 days ago

Hi Asher,

No problem with all the questions, they have been insightful. Regarding this minimum count observation: that is not expected and could be a bug. I will see if I can replicate it on my end and investigate. I may need some time to look into this, so I will get back to you, hopefully with an explanation or a solution.

Re: PS. The readCounts in the rowData do not represent the final quantification counts; that column is actually one we might consider removing in the future, as it is a leftover from the transcript discovery step for candidates that are already present in the known annotations. These read counts have no use once you have the output from the quantification step. The partial + full-length read counts are found in assays(se)$counts.

Kind Regards, Andre Sim