GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

Merging read classes across samples results in fewer novel transcripts #448

Open apsteinberg opened 1 week ago

apsteinberg commented 1 week ago

Hi Andre and the Bambu team,

I am encountering an issue related to issue #444. I merged read classes across 44 samples as you had suggested and found that about 1000 novel transcript isoforms were discovered. Originally, I had analyzed 9 of these samples using the same read classes, and I found 1300 novel transcript isoforms. In both cases, I fixed NDR = 0.1.

Why am I finding fewer novel transcript isoforms with the 44 samples vs the 9 samples? From my reading of your user manual on GitHub, I thought that in multi-sample mode the samples were still analyzed independently, and that the merge simply unified novel transcript IDs across samples. Is it possible that some of the putative novel transcripts ended up merged with canonical transcripts that were detected in the larger analysis? Also, how is this merge performed, and how does it differ from StringTie's merge method? I know you mentioned that the latter would increase my false positive rate, but it is unclear to me how.

Thanks for your time and help.

Best, Asher

andredsim commented 1 week ago

Hi Asher,

Yes, this might seem a bit unintuitive, but it is expected behavior.

For background: we found that during multi-sample analysis, as you add more and more reads you also greatly increase the noise, meaning more false-positive transcripts are reported as novel (resulting in less accurate transcript discovery with more data). Therefore we use the presence of known annotations in the ranking of transcript candidates to determine how many novel transcripts we should return, in order to control for this sample-size bias. An NDR of 0.1 means a prediction score threshold is chosen such that 10% of all transcripts passing that threshold are novel.

If you add more samples and the ratio of highly scored novel transcripts increases relative to highly scored known transcripts, this then leads to a reduction in the total number of novel transcripts returned, which is what I suspect is occurring in your case. This is more likely to happen if you are combining multiple sample types from different tissues, as highly scored known transcripts are usually present in multiple sample types, whereas novel transcripts are more likely to be niche.
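A toy sketch of this ratio effect (illustrative numbers only, plain Python, not bambu's actual scoring code): once novel candidates start appearing higher up the score ranking relative to known ones, the same NDR cutoff admits fewer novel transcripts overall.

```python
# Toy model of NDR thresholding (illustrative only, not bambu's code).
# ranked_is_novel: candidate transcripts in descending score order,
# True = novel, False = known/annotated.
def novel_passing(ranked_is_novel, target_ndr=0.1):
    """Count novel transcripts admitted at the given NDR target."""
    best = novel = 0
    for i, is_novel in enumerate(ranked_is_novel, start=1):
        novel += is_novel
        if novel / i <= target_ndr:
            best = novel  # deepest cutoff still within the target NDR
    return best

# One novel per nine knowns throughout the ranking: all 13 novels pass.
sparse = ([False] * 9 + [True]) * 13
# Two novels per nine knowns (more high-scoring noise): far fewer pass,
# even though the data contain more novel candidates in total.
dense = ([False] * 9 + [True, True]) * 13
print(novel_passing(sparse), novel_passing(dense))  # 13 2
```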

We found that for analysis of a single human tissue an NDR of 0.1 is sufficient; however, if you are combining multiple sample types, or you expect more novel transcripts, you may want to raise the NDR value. Alternatively, you can omit the NDR value and let Bambu suggest one, where it attempts to estimate how many missing annotations are present in your sample and recommends an appropriate threshold. You may also want to do transcript discovery with an NDR of 1, and then manually filter the novel transcripts down to the desired number yourself, depending on your research goals.
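The "NDR = 1, then filter" workflow amounts to keeping every candidate and subsetting afterwards by the reported NDR score. A minimal sketch in plain Python over a hypothetical exported transcript table (the column names here are illustrative, not bambu's R API):

```python
# Hypothetical exported rows: each novel transcript carries its NDR score.
transcripts = [
    {"tx_id": "tx.1", "novel": True,  "ndr": 0.04},
    {"tx_id": "tx.2", "novel": True,  "ndr": 0.35},
    {"tx_id": "tx.3", "novel": False, "ndr": 0.01},  # known annotation
]

# Keep all known transcripts, plus novel ones up to a chosen NDR cutoff.
cutoff = 0.2
kept = [t["tx_id"] for t in transcripts
        if not t["novel"] or t["ndr"] <= cutoff]
print(kept)  # ['tx.1', 'tx.3']
```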

I hope this answers your question and clarifies how Bambu's multiple-sample handling differs from the StringTie merge method, which is more similar to a union across samples.

Kind Regards, Andre Sim

apsteinberg commented 1 week ago

Hi Andre,

Thank you so much for this very thoughtful response. The background really helps me wrap my head around this.

I started to explore the distributions of NDRs for all the transcripts in my analysis, and there are a couple of points which I was hoping you could help me understand:

1) When I compare the NDR distributions for all canonical vs non-canonical transcripts (transcript count vs NDR), I find that for canonical transcripts the distribution is greatly skewed towards lower values, while for novel transcripts the counts appear to increase linearly with NDR. For the canonical transcripts, I guess there is weighting within your scoring algorithm which skews the distribution? This point was a bit fuzzy for me from the paper, but if the algorithm is trained on the features of the canonical transcripts, it would make sense that this bias is present. The non-canonical pattern seems to make sense as well: as you get more permissive you include more and more transcripts, particularly as there is no bias towards these transcripts.

2) The other thing I found is that when I look at all transcripts regardless of NDR, the NDR never exceeds 0.4 for either non-canonical or canonical transcripts. Is this typical? I had naively expected NDR scores to go up to 1. How do I interpret this? Does this mean that even if I included all transcripts in my analysis (I'm analyzing patient tumor samples), I would still have an FDR < 0.4?

Thanks again for your time and help.

Best wishes, Asher

andredsim commented 1 week ago

Hi Asher,

  1. Yes, as canonical transcripts are used to train the model and are then effectively also in the test set, they will tend to have lower NDRs. In addition, canonical transcripts are usually the transcripts that are easier to find, so they would naturally get lower NDR scores even without the training/testing bias. We see this when testing performance by removing canonical transcripts before running the analysis.

  2. The maximum NDR is determined by the ratio of canonical vs non-canonical transcripts in the whole sample. It is essentially an inverse precision score, so a maximum of 0.4 means that of all the transcripts Bambu could report, 40% are non-canonical (which is actually quite low; normally we see a maximum of 0.6-0.7). If you are familiar with precision-recall curves, it's equivalent to the point at the far right, representing the lowest possible precision that can be achieved (which usually will not reach 0). If the maximum NDR you see across all transcripts is 0.4, then increasing the NDR threshold past that point should not change your results, and you would have an FDR <= 0.4.
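The arithmetic behind the "inverse precision" reading can be sketched as follows (made-up candidate counts, plain Python):

```python
# Max NDR as inverse precision: the fraction of non-canonical
# transcripts among everything that could possibly be reported.
def max_ndr(n_canonical, n_non_canonical):
    return n_non_canonical / (n_canonical + n_non_canonical)

# 60 canonical + 40 non-canonical candidates -> max NDR of 0.4,
# i.e. a worst-case precision of 1 - 0.4 = 0.6.
print(max_ndr(60, 40))  # 0.4
```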

Kind Regards, Andre Sim

apsteinberg commented 6 days ago

Hi Andre,

Thank you again, this is really helpful.

For 1, this makes sense; thanks for explaining.

For 2, got it, thanks for explaining this point as well. This makes a lot of sense too. My last question related to this is: if a max NDR of 0.4 is pretty low, is there anything I may want to try to increase the sensitivity in my analysis? And when you say you typically see a max NDR of 0.6-0.7, is that for cell lines or for human tissue samples? I could imagine that perhaps cell lines are more divergent than human tissue, even if the human tissue samples are derived from tumors. Also, could this potentially be related to improved Nanopore chemistry if the samples were recently collected?

Best wishes, Asher

andredsim commented 6 days ago

Hi Asher,

Ah, sorry, when I said a max NDR of 0.4 was low I meant that it is abnormally good for a sample, since the lower the number, the fewer false positives are in the dataset. The maximum NDR does not have a bearing on sensitivity. Yes, the samples I normally see higher max NDRs for are cancer cell lines, so I think your explanation is right, but improved chemistry with lower error rates and more full-length transcripts would contribute to this too.

There are some other things you can do to increase sensitivity, although we do not typically recommend them: setting remove.subsetTx to FALSE and min.readFractionByGene to 0 using the opt.discovery argument. The first means Bambu will also score and output subset transcripts (incomplete splice matches), which will drastically increase the number of false positives. The second stops filtering out transcripts that represent <0.05 (default) of a gene's expression; this change won't be as bad for precision as the first one, so I would try it first to see if it increases sensitivity to acceptable levels for you.
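What the min.readFractionByGene filter does can be sketched conceptually like this (plain Python with invented counts; bambu's actual R implementation differs):

```python
# Drop transcripts contributing less than min_fraction of their
# gene's total reads (bambu's default fraction is 0.05).
def filter_by_gene_fraction(gene_tx_counts, min_fraction=0.05):
    """gene_tx_counts: dict of gene_id -> {tx_id: read count}."""
    kept = []
    for gene, txs in gene_tx_counts.items():
        total = sum(txs.values())
        for tx, n in txs.items():
            if total > 0 and n / total >= min_fraction:
                kept.append(tx)
    return kept

counts = {"geneA": {"tx1": 97, "tx2": 3}}     # tx2 is 3% of the gene
print(filter_by_gene_fraction(counts))        # ['tx1']  (tx2 filtered out)
print(filter_by_gene_fraction(counts, 0.0))   # ['tx1', 'tx2']  (more sensitive)
```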

Kind Regards, Andre Sim

apsteinberg commented 4 days ago

Hi Andre,

Re: the low NDR. That's great, and thanks for explaining; this really helps.

Re: increasing sensitivity. Okay sounds good -- I will give the latter, safer option a try first and see how it goes.

Thanks again for all your help.

Cheers, Asher

apsteinberg commented 4 days ago

Hi Andre,

Sorry for all the questions here, but upon examining these data more, I'm realizing that when I look at the readCount column of the merged summarized experiment object from the 42 samples (there are only 42; the 44 I wrote before was a typo), the minimum readCount across all transcripts is 84. Unless this is a coincidence, it seems to suggest that merging in the way we discussed in issue #444 requires any assembled transcript to have >= 2 full-length reads in each sample.

Is this some sort of artifact? I know that min.readCount = 2 by default, but I thought min.sampleNumber = 1, so I would expect assembled transcripts with read counts going down as low as 2. If this is actually some sort of error, it could explain why the NDR values are so low for this dataset. Also, if this is an issue, please let me know how you would modify what we had discussed doing in issue #444.

Thanks again, Asher

PS: Also, the readCounts here are full-length read counts (vs partial + full-length read counts), right?

andredsim commented 3 days ago

Hi Asher,

No problem with all the questions, they have been insightful. Regarding this minimum count observation: that is not expected and could be a bug. I will see if I can replicate it on my end and investigate. I may need some time to look into this, so I will get back to you, hopefully with an explanation or a solution.

Re: PS. The readCounts in the rowData do not represent the final quantification counts; that column is actually one we might consider removing in the future, as it is a leftover from the transcript discovery step for candidates that are already present in the known annotations. These read counts have no use once you have the output from the quantification step. The partial + full-length read counts are found in assays(se)$counts.

Kind Regards, Andre Sim