COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Best options for quantifying a transcript fully contained within a larger transcript #514

Open jasonvrogers opened 4 years ago

jasonvrogers commented 4 years ago

Hi,

This is not an issue per se; I'm just looking for advice on a problem I've been trying (with limited success) to tackle on my own for a while.

First, I've been using salmon for years and I love all the updates that have come about recently. The new selective alignment mode is an amazing upgrade!

My general question here is how to quantify various truncated isoforms of the same transcript. I'm working in S. cerevisiae, and I am not referring to isoform switching via alternative splicing, but to isoforms that are fully contained as a subset of a larger transcript's sequence.

Imagine a standard gene that gives rise to a 1000 bp transcript. Through alternative promoter usage or post-transcriptional processing, it can give rise to a truncated form that is identical to the parent transcript, except that it only contains sequence from bases 400 to 1000. I know this scenario is similar to detecting differential splice forms, and indeed when I look at overlapping genes that nonetheless have some unique sequence, salmon does an excellent job of fractionally apportioning the reads.

In this particular case, however, it seems that because all mappings to the truncated isoform also give perfect mappings to the longer isoform, even a few reads mapping uniquely to the longer isoform is enough evidence for the salmon EM algorithm (I assume that's the relevant part?) to give all of the reads to the long isoform. To visualize the problem, I'm attaching an image of alignments over a transcript that is only expressed as a portion of the annotated parent transcript. If I encode the short and long forms as separate transcripts in the salmon fasta index, all of the reads get assigned to the longer form with the options I've tried.

My question then is whether there's a combination of salmon quant options that would optimize apportioning the reads better for this class of transcript, or whether it's simply not going to work given the underlying assumptions in the software?

Thanks a ton for any help, Jason

hmlalpha2_truncated_example

rob-p commented 4 years ago

Hi Jason,

Thanks for the kind words, and for the detailed issue! This is sort of a tough one, since salmon tries to work out the relative abundance of these different transcripts in the way that maximizes the likelihood of the observed data (or, more specifically, maximizes the ELBO in the variational Bayesian framework). Of course, you seem to know that already :).

One thought that comes to mind, though, is the following. The default settings for salmon favor sparsity of the solution pretty strongly; that is, they try to explain the data with as few distinct transcripts as possible. While this is often desirable, it can lead to exactly the type of behavior you are seeing.

The way to modify this is to alter the --vbPrior parameter to salmon. Basically, this number encodes the prior observation weight that should be attributed to each isoform before accounting for the data. The default value for this parameter is 0.01. Small values (<< 1) are sparsity inducing, while larger values are not (and values close to 1 and above actually penalize sparsity). You could try quantifying with a few (larger) values of this parameter to see if any of them give allocations among these isoforms that seem to make more sense to you (or that agree more strongly with any alternative assays). Actually, I'd be very interested in hearing any feedback you have about this if you find a setting that is more in line with what you expect!
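
If it helps to see the effect concretely, here's a toy sketch (definitely not salmon's actual implementation; it ignores effective lengths and alignment scores and just runs a bare-bones variational update) of how the prior pseudo-counts shift the allocation between a short transcript and a longer transcript that fully contains it:

```python
# Toy illustration only: two transcripts, "short" and "long", where the long one
# fully contains the short one. All "shared" reads map equally well to both, and
# a handful of reads map uniquely to the long transcript's extra sequence.
import numpy as np
from scipy.special import digamma

def toy_vbem(n_shared, n_unique_long, prior, n_iter=2000):
    alpha = np.array([prior, prior])                   # --vbPrior-like pseudo-reads
    counts = np.array([n_shared / 2.0, n_shared / 2.0 + n_unique_long])
    for _ in range(n_iter):
        # expected log abundance under the variational Dirichlet posterior
        elog = digamma(alpha + counts) - digamma((alpha + counts).sum())
        w = np.exp(elog)
        share_short = n_shared * w[0] / (w[0] + w[1])  # re-apportion the shared reads
        counts = np.array([share_short, n_shared - share_short + n_unique_long])
    return counts                                      # [short, long] read allocation

for prior in (0.01, 1.0, 10.0):
    print(prior, toy_vbem(n_shared=1000, n_unique_long=5, prior=prior))
```

With a small, sparsity-inducing prior like the default, the long transcript absorbs essentially everything; as the prior grows, the allocation flattens back toward an even split.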

Best, Rob

jasonvrogers commented 4 years ago

Thanks for the feedback!

I tried playing with the --vbPrior setting and observed that, as you noted, larger values tended to flatten out the read apportionment: as the prior increased, the two transcripts became increasingly similar in their final expression (presumably they would eventually hit 50/50).

It's good to know how that setting affects my data, but this is not quite what I was hoping for...

Ideally, the short transcript would get nearly all of the reads, rather than splitting the reads 50/50 or, with the default settings, giving nearly all the reads to the longer transcript. I realized that, as a human, the reason the short transcript is obviously the dominant one is how the reads pile up in the alignment. There are hundreds of reads mapping to both transcripts, but NO reads map to the 5' end of the long transcript.

As I understand the selective alignment, the alignment scores are passed to the quantification step, but the position of the reads is not used downstream. In order to pass my human intuition along here, the software would need to pay attention to the coverage bias of the reads mapping to the transcripts and assign a penalty when two otherwise equivalent transcripts have very different coverage variance across their length. This sounds like what the --posBias flag should incorporate into the effective lengths, but it doesn't have much effect on these transcripts for me (FYI, I am getting a segfault when I run only --posBias in the current salmon version, but if I run all the models together like --gcBias --seqBias --posBias, it completes fine).

Also, my intuition for these transcripts is not really a coverage "bias" as much as the read depth absolutely plummeting at the 5' end of the long transcript. It would be neat if Salmon could detect these kinds of dramatic dropoffs and add a warning or something... even if not incorporating the information into the quants... it could even be a good QC step to identify large deletions/insertions over a gene body. As far as I know, there are NO rnaseq quant programs that would handle this, because even something like a STAR -> RSEM pipeline just projects read counts to the transcriptome and doesn't incorporate the coverage information.

So, for now my workaround is to just modify the transcripts so they are non-overlapping in the transcriptome fasta or to manually count reads after looking at the alignments, but I'd love to hear any more thoughts you have on this problem.

Thanks, Jason

rob-p commented 4 years ago

Hi Jason,

Thanks for the super-detailed feedback! A couple of thoughts / questions:

FYI, I am getting a segfault when I run only --posBias in the current salmon version, but if I run all the models together like --gcBias --seqBias --posBias, it completes fine.

Do you have a small example (ref / read pair) that reproduces this? It would be great to figure out why and fix it. We could split that into a new issue if you'd rather.

P.S. Nevermind; thanks to you mentioning this, I was able to track it down and fix it in develop!

As I understand the selective alignment, the alignment scores are passed to the quantification step, but the position of the reads is not used downstream.

Well, yes and no. We make extensive use of the position when estimating the implied fragment length (the distance between paired-end reads), and we then model the conditional probability of this fragment based on the global fragment length distribution. This is just as much as is done by, e.g., RSEM. However, you are right that there is no notion of using the coverage profile in estimation (more on this below)!
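
To sketch how that enters (a simplified illustration, not salmon's exact code): each fragment's implied length gets a weight from the empirical fragment length distribution, and the same distribution determines a transcript's effective length.

```python
# Simplified sketch of how a global fragment length distribution (FLD) can enter
# the per-fragment likelihood and the effective length; not salmon's actual code.
import numpy as np

def effective_length(tlen, fld):
    """Expected number of valid fragment start positions on a transcript of
    length tlen, where fld[l] = P(fragment length == l)."""
    lengths = np.arange(len(fld))
    ok = lengths <= tlen
    return float(np.sum(fld[ok] * (tlen - lengths[ok] + 1)))

def frag_cond_prob(frag_len, tlen, fld):
    """P(fragment | transcript), up to alignment-score terms: weight the implied
    fragment length by the FLD and normalize by the possible start sites."""
    if frag_len > tlen:
        return 0.0
    return fld[frag_len] / (tlen - frag_len + 1)

# toy FLD roughly centered at 250 bp
fld = np.exp(-0.5 * ((np.arange(501) - 250) / 25.0) ** 2)
fld /= fld.sum()
print(effective_length(1000, fld), frag_cond_prob(250, 1000, fld))
```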

Also, my intuition for these transcripts is not really a coverage "bias"

My intuition agrees with yours here completely. First, this isn't really a coverage bias under the normal definition of the term. Second, the positional bias modeling in salmon is not at a per-transcript level (since that would be an astronomical number of distinct parameters to learn, and any such procedure would almost certainly overfit). Instead, it groups transcripts into length bins and learns a distinct positional bias model per bin.

It would be neat if Salmon could detect these kinds of dramatic dropoffs and add a warning or something... even if not incorporating the information into the quants... it could even be a good QC step to identify large deletions/insertions over a gene body. As far as I know, there are NO rnaseq quant programs that would handle this, because even something like a STAR -> RSEM pipeline just projects read counts to the transcriptome and doesn't incorporate the coverage information.

These are great points! A couple of thoughts. First, you are right that salmon, RSEM, etc. don't use coverage information in the way you describe here. One piece of software you might look into is Salmon Anomaly Detection (paper here). This is sort of akin to what you are suggesting, and post-processes salmon output by looking for anomalous coverage profiles. It can both flag "suspicious" transcripts, and can also sometimes move reads around to mitigate anomalous coverage. Another tool / metric you might consider is the junction coverage compatibility (paper here). While both of these approaches get at some of the core intuitions you raise in your response, they are both rather "heavyweight", and neither, of course, is built into salmon.

So, I completely agree that a lightweight version of something like SAD would be great to have built into salmon. Specifically, it makes a lot of sense to have some component of the likelihood account for the coverage profile of transcripts. While I don't know of any widely-used and actively maintained quantification tools that do this, the idea for this was proposed in the iReckon paper and a coverage-based heuristic was introduced. However, the coverage was not directly incorporated into the likelihood. Rather, a variant of the normal likelihood function was used and then coverage was used to select between different potential solutions that were otherwise of similar likelihood.

Given issues like the one you see here, and the ones that we observed in the JCC paper and that Cong and Carl observed in the SAD paper, it seems clear that it would be a big win for a quantification tool to include some sort of built-in lightweight model for things like this. The big questions are (1) how do you fold this type of intuition formally into the probabilistic model and (2) is it possible to incorporate this information efficiently? I'm very interested in pursuing something like this if it can be made to work efficiently.

Thanks! Rob

jasonvrogers commented 4 years ago

Hi Rob,

Thanks for being so interested in this! I'm blown away by your support. And thank you for preemptively fixing the bug before I could even post an example!

One piece of software you might look into is Salmon Anomaly Detection (paper here). This is sort of akin to what you are suggesting, and post-processes salmon output by looking for anomalous coverage profiles. It can both flag "suspicious" transcripts, and can also sometimes move reads around to mitigate anomalous coverage. Another tool / metric you might consider is the junction coverage compatibility (paper here).

Excellent papers! Definitely going to give those tools a try on my data.

So, I completely agree that a lightweight version of something like SAD would be great to have built into salmon. Specifically, it makes a lot of sense to have some component of the likelihood account for the coverage profile of transcripts.

Yes! I've been analyzing a large dataset, and my real motivating problem was not really the example I posted above, but distinguishing between pre-processed and fully-processed non-coding RNA transcripts. I'm attaching an image showing an example ncRNA; the two tracks are the same data, but the lower one shows abundance on a log scale. In this particular sample, it's easy to estimate that ~5-10% of the transcripts are pre-processed (they still contain neighboring genomic sequence). I wanted to see how this ratio changes between samples (for example, in a snoRNA processing-defective mutant strain), but quickly realized this is not easily done in salmon or any other quant tool, because the processed transcript is entirely a subset of the sequence of the pre-processed transcript. The only way to accurately quantify it is to use the coverage information, which, as you agreed, is not really taken into account downstream.

If Salmon could incorporate the coverage information to solve this class of problem, that would indeed be a huge win. I think the ncRNA example would be both a biologically interesting motivating problem and a good technical benchmark for implementing any new methods. It could even be used as a secondary RNA velocity measure in scRNA-seq data, provided the method used can detect these (non-polyadenylated) transcripts.

The big questions are (1) how do you fold this type of intuition formally into the probabilistic model and (2) is it possible to incorporate this information efficiently?

This is definitely your domain of expertise (and I know it's a rhetorical question, but I'd love to throw some ideas out here)... I can think of a few mostly heuristic approaches:

  1) When apportioning reads to transcripts, after the mapping phase, incorporate a notion of "evenness" into the EM step: reads that decrease the variance in coverage are favored over reads that increase the variance (so, define read depth per 10 bp window or something, calculate the variance across all windows for the transcript, then try to assign reads such that they minimize read-depth variance per isoform). The problem here is that actual coverage biases may then masquerade as alternative transcript isoforms...

  2) Use the information from the unique sequences between the transcripts: the read depth over unique regions updates the prior on the overall transcript abundance, and the otherwise non-unique reads get apportioned in accordance with the unique-region-derived prior...
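
For idea (2), the arithmetic I have in mind would be something like this (purely hypothetical, not an existing salmon feature; it assumes roughly uniform coverage along each isoform):

```python
# Hypothetical sketch of idea (2): estimate the long isoform's abundance from the
# region only it can explain, then split the shared (multimapping) reads accordingly.
def split_shared_reads(shared_reads, shared_len, unique_reads_long, unique_len_long):
    density_long = unique_reads_long / max(unique_len_long, 1)  # reads/bp only the long isoform explains
    density_shared = shared_reads / max(shared_len, 1)          # reads/bp over the shared region
    frac_long = min(1.0, density_long / density_shared) if density_shared > 0 else 0.0
    long_share = shared_reads * frac_long
    return shared_reads - long_share, long_share                # (short, long)

# 600 bp shared region with 500 reads; the long isoform's extra 400 bp of 5'
# sequence has only 3 reads, so nearly all shared reads go to the short isoform
print(split_shared_reads(shared_reads=500, shared_len=600,
                         unique_reads_long=3, unique_len_long=400))
```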

But as I think about it... I realize I don't really know the underlying algorithmic details of the existing implementations. It would be amazing if you could incorporate this type of information into Salmon. I really hope some progress can be made here!

Thanks again for helping me out and showing interest in the motivating problem!

P.S. As a total aside, I've been working with this large yeast RNA-seq dataset and eventually reached the same conclusions as the selective alignment paper and other recent ones; that is, the most important factor for getting good transcript-level quantifications is not aligning to the genome vs. aligning to the transcriptome, but rather having an accurate transcriptome annotation to begin with. I saw huge gains from updating my transcriptome annotation to include UTRs, especially given differences in coverage bias between samples... for example, if the actual transcript is 500 bp but the annotated gene body is only 200 bp, slight coverage biases can propagate non-linearly and cause huge problems downstream.

This got me thinking... if the end goal is differential expression analysis (and obviously this is not always the end goal), what if we just discard the notion of a pre-defined transcriptome and stick with equivalence classes, do differential expression analysis on the equivalence classes themselves (perhaps calculated against the whole genome... this is feasible in yeast, maybe not in humans), and only after discovering significant differential expression work backwards to interpret the changes? Is this a crazy idea? Or not crazy at all and already routine? I know salmon can already dump the counts for each equivalence class, so it's not hard for me to do what I just described, but I'm wondering if you have any opinions/insights on this idea. Thanks again!
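
For what it's worth, working from the equivalence classes directly seems pretty tractable. Here's a rough sketch of the kind of parser I'd use (this assumes the aux_info/eq_classes.txt layout I get from running salmon quant with --dumpEq in salmon 1.x; the file may be gzipped depending on version and flags, so please double-check against your own output):

```python
# Sketch of parsing salmon's equivalence-class dump into {class -> count}, so the
# same keys across samples give a count matrix for downstream DE testing.
# Assumed layout (verify on your own files): line 1 = #transcripts, line 2 = #classes,
# then the transcript names, then one line per class:
#   <k> <txp_idx_1> ... <txp_idx_k> <count>
import gzip

def read_eq_classes(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        n_txps = int(fh.readline())
        n_classes = int(fh.readline())
        names = [fh.readline().strip() for _ in range(n_txps)]
        classes = {}
        for _ in range(n_classes):
            fields = fh.readline().split()
            k = int(fields[0])
            key = tuple(sorted(names[int(i)] for i in fields[1:1 + k]))
            classes[key] = classes.get(key, 0) + int(fields[-1])  # count is the last field
    return classes

# e.g. eqs = read_eq_classes("quants/sampleA/aux_info/eq_classes.txt")
```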

snr40_isoforms

mohsenzakeri commented 4 years ago

Hi Jason,

We've been working on different ways of incorporating coverage information into the Salmon quantification model to improve RNA-seq abundance estimates, more specifically for the case you describe here, where one transcript is totally contained within another. As you mentioned, the coverage information could be very useful for assigning reads to the correct transcript in this situation.

One challenging part of tackling the problem has actually been finding or simulating a sample in which it occurs. For example, for two transcripts t1 and t2, where t1 is fully contained in t2, I simulated reads only from the smaller transcript t1, plus a few reads from the regions specific to the larger transcript t2. Based on the discussion here, I expected to see most of the reads assigned to the larger transcript, but this was not the case when I ran the quantification in EM mode. So, I would like to ask if you could please provide us with the specific sample in which you observed this behavior, so that we can look into it more closely and, hopefully, find a solution for this issue with the quantification results.

Thanks, Mohsen

jasonvrogers commented 4 years ago

Hi Mohsen,

Thanks for looking into this!

I'm attaching a small dataset to reproduce the issue. I extracted the reads aligned to the region in the screenshot above (the snR40 transcript) for that sample. I've included two fasta files, one with the standard snR40 transcript, and one that also has a second transcript called snR40_genomic, which includes upstream/downstream sequences. Based on the image above, you would expect >90% of the reads to be apportioned to the smaller transcript, but instead I see the reads going ~50/50 when I run salmon against the index containing both transcripts.

I've included an "index" folder with a script to build each index and a salmon.sh script to reproduce my results. My salmon binary version is 1.2.1. Please let me know if this works for you, or if you need anything else from me. Happy to share the full fastq files if that's helpful too.

Thanks, Jason

snr40_example_github.gz

jasonvrogers commented 4 years ago

Hi Mohsen and Rob,

So sorry if you've already been troubleshooting the example data I gave you; I realized that it is not a good example of the problem. In this example, there are snR40 and snR40_genomic transcripts, representing processed and pre-processed isoforms. However, it just so happens that there is residual adapter on some of the reads I provided, and the first nucleotide of the adapter sequence actually matches the first nucleotide of the longer, genomic version of this transcript; therefore, the genomic variant gets a slightly better alignment score, as it should. After hard-trimming any residual adapter, the results for this transcript were a lot better (although still not quite the ratio I would expect).

I have quite a few examples like this, and I'm fairly sure they are not all explained by alignment of adapter sequences. However, I just wanted to let you know in case you were already troubleshooting my example data. I'm aggregating a handful of more general examples of the same problem, but ones without a trivial solution like the one I provided. The files are too large to attach on GitHub directly, though; do you have a preferred way to share them? Maybe a Google Drive folder?

mohsenzakeri commented 4 years ago

This data works pretty well for demonstrating what you are experiencing. Thank you very much for providing the example.

One interesting observation I've had so far about the data is that using the --softclip option seems pretty important for mapping a lot of the reads, specifically the reads that only map to the shorter transcript in this example.

mohsenzakeri commented 4 years ago

Hi @jasonvrogers ,

As I looked more closely at the data you provided, I realized why you said this is not actually a good example of the case you were describing: a lot of reads get a better alignment score to the longer transcript just because of that one extra match. One other observation is that not using --softclip makes most of these reads align only to the longer transcript. However, we then realized that using --softclipOverhang (instead of --softclip) makes the expression of the shorter transcript much higher than the longer one. The reason is that with --softclipOverhang, bases are soft-clipped only if they overhang either end of the transcript, which makes reads overhanging snR40 map with a better score to the shorter transcript (they map with a high number of mismatches to the longer isoform, since soft-clipping within the transcript is not allowed with this option). So, we recommend trying --softclipOverhang on your data to see if it changes anything for the other reads as well.

Also, it would be great if you could share the other files in which the coverage profile is more critical for making the right call when assigning the reads. I can invite you to a Box folder if you share your email with me; then you can upload your data there for us.

Thanks, Mohsen

jasonvrogers commented 4 years ago

Hi @mohsenzakeri ,

Sorry for the slow reply! I’ve done some sleuthing and have (seemingly) figured out what’s going on. I compared tons of overlapping transcript scenarios and played with the salmon options and concluded the following:

  1. Success scenarios: Generally, with default options, Salmon does an excellent job of assigning reads to overlapping transcripts the same way that a human would. Whether the transcripts overlap slightly or one is entirely contained within another is irrelevant.

  2. Failure scenarios: In some scenarios with overlapping transcripts, read assignment can be very strange and unintuitive, especially when one of the transcript isoforms is much more abundant than the other (the abundant transcript tends to “steal” reads from the less abundant one).

  3. Both the success and failure scenarios are due to the length bias model used when estimating the abundance of transcripts. Totally disabling this with the --noLengthCorrection flag (NOT the --noEffectiveLengthCorrection flag) actually creates the transcript-within-a-transcript failure scenario that I mistakenly thought was the original issue. That is, when length bias modeling is turned off, longer transcripts will always get assigned all of the reads that multimap to shorter transcripts. Therefore, if you did want to tackle the transcript-within-a-transcript scenario by building a coverage bias model, you would probably want to disable the length bias modeling, or at least consider how it would interact with coverage modeling.

With that said, I'm sharing an example that illustrates each of the above points, plus a link to a toy dataset that you can use to recreate the examples or explore this further. If you'd like to dig deeper into this, feel free to e-mail me at jason@calicolabs.com; I have tons more notes and data that I'm willing to share. The dataset is in Google Drive (you'll have to click the link and request access to view it): https://drive.google.com/drive/folders/1LcJNa4PHNoYqGsnkRx0YxvNXnNJCVyq9?usp=sharing

  1. Success scenarios with default options:

In the IGV snapshots below, I show the read alignments for one sample. The top GTF annotation is the default gene annotation, and the GTF at the bottom shows the new transcript isoforms I created and quantified against (this index is called "extras").

For each example I ran salmon on the transcripts from the default or extra index, with standard options (only --validateMappings), with or without the --noLengthCorrection flag. I'm showing only the number of reads assigned to each transcript, not the TPM. I also tried this on more samples and transcript scenarios and saw the same trends.

Nested transcript isoforms: AGP1_example

AGP1_table

Looking first at the "extras" index + default options, almost all the reads are assigned to AGP1_long1, which indeed seems to be the best fit for the actual gene body + UTRs (the default AGP1 transcript is only the protein-coding ORF, with no UTRs). Even though all of the reads from AGP1_long1 would also multimap to AGP1_long2, etc., AGP1_long1 is assigned all the reads because it has the highest reads per kb. Presumably the few reads assigned to AGP1_long4 are due to the small tail of reads at the right end (5') of the gene.

In contrast, with the --noLengthCorrection flag the reads are evenly split between the long4 and long5 isoforms, presumably because all of these reads multimap to both transcripts and so get assigned randomly between them. Even though the long1 transcript also contains >90% of the reads, the few extra reads mapping uniquely to the longer transcripts cause them to "steal" all the reads from the shorter ones. This is an example of the original "transcript within a transcript" problem.
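
To convince myself of point 3, I put together a tiny toy EM (just my mental model, not salmon's actual code) with one short and one long transcript, where the long one contains the short one plus a few uniquely-mapping reads. With a length term in the read likelihood, the shorter, denser transcript keeps the shared reads; without it, the longer transcript absorbs everything:

```python
# Toy EM comparing read apportionment with and without a length term in the
# conditional read probability (illustration only, not salmon's implementation).
import numpy as np

def toy_em(shared, unique_long, len_short, len_long, length_correct, iters=2000):
    theta = np.array([0.5, 0.5])                       # [short, long] abundances
    lens = np.array([len_short, len_long], dtype=float)
    for _ in range(iters):
        w = theta / lens if length_correct else theta  # per-read weight for each transcript
        r_short = w[0] / w.sum()                       # responsibility for the shared reads
        counts = np.array([shared * r_short, shared * (1 - r_short) + unique_long])
        theta = counts / counts.sum()
    return counts

print(toy_em(500, 3, 600, 1000, length_correct=True))   # short keeps most of the reads
print(toy_em(500, 3, 600, 1000, length_correct=False))  # long "steals" essentially everything
```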

Alternative transcriptional stop site isoform: KCC4_example

KCC4_table

In this example, there are two regions in KCC4 with obviously different coverage. Ideally we would be able to have a default KCC4 transcript and a truncated isoform in the salmon index, and it would assign the reads appropriately, even though all of the reads that map to the truncated form would also multimap to the long form.

Again, you can see in the table that salmon assigns reads parsimoniously to both transcripts with the default options, but with the length bias modeling turned off ALL of the reads are assigned to the long transcript. I also added a third transcript to the right end of the transcript which is inconsistent with the coverage profile and, as hoped, salmon did not assign any reads to that variant.

So, in these two scenarios the default options produce nice results in line with our human intuition.

  2. Failure scenario with default options: PDI1_example PDI1_table

In this example there are four genes (oriented in the same direction) with wildly different expression levels. I added a "PDI1_SuperTranscript" which stretches from the 5' end of PDI1 to the 3' end of POF1 (so, all reads from all 4 genes would multimap to the super transcript). This is a contrived example to illustrate the technical details, but you could imagine similar biological scenarios, especially regarding splicing isoforms.

With the default options, you get the counterintuitive result that all of the reads from just MGR1 and POF1 (the two lowest-abundance transcripts) are assigned to the super transcript. EMC1 loses ~50% of its reads to the super transcript, and PDI1 only loses ~10%. I'm not showing it, but if you remove the default PDI1 transcript from the index (so it's just the super transcript + the 3 genes MGR1/EMC1/POF1), all three of them lose all of their reads to the super transcript... meaning that whether or not EMC1 gets assigned any reads depends entirely on the presence of a non-overlapping gene, PDI1, in the salmon index. This is definitely at odds with our intuition from looking at the coverage plots, but it makes sense when you reduce all the transcripts to a simple reads-per-kb calculation.

As before, if you turn off length modeling then all of the reads get assigned to the super transcript.

I hope this was insightful and cleared up the issue a bit. Feel free to e-mail or reply here.

Best, Jason

hiraksarkar commented 4 years ago

Hi @jasonvrogers ,

This is a very interesting problem, and I (a senior Ph.D. student with @rob-p) have been working on these kinds of outlier cases for a while now (almost done with my Ph.D.). In my thesis, I am including a chapter on opportunities and challenges in bulk RNA-seq quantification, and this particular example, which I refer to as "the subset effect", perfectly showcases one of those challenges.

I would like to refer to your examples in one of the subsections of that chapter (citing this issue and your name), if you give permission to use them in the thesis.

Let me know.

Thanks Hirak

jasonvrogers commented 4 years ago

Hi @hiraksarkar ,

Absolutely feel free to use this example! Let me know if you need any more info.

-Jason

mohsenzakeri commented 3 years ago

Hi @jasonvrogers,

First of all, thank you so much for sharing this thorough analysis with us; it was very clear and helpful for understanding the details of the problem you are referring to.

Secondly, I apologize for how long it took me to get back to you. I would like you to know that we have been, and still are, working on possible solutions for addressing this problem, and here I would like to share some updates with you.

About the success cases, it was nice to see that the current Salmon model with length correction works pretty well at recovering the right estimates for those "easier" cases where one transcript is fully contained in another. Turning off the length correction tells the Salmon model not to consider the effective length of each transcript when computing the conditional probability of a fragment originating from that transcript. So, for RNA-seq data there is no reason to turn off this term of the model, and we highly recommend not using that flag for bulk RNA-seq abundance estimation with Salmon.

Looking more carefully at the second case you posted, the failure case, it is interesting to see that there is very nice visual evidence on the super transcript that the long transcript might not be expressed at all. I am referring to the zero-coverage regions on the super transcript between the regions corresponding to the smaller transcripts, e.g., between POF1 and EMC1. So, we tried a solution that inspects the coverage profile of all transcripts and calculates the probability of observing a zero-coverage region on each transcript. If this probability is too low, it is counted as evidence that the transcript is not expressed at all.

This approach seems to work fine on the example you shared here. However, one problem was that a considerable number of reads in the sample mapped uniquely to the super transcript, and turning off the expression of that transcript would mean treating those reads as unmapped. Furthermore, the problem was more evident when we tried the approach on other, larger samples: it could affect the expression of a lot of transcripts very significantly, especially in real samples, where coverage is often not uniform and zero-coverage regions on a transcript are more common due to unannotated transcripts in the sample, etc. We are currently looking for more thorough solutions for dealing with the coverage profiles of transcripts, and I'll try to update you as we make more progress.
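
To give a rough idea of the kind of test I mean (this is just an illustrative sketch of the intuition, not the actual code we are experimenting with): under a null model where the transcript really is expressed and its fragments start uniformly at random along it, a large zero-coverage gap becomes astronomically unlikely once the transcript carries many reads.

```python
# Illustrative sketch only: approximate probability that a fixed window of
# `gap_len` bases receives no coverage when n_frags fragments of roughly
# mean_frag_len start uniformly along a transcript of length tlen.
def prob_zero_gap(tlen, n_frags, gap_len, mean_frag_len=250):
    positions = max(tlen - mean_frag_len + 1, 1)           # possible fragment start sites
    # a fragment covers the window if it starts within this many positions of it
    blocked = min(gap_len + mean_frag_len - 1, positions)
    return (1.0 - blocked / positions) ** n_frags

# a 300 bp hole on a 2 kb "super transcript" carrying 800 fragments is
# essentially impossible if the whole transcript were really expressed
print(prob_zero_gap(tlen=2000, n_frags=800, gap_len=300))
```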

Thank you again for the detailed explanation; I hope to get back to you soon.

Best, Mohsen

jasonvrogers commented 3 years ago

Hi Mohsen,

Thank you for your interest in this!

It really is a hard problem, and I would like to emphasize that I think salmon does an amazing job as it is. The examples I've shown are pretty unlikely to be problematic in the majority of real datasets, but certainly they do arise in some instances (like with many splicing isoforms or alternative transcriptional start/stop sites).

I'm curious to know how things progress on this issue. It seems like using the coverage as evidence of a transcript "not being expressed at all" may be too binary for this problem; certainly in real data, including my own, there are serious coverage dropoffs for long genes or at low sequencing depth, and that doesn't mean the transcript is not expressed.

Second, regarding unique mappers to the "super transcript": this highlights the problem of using gene bodies that do not incorporate UTRs, as in my examples above. The quantification can only be as good as the ground-truth transcriptome you're working with, so every gene is going to be a bit wrong if UTRs are not included in the transcriptome sequence. For my real data analysis, I've dealt with this by 1) adding UTRs to all genes whenever possible, and if no UTR data is available, extending each gene body by ~100 bp in either direction (if two genes end up overlapping a bit, this turns out not to be problematic at all because of how salmon apportions reads in accordance with the length bias model), and 2) setting the mapping flags --softclip and --minScoreFraction 0.3; this helps a LOT, since if one mate of a read pair maps perfectly within a gene and the other falls in an unannotated UTR region, the read will still be assigned correctly to the gene.
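
In case it's useful to anyone else, the padding step is trivial to script; here's roughly what I do (this assumes a simple 6-column BED of transcript coordinates and is just my own workflow, nothing salmon-specific):

```python
# Pad each annotated transcript by ~100 bp on both sides, clamped to the
# chromosome ends; the padded intervals can then be turned into a transcript
# fasta (e.g. with bedtools getfasta) and used to rebuild the salmon index.
def pad_bed(in_bed, out_bed, chrom_sizes, pad=100):
    with open(in_bed) as fin, open(out_bed, "w") as fout:
        for line in fin:
            chrom, start, end, name, score, strand = line.split()[:6]
            start = max(0, int(start) - pad)
            end = min(chrom_sizes[chrom], int(end) + pad)
            fout.write(f"{chrom}\t{start}\t{end}\t{name}\t{score}\t{strand}\n")

# chrom_sizes would come from a .fai / chrom.sizes file, e.g. {"chrI": 230218, ...}
```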

Lastly, for solving the problem at hand with the "super transcript" scenario using coverage... perhaps instead of looking for regions of zero coverage, you could keep track of the overall variance in read depth over a gene and assign reads in a way that minimizes that variance. In my example, since the super transcript has regions of very high, zero, and intermediate read depth, the total variance in read depth across it would be quite high, while the variance over the sub-transcripts would be much lower. Similarly, for a gene with a strong coverage bias, even if it has regions of zero depth somewhere within it, a super transcript containing that gene plus other regions outside the gene, which also have zero depth, would by definition have an even higher variance, so you could use that to penalize the read assignment.
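
Just to illustrate what I mean by the variance idea (again, purely hypothetical, not a salmon feature), even something as simple as a windowed coefficient of variation of depth separates the super transcript from its sub-transcripts:

```python
# Hypothetical scoring of a candidate read-to-transcript assignment by how uneven
# the resulting coverage is (coefficient of variation of depth over fixed windows).
import numpy as np

def coverage_unevenness(read_starts, read_len, tlen, window=10):
    depth = np.zeros(tlen)
    for s in read_starts:
        depth[s:s + read_len] += 1                    # stack read coverage
    n_win = tlen // window
    win_depth = depth[:n_win * window].reshape(n_win, window).mean(axis=1)
    mean = win_depth.mean()
    return float(win_depth.std() / mean) if mean > 0 else 0.0

# reads piled only on the first half of a 1 kb transcript: very uneven if assigned
# to the long form, but close to uniform if assigned to a 500 bp truncated isoform
starts = np.random.randint(0, 450, size=400)
print(coverage_unevenness(starts, 50, 1000))   # high unevenness on the long form
print(coverage_unevenness(starts, 50, 500))    # much lower on the short form
```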

Of course, that's just my first thought on the issue and there may be technical or biological reasons why that isn't the best solution. Let me know if you have any success or want to brainstorm a bit.

Best, Jason