How to appropriately interpret & utilize strand information in call-1

Hi Michael & fellow users,

I have a few questions about the strand information in JACUSA 2 output, and the proper usage of that information for RNA editing identification & frequency estimation:

In the variant detection scenario call-1 and given an RNA sample sequenced on the first strand, using -P RF-FIRSTSTRAND will allow some sites to have two lines of output, each for a strand, like this bed-like out output:

#contig start   end     name    score   strand  bases11 info    filter  ref
3       10084   10085   call-1  0.0     +       0,5,0,0 *       *       C
3       10084   10085   call-1  0.0     -       0,0,7,0 *       *       G
3       10085   10086   call-1  0.0     +       0,0,5,0 *       *       G
3       10085   10086   call-1  0.0     -       0,7,0,0 *       *       C
3       10086   10087   call-1  0.0     +       0,0,0,5 *       *       T
3       10086   10087   call-1  0.0     -       7,0,0,0 *       *       A
3       10087   10088   call-1  0.0     +       5,0,0,0 *       *       A
3       10087   10088   call-1  0.0     -       0,0,0,7 *       *       T

Above is a simple case of non-editing sites, where the ref allele is simply inverted, but the allele counts (here equal to depth) are not the same (but similar) for inverted bases. How to interpret this? My guess is that there are some reads mapped to both strands, contributing to allele counts on both strands, while reads that mapped specifically to either strands contribute to the difference in allele counts. Can you confirm on this?

#contig start   end     name    score   strand  bases11 info    filter  ref
3       10088   10089   call-1  1.556381955545767       +       2,0,3,0 *       *       G
3       10088   10089   call-1  1.4505124214123128      -       0,5,0,2 *       *       C

In a more complicated case, where RNA editing is detected on both strands, I feel uncertain about how to interpret this result. I wonder if it's ok to interpret this as true RNA editing events on both strands, or if we are more confident about editing called on one of the strands, based on certain statistics (e.g., Z score)?

If my guess about the first example is correct, it could be possible that the 3 reads covering G on the positive strand are contributed by the same reads that also mapped to C on the negative strand (base order : A, C, G, T). If unfortunately those reads were from negative strand-sourced transcripts, then RNA editing occurred at the genomic location on negative strand.

The problem is: we won't know what proportion of allele counts are actually contributed by the same reads, nor which strand of transcripts were those reads from.

I could think of the following options to determine which strand an editing is more likely to occurred at, when JACUSA2 detected candidate editing on both strands of a given site:

Simply pick the strand that has a higher z score (i.e., the divergence between sample and simulated base vectors, if in call-1).
Pick the strand that has a higher total coverage at that site; if coverage ties between strands, pick the strand with a higher z score.
Discard these sites on both strands.

Since the proportion of overlapping genes on opposite strands is generally low (an average of 3.3% of overlapping exons on opposite strands in human chromosome), I think it is fine to either discard these sites, or pick one strand and give up potential information on the other. What do you think?

I know this is a long discussion, but I would appreciate a lot if you could clarify my questions, and/or offer your insights in how to solve the dual-strand problem.

dieterich-lab / JACUSA2

How to appropriately interpret & utilize strand information in call-1 #69