I don't think it matters; the 0.9 threshold is fine provided it is easy to re-run with the strandedness forced by a parameter (and have it only re-run the minimal steps).
At the end of last year I improved the strand_guessing.py script to produce two files:

strandInfoAll.txt
strandInfoGuess.txt

Examples below:
[bioinformatics1]~/.../sikRun/countFiles$ cat strandInfoAll.txt
sample ForwardStrandedCounts NonStrandedCounts ReverseStrandedCounts frac
Ctr_1 967801 37278052 37889607 -0.9501870531354021
Ctr_2 983324 37549713 38175507 -0.9497776631789646
Light_1 1147034 44542944 45306549 -0.9506159083573812
Light_2 1000089 37625117 38262926 -0.9490569432836474
[bioinformatics1]~/.../sikRun/countFiles$ cat strandInfoGuess.txt
ReverseStrandedCounts,-0.9501810300126254,0
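For reference, the frac column is just the normalised difference between the forward and reverse stranded counts, i.e. (ForwardStrandedCounts - ReverseStrandedCounts) / (ForwardStrandedCounts + ReverseStrandedCounts); for Ctr_1 that gives (967801 - 37889607) / (967801 + 37889607) ≈ -0.9502, matching the table. A minimal pandas sketch (not the actual strand_guessing.py code) that recomputes it from strandInfoAll.txt:

```python
import pandas as pd

# Recompute the per-sample strandedness fraction from strandInfoAll.txt.
# Assumes the whitespace-separated header shown above; the formula is inferred
# from the example numbers, not lifted from strand_guessing.py itself.
df = pd.read_csv("strandInfoAll.txt", sep=r"\s+")
df["frac_check"] = (df["ForwardStrandedCounts"] - df["ReverseStrandedCounts"]) / (
    df["ForwardStrandedCounts"] + df["ReverseStrandedCounts"]
)
print(df[["sample", "frac", "frac_check"]])
```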
strandInfoAll.txt allows easy eyeballing of any individual sample outliers; the guess file hasn't changed. The guessing part assumes that all samples follow the same strandedness, so if there is an issue with a couple of samples the guess might be wrong.
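A quick way to automate that eyeballing (just a sketch, not something RNAsik does at the moment) would be to flag any sample whose frac disagrees in sign with the majority or sits well away from the median:

```python
import pandas as pd

# Flag samples whose strandedness fraction disagrees with the rest of the run.
# Thresholds here are arbitrary and purely illustrative.
df = pd.read_csv("strandInfoAll.txt", sep=r"\s+")
median_frac = df["frac"].median()
outliers = df[(df["frac"] * median_frac < 0) | ((df["frac"] - median_frac).abs() > 0.1)]
if not outliers.empty:
    print("Samples disagreeing with the consensus strandedness guess:")
    print(outliers[["sample", "frac"]])
```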
I'll look into including strandInfoAll.txt in the MultiQC report; I think this would be a useful addition.
I'll close this issue for now, but feel free to reopen it.
cheers
P.S. The change was in commit 0487680f4aa3f203ea95906a9eda7aeb39730306, and I'm using pandas now as well; I'll update the docs accordingly.
This looks like a more comprehensive method that could be used for strandedness guessing: https://github.com/betsig/how_are_we_stranded_here
Thanks @pansapiens, but I don't think I'll be including that tool. It uses kallisto under the hood to get BAM files and then runs infer_experiment.py from RSeQC, whereas RNAsik works directly off featureCounts output. I anticipate that tool will be slower, and in general it doesn't fit into the RNAsik workflow. (Although I have been meaning to add salmon/kallisto pseudo-aligner support.)
As for the RSeQC tool itself, it is nice, and there is even a fork here that I tried to get going a while back to include some of its modules in RNAsik, but for some reason it just doesn't stick and I don't use it.
Cheers
I'm not necessarily expecting the tool itself to be integrated, but it's a good reference point for a method that gives more reliable strandedness guesses, especially if one of the pseudo-aligners is included as a dependency anyway at a later date.
Looking at the code a bit more, it isn't a bad approach and it can work with any BAM files. There is something about RSeQC: it has good ideas, but only okay execution and usability, and I can't trust it 100% for some reason.
if len(p_strandness) > 0 and len(s_strandness) == 0:
    protocol = "PairEnd"
    #for k,v in p_strandness.items():
    #    print >>sys.stderr, k + '\t' + str(v)
    # fraction of read pairs where read 1 matches the gene strand (forward stranded protocol)
    spec1 = (p_strandness['1++'] + p_strandness['1--'] + p_strandness['2+-'] + p_strandness['2-+']) / float(sum(p_strandness.values()))
    # fraction of read pairs where read 1 is antisense to the gene (reverse stranded protocol)
    spec2 = (p_strandness['1+-'] + p_strandness['1-+'] + p_strandness['2++'] + p_strandness['2--']) / float(sum(p_strandness.values()))
    other = 1 - spec1 - spec2
The function that does the heavy lifting (the code above is from it) is def configure_experiment(self, refbed, sample_size, q_cut=30) in lib/qcmodule/SAM.py.
Just for my own reference while I get my head around this: 1++ means read 1 from the pair mapped to the forward strand and the gene is also forward stranded, under the assumption that the read actually mapped to the gene body (exon).
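Writing out my reading of all eight keys (first character = which read of the pair, second = the strand the read mapped to, third = the strand of the overlapping gene; the grouping into forward/reverse comes straight from the spec1/spec2 lines above):

```python
# My reading of the paired-end categories used by configure_experiment
# (an interpretation, so treat it as an assumption rather than RSeQC docs).
KEY_MEANING = {
    "1++": "read 1 on + strand, gene on + strand (read 1 sense)",
    "1--": "read 1 on - strand, gene on - strand (read 1 sense)",
    "2+-": "read 2 on + strand, gene on - strand (mate of a sense read 1)",
    "2-+": "read 2 on - strand, gene on + strand (mate of a sense read 1)",
    "1+-": "read 1 on + strand, gene on - strand (read 1 antisense)",
    "1-+": "read 1 on - strand, gene on + strand (read 1 antisense)",
    "2++": "read 2 on + strand, gene on + strand (mate of an antisense read 1)",
    "2--": "read 2 on - strand, gene on - strand (mate of an antisense read 1)",
}
```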
I guess this is the bit about RSeQC that makes me think it could do better: why do the sum and only report spec1 and spec2? Wouldn't you want to know the number of individual alignments? Why not report all 8 possible alignment categories and all of the aligned/unassigned reads?
This is why I forked RSeQC in the first place (what looks like many years ago now): I just wanted to tweak it so that the numbers are a little more verbose and meaningful.
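Something like this is the kind of breakdown I have in mind - print every category plus the two summary fractions (a sketch over a p_strandness-style counts dict, not the actual code in my fork):

```python
import sys

def report_strandness(p_strandness):
    """Print every read/gene strand category rather than only spec1 and spec2.

    p_strandness maps a category key (e.g. '1++') to a read-pair count,
    the same shape as the dict used in RSeQC's configure_experiment.
    """
    total = float(sum(p_strandness.values()))
    for key in sorted(p_strandness):
        count = p_strandness[key]
        print("%s\t%d\t%.4f" % (key, count, count / total), file=sys.stderr)
    spec1 = sum(p_strandness.get(k, 0) for k in ("1++", "1--", "2+-", "2-+")) / total
    spec2 = sum(p_strandness.get(k, 0) for k in ("1+-", "1-+", "2++", "2--")) / total
    print("forward: %.4f\treverse: %.4f\tother: %.4f" % (spec1, spec2, 1 - spec1 - spec2),
          file=sys.stderr)
```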
It's great that RNAsik can mostly guess whether a library prep is 'stranded' or not based on the strand bias of the mapped reads - however I've seen cases where the 0.9 threshold is not quite met (~0.85), but the run should (probably?) still be treated as stranded.
I wonder if there is a better statistical treatment rather than simply using a fixed threshold independent of library size. I've discussed using a z-score as one option offline (with @tsonika), but on reflection I'm not clear on how it would be constructed for deciding if a run is stranded.
As another point of reference - this is how GATK does it at the variant level: StrandOddsRatio (or maybe a Fisher exact test for low coverage) - maybe this could also be applied at the mapped-read level?
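As a strawman for a count-aware test (this is not how GATK computes StrandOddsRatio, just one possible construction using scipy): treat each assigned read as a Bernoulli trial, test the forward vs reverse stranded counts against a 50/50 null, and keep an effect-size cutoff as well, since with tens of millions of reads almost any imbalance comes out 'significant'.

```python
from scipy.stats import binomtest

def stranded_call(forward, reverse, min_frac=0.85, alpha=1e-6):
    """Illustrative only: binomial test plus an effect-size cutoff.

    forward/reverse are one sample's ForwardStrandedCounts/ReverseStrandedCounts;
    min_frac and alpha are arbitrary choices here, not RNAsik parameters.
    """
    n = forward + reverse
    frac = (forward - reverse) / n  # same sign convention as the frac column above
    p = binomtest(max(forward, reverse), n, p=0.5, alternative="greater").pvalue
    if p < alpha and abs(frac) >= min_frac:
        return "ReverseStrandedCounts" if frac < 0 else "ForwardStrandedCounts"
    return "NonStrandedCounts"

# e.g. for Ctr_1 from the table above:
print(stranded_call(967801, 37889607))  # -> ReverseStrandedCounts
```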