MonashBioinformaticsPlatform / RNAsik-pipe

RNAsik - more than just a pipeline
https://monashbioinformaticsplatform.github.io/RNAsik-pipe/
Apache License 2.0
13 stars 5 forks source link

Better estimation of strand bias / strandedness #35

Closed pansapiens closed 5 years ago

pansapiens commented 5 years ago

It's great that RNAsik can mostly guess whether a library prep is 'stranded' or not based on the strand bias of the mapped reads - however I've seen cases where the 0.9 threshold is not quite met (~0.85), but the run should (probably?) be still treated as stranded.

I wonder if there is a better statistical treatment rather than simply using a fixed threshold independent of library size. I've discussed using a z-score as one option offline (with @tsonika), but on reflection I'm not clear on how it would be constructed for deciding if a run is stranded.

As another point of reference - this how GATK does it at the variant level: StrandOddsRatio (or maybe Fisher Exact test for low coverage) - maybe this could also be applied at the mapped read level ?

drpowell commented 5 years ago

I don't think it matters, 0.9 threshold is fine provided it is easy to re-run with the strandedness forced by a parameter (and have it only re-run the minimal steps)

serine commented 5 years ago

end of last year improved strand_guessing.py script to produce two files:

examples below

[bioinformatics1]~/.../sikRun/countFiles$ cat strandInfoAll.txt 
sample  ForwardStrandedCounts   NonStrandedCounts       ReverseStrandedCounts   frac
Ctr_1  967801  37278052        37889607        -0.9501870531354021
Ctr_2  983324  37549713        38175507        -0.9497776631789646
Light_1 1147034 44542944        45306549        -0.9506159083573812
Light_2      1000089 37625117        38262926        -0.9490569432836474
[bioinformatics1]~/.../sikRun/countFiles$ cat strandInfoGuess.txt 
ReverseStrandedCounts,-0.9501810300126254,0

strandInfoAll.txt allows easy eyeballing for any individual samples outliers, and the guess files hasn't been changed.

guessing part has an assumption that all samples follow the same strand, but if there an issue with a couple of samples the guess might be wrong.

I'll look into including strandInfoAll.txt into multiqc report, I think this would be useful addition.

I'll close this issue for now, but feel free to open it again people.

cheers

p.s the change was in this commit 0487680f4aa3f203ea95906a9eda7aeb39730306 and I'm using pandas now as well, will update docs respectively

pansapiens commented 3 years ago

This looks like a more comprehensive method that could be used for strandedness guessing: https://github.com/betsig/how_are_we_stranded_here

serine commented 3 years ago

thanks @pansapiens, but I don't think I'll be including that tool. It uses kalisto under the hood to get bam files and then runs infer_experiment.py from RSeQC. Whereas RNAsik works directly off featureCounts output. I anticipate using that tool will be slower, and in general doesn't fit into RNAsik workflow. (Although I've been meaning to add salmon/kalisto pseudo-aligners support)

In terms of RSeQC tool, it is nice, and there is even a fork here that I tried to get going a while back to include some of the modules into RNAsik. But for some reason it just doesn't stick and I don't use it

Cheers

pansapiens commented 3 years ago

Not necessarily expecting that the tool itself would be integrated, but it's a good reference point for a method that gives more reliable strandedness guesses, especially if one of the pseudo-aligners is included as a dependency anyway at a later date.

serine commented 3 years ago

Looking at the code a bit more, it isn't bad approach and it can work with any bam files. There is something about RSeQC that it has good ideas, but ok execution and usability, and I can't trust it 100% for some reason

                if len(p_strandness) >0 and len(s_strandness) ==0 :
                        protocol="PairEnd"
                        #for k,v in p_strandness.items():
                        #       print >>sys.stderr, k + '\t' + str(v)
                        spec1= (p_strandness['1++'] + p_strandness['1--'] + p_strandness['2+-'] + p_strandness['2-+'])/float(sum(p_strandness.values()))
                        spec2= (p_strandness['1+-'] + p_strandness['1-+'] + p_strandness['2++'] + p_strandness['2--'])/float(sum(p_strandness.values()))
                        other = 1-spec1-spec2

function name that does the heavy lifting and the code above is def configure_experiment(self,refbed,sample_size, q_cut = 30) in lib/qcmodule/SAM.py

serine commented 3 years ago

Just for my own reference while I understand this. 1++ means read 1 from the pair mapped to the forward strand and the gene is also forward stranded, this is under assumption that the read actually mapped to the gene body (exon)

I guess the bit about RSeQC make me think it could do better, why do the sum and only report spec1 and spec2 ? Would you want to know number of individual alignments? Why not report all 8 possible alignments and all the aligned/unassigned reads

This is why I forked RSeQC in the first place (what looks like many years ago now) and I just wanted to tweak so that the numbers are little more verbose and meaningful.