elsasserlab / labcode

Utils to perform frequent data analyses in the lab.
GNU General Public License v3.0

Running replicates statistics on bins and bed files #80

Closed cnluzon closed 3 years ago

cnluzon commented 3 years ago

Included a bwstats.R file with functionality to compute statistics based on the DESeq2 package, for both bins and BED files.

Added functions: bw_bed_diff_analysis and bw_bins_diff_analysis. These operate on two lists of bigWig files plus group labels. Labels are mandatory at this point because I believe this makes the analysis more robust to mistakes like forgetting what was compared against what. So you need to provide at least "treated" / "untreated", or some other labels that identify the two groups of bigWig files, as in the sketch below.
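
For illustration, a minimal sketch of a call (the file names are placeholders, and the argument order is assumed from the example further down):

    # hypothetical bigWig paths; each group is a vector of replicate files
    treated_bw   <- c("treated_rep1.bw", "treated_rep2.bw")
    untreated_bw <- c("untreated_rep1.bw", "untreated_rep2.bw")

    # labels are mandatory so the direction of the comparison stays explicit
    res <- bw_bins_diff_analysis(treated_bw, untreated_bw, "treated", "untreated")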

This is a work in progress; I still need to add automated testing. Even though I am not including value-level tests at this point (we rely on DESeq2 for the actual computation), I am going to include at least tests that check that it runs and that the parameters passed end up where they are supposed to go.
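
Roughly the kind of smoke test I have in mind, sketched with testthat (the fixture paths are hypothetical):

    library(testthat)

    test_that("bw_bins_diff_analysis runs and returns a results-like table", {
      # hypothetical small fixture bigWig files shipped with the package tests
      treated_bw   <- c("fixtures/treated_1.bw", "fixtures/treated_2.bw")
      untreated_bw <- c("fixtures/untreated_1.bw", "fixtures/untreated_2.bw")

      res <- bw_bins_diff_analysis(treated_bw, untreated_bw, "treated", "untreated")

      # no value checks (DESeq2 is trusted for those); just shape and column names
      expect_true(all(c("log2FoldChange", "pvalue", "padj") %in% colnames(res)))
    })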

Since we mostly operate on scaled bigWig files, the estimateSizeFactors step that is usually performed by DESeq2 is skipped by default. An estimate_size_factors parameter is provided instead:

bw_bins_diff_analysis(bwlist_1, bwlist_2, "treated", "untreated", estimate_size_factors = TRUE)

will run the usual DESeq2 size factor estimation. I provide this for the cases where we are not looking at scaled bigWig files, but the default value of this parameter is FALSE.
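
For context, skipping size factor estimation in plain DESeq2 amounts to fixing all size factors to 1 before calling DESeq(), which is presumably what estimate_size_factors = FALSE corresponds to. A minimal sketch with simulated integer counts:

    library(DESeq2)

    # simulated integer count matrix: 100 bins x 4 samples
    counts <- matrix(rpois(400, lambda = 500), ncol = 4)
    coldata <- data.frame(group = factor(c("treated", "treated", "untreated", "untreated")))

    dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ group)

    # scaled bigWigs are already normalized across samples, so fix the size
    # factors to 1 instead of estimating them from the data
    sizeFactors(dds) <- rep(1, ncol(dds))

    # DESeq() keeps pre-set size factors and skips estimateSizeFactors()
    dds <- DESeq(dds)
    res <- results(dds)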

The return value of these functions is a results table like the one obtained from DESeq2::results, which means that for any analysis you still need to set cutoff thresholds for p-value and/or fold change.
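
For example, a typical downstream filter on the returned table (the thresholds here are arbitrary, not defaults of these functions):

    # res: the table returned by bw_bins_diff_analysis / bw_bed_diff_analysis
    df <- as.data.frame(res)

    # example thresholds: FDR 5% and at least a two-fold change
    significant <- df[which(df$padj < 0.05 & abs(df$log2FoldChange) >= 1), ]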

shaorray commented 3 years ago

One detail about DESeq2's testing: the input counts need to be integers larger than 1, and preferably above 100.

Since it transforms read counts (or bin counts) with a hyperbolic function, small values (1, 10) are treated much more sensitively by DESeq2 than larger ones, like 1000.

Because rsamtools::summary gives read density in the bins, the counts are normally below 1. So multiplying by the bin width, or by any large constant, is a good idea to avoid artefacts.
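
A sketch of that workaround, assuming a matrix of per-bin densities like the ones described above:

    # density_matrix: per-bin mean coverage (typically < 1), bins in rows, samples in columns
    density_matrix <- matrix(runif(400, 0, 0.5), ncol = 4)
    bin_width <- 10000

    # scale up by the bin width (or any large constant) and round, since DESeq2
    # expects integer counts and is noisier at very small values
    count_matrix <- round(density_matrix * bin_width)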

cnluzon commented 3 years ago

My idea was to simulate the "real" number of reads using the estimated fragment length and bin size, but then I thought a constant value independent of length might be better when the input is not bins but a BED file whose intervals vary in length. Of course, one could argue that large differences in length would affect the results either way.

I will use a larger constant for the count matrix so that the results are more robust. Thanks!