sidak correction in region simulations.

brentp commented 13 years ago

There is no multiple-testing correction on the p-values for the regions. rpsim can do the sidak correction correctly because it knows the entire region length (sum of coverage in -p argument)

so sidak is

sidak_p = 1-(1 - p)^k

where k is the number of possible regions of that length:

k = total_coverage_bp / region_length_bp

total_coverage_bp is the total number of bases covered by region to -p. region_length_bp is simply the end - start for the region. Probably do this as:

coverage = collections.defaultdict(set)

for chrom, start, end in regions:
    coverage[chrom].update(range(start, end))

total_coverage_bp = sum(len(rset) for rset in coverage.itervalues())

Since they are sorted, can also use less memory using groupby on chrom and calculating the coverage per-chrom...

brentp commented 13 years ago

In checking real data, it's clear that if --threshold is much higher than --seed, then the ends of the region will have high p-values. This lowers the overall significance. It might be good to trim the regions in peaks.py to the edge of the last value > --seed.

brentp commented 13 years ago

Much of this is implemented as of: 32b4692e25709

brentp commented 13 years ago

Done.

brentp / combined-pvalues

sidak correction in region simulations. #2