Closed brentp closed 13 years ago
In checking real data, it's clear that if --threshold is much higher than --seed, then the ends of the region will have high p-values. This lowers the overall significance. It might be good to trim the regions in peaks.py to the edge of the last value > --seed.
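A minimal sketch of the trimming idea, not the actual peaks.py code. It assumes a region is a position-sorted list of (position, value) pairs and that the caller supplies a `passes_seed` predicate, since the direction of the comparison depends on how the values are scaled. `trim_region` and `passes_seed` are hypothetical names.

```python
def trim_region(sites, passes_seed):
    """Trim a region to the first and last sites that pass the seed cutoff.

    sites: list of (position, value) tuples sorted by position (assumed layout).
    passes_seed: callable returning True when a value satisfies --seed.
    """
    # Indexes of every site that would be allowed to seed a region.
    seed_idxs = [i for i, (_, value) in enumerate(sites) if passes_seed(value)]
    if not seed_idxs:
        # No site passes the seed cutoff, so nothing of the region remains.
        return []
    # Keep everything between the outermost seed-passing sites, inclusive.
    return sites[seed_idxs[0]:seed_idxs[-1] + 1]


# Hypothetical region extended with a loose threshold (0.05) around a seed of 0.01.
region = [(100, 0.04), (150, 0.001), (200, 0.03), (250, 0.0005), (300, 0.045)]
trimmed = trim_region(region, lambda p: p <= 0.01)
```

Interior sites that only pass --threshold (position 200 here) are kept; only the weak flanks at 100 and 300 are dropped.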
Much of this is implemented as of: 32b4692e25709
Done.
There is no multiple-testing correction on the p-values for the regions. rpsim can do the Sidak correction correctly because it knows the entire region length (sum of coverage in the -p argument),
so the Sidak-corrected p-value is:

    1 - (1 - p)^k

where k is the number of possible regions of that length:

    k = total_coverage_bp / region_length_bp

total_coverage_bp is the total number of bases covered by the regions passed to -p. region_length_bp is simply end - start for the region. Since the intervals are sorted, this can probably also be done with less memory by using groupby on chrom and calculating the coverage per-chrom...
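A rough sketch of how this could look, not the rpsim implementation. It assumes the -p intervals arrive as (chrom, start, end) tuples, sorted and non-overlapping. The function names are hypothetical. The correction is the standard Sidak form 1 - (1 - p)^k, computed with log1p/expm1 so that very small p-values do not lose precision.

```python
import math
from itertools import groupby
from operator import itemgetter


def total_coverage_bp(intervals):
    """Sum end - start over sorted (chrom, start, end) intervals.

    groupby on chrom means only one chromosome's intervals are consumed
    at a time, so `intervals` can be a lazy iterator over a sorted file.
    Assumes the intervals do not overlap.
    """
    total = 0
    for _chrom, group in groupby(intervals, key=itemgetter(0)):
        total += sum(end - start for _, start, end in group)
    return total


def sidak(p, region_length_bp, coverage_bp):
    """Sidak-corrected p-value with k = coverage_bp / region_length_bp."""
    if p >= 1.0:
        return 1.0
    k = coverage_bp / float(region_length_bp)
    # Equivalent to 1 - (1 - p) ** k, but stable when p is tiny.
    return min(1.0, -math.expm1(k * math.log1p(-p)))


# Hypothetical input: 300 covered bases in total, region of 100 bases, so k = 3.
intervals = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 50, 150)]
coverage = total_coverage_bp(iter(intervals))
corrected = sidak(0.01, 100, coverage)
```

With k = 3 and p = 0.01 the corrected value is 1 - 0.99^3, about 0.0297.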