SchulzLab / TEPIC

Annotation of genomic regions using transcription factor binding sites and epigenetic data
MIT License

findBackground Python script #19

Closed by ghost 7 years ago

ghost commented 7 years ago

Standalone version of the background matching script, developed for Python 2.7.9.

It currently supports only region length and GC content as matching features.
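To make the two features concrete, here is a minimal sketch of how they could be computed for a single region. This is not the code from the PR: the function names `gc_content` and `region_features` are hypothetical, the sketch is written for Python 3 rather than 2.7.9, and ignoring ambiguous bases such as `N` is an assumption.

```python
def gc_content(seq):
    """Return the GC percentage of a DNA sequence.

    Assumption: ambiguous bases (e.g. N) are ignored rather than counted.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Avoid division by zero for sequences without any unambiguous base.
    return 100.0 * gc / total if total else 0.0


def region_features(seq):
    """Return the (length, GC percentage) pair used as matching features."""
    return len(seq), gc_content(seq)
```

For example, `region_features("ACGT")` gives a length of 4 and a GC content of 50 percent.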

I ran a couple of tests, and the performance is roughly as follows:

- Input: 200,000 DNase regions (Blueprint data) on 23 chromosomes
- CPUs: 3
- Timeout: 2 minutes (per chromosome)
- Relaxation: at most 2 percentage points
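For readers unfamiliar with this kind of search, the sketch below shows one way the timeout and relaxation parameters could interact in a randomized matching loop. It is an illustration only, not the script from this PR: `match_background` and its parameters are hypothetical, it assumes that "relaxation" means the allowed GC difference is widened in steps of one percentage point, and it assumes region lengths must match exactly.

```python
import random
import time


def match_background(targets, candidates, max_relax=2.0, timeout=120.0,
                     tries=1000, seed=None):
    """Randomly search `candidates` for a partner for each target.

    targets, candidates: lists of (length, gc_percent) tuples.
    Returns a dict mapping target index -> matched candidate.
    """
    rng = random.Random(seed)
    deadline = time.monotonic() + timeout
    unused = list(candidates)
    matches = {}
    for t_idx, (t_len, t_gc) in enumerate(targets):
        relax = 0.0
        # Widen the GC tolerance step by step, up to max_relax points.
        while relax <= max_relax and t_idx not in matches:
            for _ in range(tries):
                # Stop early when the pool is empty or time has run out.
                if not unused or time.monotonic() > deadline:
                    return matches
                i = rng.randrange(len(unused))
                c_len, c_gc = unused[i]
                if c_len == t_len and abs(c_gc - t_gc) <= relax:
                    # Swap-remove so each candidate is used at most once.
                    unused[i] = unused[-1]
                    unused.pop()
                    matches[t_idx] = (c_len, c_gc)
                    break
            relax += 1.0
    return matches
```

Because candidates are drawn blindly, most draws are wasted on regions that cannot match, which is why the number of matches depends directly on the time budget.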

That gives ~160,000 matched regions in roughly 20 minutes, i.e. ~2,000-4,000 matches per minute of search time per chromosome. Memory consumption is <= 10 GB in this scenario.
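As a rough sanity check on that rate, the average can be recomputed from the reported totals. The sketch assumes every chromosome uses its full 2-minute timeout, which is a simplification; actual per-chromosome rates will vary with chromosome size.

```python
matched_regions = 160000   # reported total, approximate
chromosomes = 23
timeout_minutes = 2        # search time budget per chromosome

per_chromosome = matched_regions / chromosomes
per_minute = per_chromosome / timeout_minutes

print(round(per_chromosome), round(per_minute))  # prints: 6957 3478
```

The average of about 3,500 matches per minute falls inside the quoted range.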

This is still the original randomized search; using an appropriate index may speed up the process substantially. Also, lowering the memory requirements would be possible at the expense of a longer start-up time.
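One possible form such an index could take is a dictionary that bins candidate regions by length and rounded GC content, so that a lookup inspects only a few buckets instead of drawing at random. This is a sketch of the idea, not a proposal for the actual implementation: `build_index` and `lookup` are hypothetical names, and binning GC to whole percentage points makes the tolerance approximate (a hit can be off by up to about half a point more than the bin offset suggests).

```python
from collections import defaultdict


def build_index(candidates):
    """Group (length, gc_percent) candidates by (length, rounded GC)."""
    index = defaultdict(list)
    for length, gc in candidates:
        index[(length, int(round(gc)))].append((length, gc))
    return index


def lookup(index, length, gc, max_relax=2):
    """Return and consume a candidate from the nearest non-empty GC bin.

    Bins are searched outward from the target's own bin, up to
    max_relax bins away. Returns None if nothing suitable is left.
    """
    center = int(round(gc))
    for offset in range(max_relax + 1):
        for gc_bin in sorted({center - offset, center + offset}):
            bucket = index.get((length, gc_bin))
            if bucket:
                return bucket.pop()
    return None
```

Building the index costs one pass over all candidates up front, which matches the trade-off mentioned above: a longer start-up in exchange for cheaper searches.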

For all of the above, YMMV applies.