CostaLab / reg-gen

Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
https://reg-gen.readthedocs.io/

Refactor GenomicRegionSet IO handling #29

Open fabio-t opened 6 years ago

fabio-t commented 6 years ago

IO read/write functions should be separated from the GRS itself. This means extracting all the read_bed, write_bed, etc. functions and moving them into Format classes that take a GRS as input and either populate it from a file or write it to a file, as needed.
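A minimal sketch of that separation, for illustration only. `BedFormat`, its `read`/`write` methods, and the simplified `GenomicRegion`/`GenomicRegionSet` stand-ins below are hypothetical and are not the library's actual classes; only the first three BED columns are handled.

```python
import io
from dataclasses import dataclass, field
from typing import List, TextIO


@dataclass
class GenomicRegion:
    # Hypothetical minimal stand-in for the library's region class.
    chrom: str
    initial: int
    final: int


@dataclass
class GenomicRegionSet:
    # Hypothetical stand-in: holds regions only, with no file IO of its own.
    name: str
    sequences: List[GenomicRegion] = field(default_factory=list)

    def add(self, region: GenomicRegion) -> None:
        self.sequences.append(region)


class BedFormat:
    """Hypothetical Format class: knows BED syntax, not GRS internals."""

    def read(self, handle: TextIO, grs: GenomicRegionSet) -> GenomicRegionSet:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the usual BED header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            grs.add(GenomicRegion(chrom, int(start), int(end)))
        return grs

    def write(self, handle: TextIO, grs: GenomicRegionSet) -> None:
        for r in grs.sequences:
            handle.write(f"{r.chrom}\t{r.initial}\t{r.final}\n")


# Usage: populate a GRS from BED text, then write it back out.
grs = BedFormat().read(io.StringIO("chr1\t10\t20\nchr2\t5\t15\n"),
                       GenomicRegionSet("demo"))
out = io.StringIO()
BedFormat().write(out, grs)
```

With this shape, supporting another file type would mean adding another Format class rather than another method on the GRS.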

The following basic classes should be developed, at least:

To leave for later: improve the memory footprint of GenomicRegion so that a GRS can be much bigger. Also possibly replace the internal list with a proper array, to make removal O(1).
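One common way to shrink per-object memory in Python is `__slots__`, which removes the per-instance `__dict__`. The sketch below is only an assumption about how the footprint could be reduced; `RegionWithDict` and `RegionWithSlots` are hypothetical names, not the library's classes.

```python
class RegionWithDict:
    # Ordinary class: every instance carries its own __dict__.
    def __init__(self, chrom, initial, final):
        self.chrom = chrom
        self.initial = initial
        self.final = final


class RegionWithSlots:
    # __slots__ fixes the attribute set, so no per-instance __dict__ is created.
    __slots__ = ("chrom", "initial", "final")

    def __init__(self, chrom, initial, final):
        self.chrom = chrom
        self.initial = initial
        self.final = final


a = RegionWithDict("chr1", 10, 20)
b = RegionWithSlots("chr1", 10, 20)
```

The trade-off is that slotted instances cannot gain arbitrary new attributes at runtime.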

fabio-t commented 6 years ago

As @jovesus pointed out, when a GRS is filled from a BED file it is always sorted, and duplicate lines are always kept. Both behaviours were already there before this change, but I'm not sure they should stay like this.

In general, a GenomicRegionSet is not really a Set. It's a wrapper around a List, with List semantics. Just a little quirk.
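To make the quirk concrete, here is a small sketch contrasting the behaviour described in this comment (sorted on load, duplicates kept) with true set semantics. It uses plain tuples as a hypothetical stand-in for regions and does not call the library.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), a hypothetical stand-in

# Three BED-like lines, unsorted, with one exact duplicate.
lines = ["chr2\t5\t15", "chr1\t10\t20", "chr1\t10\t20"]


def parse(line: str) -> Region:
    chrom, start, end = line.split("\t")
    return (chrom, int(start), int(end))


# List semantics, as described above: sorted on load, duplicates kept.
as_list: List[Region] = sorted(parse(line) for line in lines)

# Set semantics, for comparison: exact duplicates collapse into one entry.
as_set = set(as_list)
```

Here `as_list` holds three entries and `as_set` holds two, which is the difference a user might not expect from something named a Set.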

fabio-t commented 6 years ago

The basic idea is done. BigBed is still not supported since we have to decide how to handle it. Various ways available: