Open fabio-t opened 6 years ago
As @jovesus pointed out, when a GRS is filled from a BED file it is always sorted. Also, duplicate lines are always kept. Both behaviours were already there before, but I'm not sure they should stay like this.
In general, a GenomicRegionSet is not really a Set. It's a wrapper around a List, with List semantics. Just a little quirk.
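To make the quirk concrete, here is a toy illustration using plain tuples. It is not the real GenomicRegionSet code, and the `(chrom, start, end)` layout is only an assumption for the example:

```python
# Toy stand-in for illustration only; this is NOT the real GenomicRegionSet.
regions = [
    ("chr2", 100, 200),
    ("chr1", 50, 80),
    ("chr2", 100, 200),  # duplicate line in the input file
]

# List semantics (the behaviour described above): sorted, duplicates kept.
as_list = sorted(regions)

# Set semantics (what the name suggests): duplicates collapse.
as_set = sorted(set(regions))

print(len(as_list), len(as_set))
```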
The basic idea is done. BigBed is still not supported since we have to decide how to handle it. Various ways available:
Simple conversion. I already have utility methods in motif analysis to convert from BED to BigBed and vice versa. Every application should know how to change the "score" field to make it fit the 0-1000 range, depending on the meaning that score has. This is simple but yields no advantage.
Make a GRSFileIO.BigBed. Instead of converting BED to BigBed and vice versa, this would read from and write to BigBed files directly. The advantage is that it forces us to stop using the Bed utilities (or write a Python wrapper), and it should be more efficient than creating temporary BED files.
Keep a BigBed behind the GRS. This is a significant change and I'm not sure it's worth it. We would gain a lot by improving the memory efficiency of GenomicRegion instead, rather than essentially writing a DB layer on top of the BigBeds.
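For the "simple conversion" option, the score rescaling could look roughly like the sketch below. The function name `rescale_scores` and the linear min-max mapping are assumptions made for illustration; as noted above, each application would have to choose a mapping that suits the meaning of its own score.

```python
def rescale_scores(scores, lo=None, hi=None):
    """Linearly map scores onto the integer range 0-1000 (hypothetical helper).

    lo/hi default to the observed min/max; values outside [lo, hi] are clipped.
    """
    scores = list(scores)
    if not scores:
        return []
    lo = min(scores) if lo is None else lo
    hi = max(scores) if hi is None else hi
    if hi == lo:
        # Degenerate range: nothing to scale, so fall back to 0 everywhere.
        return [0 for _ in scores]
    out = []
    for s in scores:
        clipped = min(max(s, lo), hi)
        out.append(int(round(1000 * (clipped - lo) / (hi - lo))))
    return out
```

For example, p-value-like scores would probably need a log transform before this step, which is exactly why the mapping cannot be decided once for all tools.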
IO read/write functions should be separated from the actual GRS. This will mean extracting all `read_bed`, `write_bed`, etc. functions and putting them into Format classes that take a GRS as input and either populate it or write it to file, as needed.
The following basic classes should be developed, at least:
BedFormat: BED is the current "default" for GRS. The two are strongly coupled, which makes it harder to export to different formats. This refactoring will solve that problem.
BigBedFormat: BigBed is currently supported only in some of the tools, in a "handcrafted" way. We need a more rational approach, especially to support further improvements such as a disk-backed GRS that does not load everything into memory. This would considerably reduce the memory footprint of certain tools (e.g. motif analysis).
Bed12Format: a more complicated "bed-like" format that is relevant, I believe, only for RGT-Viz.
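A minimal sketch of what the separation could look like, assuming a Format object exposes `read` and `write` methods that take a file handle and a region set. `RegionSet` is a hypothetical stand-in for GenomicRegionSet, and the method names and signatures are illustrative, not an agreed design:

```python
import io


class RegionSet:
    """Hypothetical stand-in for GenomicRegionSet: a thin wrapper around a list."""

    def __init__(self, name):
        self.name = name
        self.regions = []  # (chrom, start, end) tuples

    def add(self, region):
        self.regions.append(region)


class BedFormat:
    """Reads and writes 3-column BED; the region set itself knows nothing about files."""

    def read(self, handle, region_set):
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments and UCSC header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            region_set.add((chrom, int(start), int(end)))
        return region_set

    def write(self, handle, region_set):
        for chrom, start, end in region_set.regions:
            handle.write(f"{chrom}\t{start}\t{end}\n")


# Usage: round-trip through in-memory "files".
source = io.StringIO("chr1\t10\t20\nchr2\t5\t15\n")
grs = BedFormat().read(source, RegionSet("demo"))
sink = io.StringIO()
BedFormat().write(sink, grs)
```

With this shape, a BigBedFormat or Bed12Format would only need to provide the same two methods, and the region set would not change when a new format is added.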
To leave for later: improve memory footprint of GenomicRegion so that GRS can be much bigger.
Also possibly substitute the internal list for a proper array, to make removal O(1).