Signal data format support (e.g., BigWig)

DEIB-GECO / GMQL

GMQL - GenoMetric Query Language

http://www.bioinformatics.deib.polimi.it/geco/

Apache License 2.0


Signal data format support (e.g., BigWig) #56

Open marcomass opened 7 years ago

marcomass commented 7 years ago

Enable the use of signal data samples (e.g., BigWig) in some operands, e.g., as the second operand of MAP (or in COVER, to be discussed). A specific "special" version of the defined MAP operator would probably be a better fit.
Examples of BigWig files (from 0.6 to 1.5 GB) are available at https://www.encodeproject.org/experiments/ENCSR620VIC/

akaitoua commented 7 years ago

@marcomass, as I understand it, BigWig is a binary, indexed version of the Wiggle format, and Wiggle is a compressed, less accurate version of BedGraph. Why don't we use BedGraph and always convert BigWig and Wig files to BedGraph?

marcomass commented 7 years ago

@akaitoua You are right that BigWig can be converted to BedGraph (so BigWig is not less accurate than BedGraph). Yet BedGraph takes much more space than BigWig, so nobody uses it; everyone uses BigWig, as in the provided link.

In any case, this issue concerns two aspects:

1. the format of the input data (BigWig)
2. the efficiency of the MAP operation when signal data (BigWig or BedGraph) are used as the experiment dataset

You could postpone aspect 1 and check aspect 2 first (by converting BigWig to BedGraph beforehand); yet I think they are connected, since BigWig possibly allows direct access to specific portions of the file without scanning it entirely. This would make it possible to read only the portion of the BigWig file that corresponds to the reference regions in the reference dataset of the MAP, improving performance. Of course, this requires a new, specific MAP for experiment datasets of signal data, to process this kind of data differently from the BED ones.
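To make the "read only the relevant portion" idea concrete, here is a minimal Python sketch. It is not GMQL code and it does not read BigWig; it assumes the signal for one chromosome is already in memory as sorted, non-overlapping BedGraph-style intervals, and the function name `map_mean_signal` is hypothetical. Binary search stands in for the kind of index lookup that an indexed format could offer, and the aggregate (mean over covered bases) is just one possible choice.

```python
from bisect import bisect_left, bisect_right


def map_mean_signal(regions, signal):
    """For each reference region, average the signal over the bases it covers.

    regions: list of (start, end) half-open intervals on one chromosome.
    signal:  list of (start, end, value) BedGraph-style intervals on the same
             chromosome, sorted by start and non-overlapping (assumption).
    Returns one mean per region, or None where no signal overlaps.
    """
    starts = [s for s, _, _ in signal]
    ends = [e for _, e, _ in signal]
    results = []
    for r_start, r_end in regions:
        # Index of the first signal interval ending after the region start.
        lo = bisect_right(ends, r_start)
        # Index of the first signal interval starting at or after the region end.
        hi = bisect_left(starts, r_end)
        total = 0.0
        covered = 0
        # Only signal[lo:hi] overlaps the region; the rest is never touched.
        for s, e, v in signal[lo:hi]:
            overlap = min(e, r_end) - max(s, r_start)
            total += v * overlap
            covered += overlap
        results.append(total / covered if covered else None)
    return results


signal = [(0, 10, 1.0), (10, 20, 3.0), (30, 40, 5.0)]
regions = [(5, 15), (20, 30), (35, 50)]
print(map_mean_signal(regions, signal))
```

With few reference regions, each lookup touches only a small slice of the signal, which is the case where direct access would pay off.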

akaitoua commented 7 years ago

@marcomass, I checked, and here are more details. I suggest supporting only the BedGraph format, since it does not change our data model. Whenever we copy data into GMQL, we convert it to what we call GMQL_WIG: a BEDGraph in columnar form, stored as a binary format that GMQL can read and that is small in size compared to BEDGraph. Then we store GMQL_WIG in our repository.
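As a rough illustration of what "BEDGraph but in columnar format" could look like, the Python sketch below parses BedGraph text lines into one typed array per field. The actual GMQL_WIG layout is not described in this thread, so the function name `bedgraph_to_columns`, the choice of 64-bit integers and doubles, and the dictionary layout are all assumptions made for the example.

```python
from array import array


def bedgraph_to_columns(lines):
    """Parse BedGraph text lines into one column per field.

    Coordinates go into signed 64-bit integer arrays and values into a
    double array, so each numeric column can be dumped as raw bytes.
    """
    chroms = []
    starts = array("q")
    ends = array("q")
    values = array("d")
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header/comment lines BedGraph files may carry.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        chroms.append(chrom)
        starts.append(int(start))
        ends.append(int(end))
        values.append(float(value))
    return {"chrom": chroms, "start": starts, "end": ends, "value": values}


lines = [
    "track type=bedGraph",
    "chr1\t0\t10\t1.5",
    "chr1\t10\t20\t0.25",
]
cols = bedgraph_to_columns(lines)
print(cols["start"].tolist(), cols["value"].tolist())
# Each numeric column serializes to a fixed number of bytes per row.
print(len(cols["value"].tobytes()))
```

Keeping each field in its own array means a query that needs only coordinates, or only values, can read just that column.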

The reason not to use BIGWIG and WIG in GMQL is that we perform a different type of query than others in the field. We always perform a full join between the reference and the experiment (the set of regions in the reference is almost equal in size to the experiment sample). If we start supporting interval joins (which is like selecting a small portion of the BIGWIG file), then it would be better to change GMQL to use an index, which would be faster in that case.
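The full-join case can be sketched as a single forward pass over both sorted inputs, which is why an index brings little when the reference is about as large as the experiment. The Python below is only an illustration under stated assumptions, not how GMQL implements its join: both lists are sorted by start, the signal intervals do not overlap each other, and `map_count_merge` is a hypothetical name.

```python
def map_count_merge(regions, signal):
    """Count the signal intervals overlapping each region in one forward pass.

    regions: list of (start, end) half-open intervals, sorted by start.
    signal:  list of (start, end, value) intervals, sorted by start and
             non-overlapping (assumption), so their ends are sorted too.
    """
    counts = []
    j = 0
    n = len(signal)
    for r_start, r_end in regions:
        # Drop signal intervals that end before this region starts; later
        # regions start no earlier, so those intervals are never needed again.
        while j < n and signal[j][1] <= r_start:
            j += 1
        # Count from j while intervals still start inside the region.
        k = j
        count = 0
        while k < n and signal[k][0] < r_end:
            count += 1
            k += 1
        counts.append(count)
    return counts


signal = [(0, 10, 1.0), (10, 20, 3.0), (30, 40, 5.0)]
regions = [(5, 15), (20, 30), (35, 50)]
print(map_count_merge(regions, signal))
```

The pointer `j` never moves backwards, so the signal is consumed roughly once overall; an index would mainly help when only a few regions are queried against a large file.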

marcomass commented 7 years ago

@akaitoua OK. How can we change the format to GMQL_WIG when copying data into GMQL? What should the input format of this transformation be? BEDGraph?

Do you think that using BEDGraph (or GMQL_WIG) as the experiment dataset of a MAP, with genes as the reference regions in the reference dataset (thus about 25,000 for human), would be handled by the current system with reasonable performance?