DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Problems with Cover/Histogram #80

Closed: albmarch closed this issue 7 years ago

albmarch commented 7 years ago

Hi, I obtained wrong results with this query:

data = SELECT() data;
H = HISTOGRAM(1, ANY) data;
C = COVER(1, ANY) data;
MATERIALIZE H INTO H;
MATERIALIZE C INTO C;

[Screenshot: igb]

Observing the input data, I expect to obtain two regions from the COVER, while the results contain only one region.

Furthermore, the output of HISTOGRAM contains a coordinate (17800000) that is not present in the input data. Conversely, the stop coordinate of an input region (17798000) is missing from the result. Could the problem be caused by an "exchange" of values of the input coordinates?
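To illustrate why a coordinate such as 17800000 looks suspicious, here is a minimal Python sketch of an accumulation histogram computed with a sweep over region endpoints. It is not GMQL's implementation; the function name `histogram` and the toy regions are hypothetical, and the semantics (half-open regions, one output segment per stretch of constant depth) are assumed. Under these assumptions, every output breakpoint is a start or stop of some input region.

```python
from collections import defaultdict


def histogram(regions):
    """Return (start, stop, depth) segments for overlapping regions.

    Hypothetical helper: assumes half-open (start, stop) regions on one
    chromosome and strand. Adjacent segments with equal depth are not merged.
    """
    deltas = defaultdict(int)
    for start, stop in regions:
        deltas[start] += 1   # a region opens here
        deltas[stop] -= 1    # a region closes here

    segments = []
    depth = 0
    prev = None
    for pos in sorted(deltas):
        # Emit the stretch between the previous breakpoint and this one,
        # but only where at least one region covers it.
        if prev is not None and depth > 0:
            segments.append((prev, pos, depth))
        depth += deltas[pos]
        prev = pos
    return segments


# Toy regions, made up for illustration only (not the data from this issue).
toy = [(0, 10), (5, 15), (20, 30)]
result = histogram(toy)

# Every breakpoint in the output is a start or stop of some input region.
input_coords = {c for region in toy for c in region}
output_coords = {c for s, e, _ in result for c in (s, e)}
```

In this sketch `output_coords` is always a subset of `input_coords`, which is the property the reported output violates.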

The correct result of HISTOGRAM should be this (start, stop, count):

17797617 17797619 1
17797619 17797673 2
17797673 17797704 3
17797704 17797772 4
17797772 17797927 5
17797927 17798000 4
17798000 17798017 3
17798017 17798123 2
17798123 17798216 1
17799230 17799404 1
17799404 17799534 2
17799534 17800080 1
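As a cross-check of the expectation of two COVER regions, the expected histogram above can be collapsed by merging rows that touch each other. This is only a sketch: it assumes that COVER(1, ANY) corresponds to the union of contiguous stretches with accumulation of at least 1, and the helper name `merge_contiguous` is hypothetical.

```python
# Expected HISTOGRAM rows from this issue: (start, stop, count).
expected_histogram = [
    (17797617, 17797619, 1),
    (17797619, 17797673, 2),
    (17797673, 17797704, 3),
    (17797704, 17797772, 4),
    (17797772, 17797927, 5),
    (17797927, 17798000, 4),
    (17798000, 17798017, 3),
    (17798017, 17798123, 2),
    (17798123, 17798216, 1),
    (17799230, 17799404, 1),
    (17799404, 17799534, 2),
    (17799534, 17800080, 1),
]


def merge_contiguous(rows, min_acc=1):
    """Merge histogram rows with count >= min_acc that share a boundary.

    Assumed reading of COVER(min_acc, ANY); not taken from the GMQL source.
    """
    merged = []
    for start, stop, count in sorted(rows):
        if count < min_acc:
            continue
        if merged and merged[-1][1] == start:
            # This row begins exactly where the previous one ended: extend it.
            merged[-1] = (merged[-1][0], stop)
        else:
            merged.append((start, stop))
    return merged


expected_cover = merge_contiguous(expected_histogram)
print(expected_cover)
```

Under that assumption the merge yields two regions, 17797617-17798216 and 17799230-17800080, which matches the two regions expected from COVER.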

The input dataset is: data.zip

The outputs of the query are: job_test_cover_alberto_marchesi_20171013_163613_H.zip and job_test_cover_alberto_marchesi_20171013_163613_C.zip

albmarch commented 7 years ago

I am adding three new examples that show the same problem. In each case I find 1 region instead of 3, 4, or 5 regions, respectively.

[Screenshots: count_3, count_4, count_5]

The query is:

Count_3 = SELECT() test_cover_count_3;
Count_4 = SELECT() test_cover_count_4;
Count_5 = SELECT() test_cover_count_5_1;

Count_3_c = COVER(1,ANY; groupby: target) Count_3;
Count_4_c = COVER(1,ANY; groupby: target) Count_4;
Count_5_c = COVER(1,ANY; groupby: target) Count_5;

H_3 = HISTOGRAM(1,ANY) Count_3_c;
H_4 = HISTOGRAM(1,ANY) Count_4_c;
H_5 = HISTOGRAM(1,ANY) Count_5_c;

C_3 = COVER(1,ANY) Count_3_c;
C_4 = COVER(1,ANY) Count_4_c;
C_5 = COVER(1,ANY) Count_5_c;

MATERIALIZE H_3 INTO H_3;
MATERIALIZE H_4 INTO H_4;
MATERIALIZE H_5 INTO H_5;

MATERIALIZE C_3 INTO C_3;
MATERIALIZE C_4 INTO C_4;
MATERIALIZE C_5 INTO C_5;

Input datasets: test_cover_count_3.zip, test_cover_count_4.zip, test_cover_count_5_1.zip

Histogram outputs: job_test_cover_guest_new1161_20171028_090109_H_3.zip, job_test_cover_guest_new1161_20171028_090109_H_4.zip, job_test_cover_guest_new1161_20171028_090109_H_5.zip

Cover outputs: job_test_cover_guest_new1161_20171028_090109_C_3.zip, job_test_cover_guest_new1161_20171028_090109_C_4.zip, job_test_cover_guest_new1161_20171028_090109_C_5.zip

akaitoua commented 7 years ago

The problem is partially fixed. There will still be a marginal error of "one base" for every region that starts or stops on a bin border (1K border).

marcomass commented 7 years ago

@akaitoua Thank you for looking into this. What do you mean when you say there will still be an error of "one base"? When a region starts or stops on a bin border, do you expect the obtained region to have one base more, or one base less?

akaitoua commented 7 years ago

@marcomass, because of a technical issue in the binning algorithm, I had to fix this issue at the cost of the slight error mentioned above. This means that the missing region (17798000 17798017 3) will be shown as (17798001 17798017 3).
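To see which rows of the expected histogram from the first comment could be touched by this residual error, one can check each coordinate against the bin width. The sketch below assumes that "1K border" means bins of 1000 bases aligned at multiples of 1000; `BIN_SIZE` and `on_bin_border` are hypothetical names and do not come from the GMQL code base. It only flags candidate rows and does not reproduce the binning algorithm itself.

```python
BIN_SIZE = 1000  # assumption: "1K border" means bins aligned at multiples of 1000

# Expected HISTOGRAM rows from the first comment: (start, stop, count).
expected_histogram = [
    (17797617, 17797619, 1),
    (17797619, 17797673, 2),
    (17797673, 17797704, 3),
    (17797704, 17797772, 4),
    (17797772, 17797927, 5),
    (17797927, 17798000, 4),
    (17798000, 17798017, 3),
    (17798017, 17798123, 2),
    (17798123, 17798216, 1),
    (17799230, 17799404, 1),
    (17799404, 17799534, 2),
    (17799534, 17800080, 1),
]


def on_bin_border(coord, bin_size=BIN_SIZE):
    """True if the coordinate falls exactly on a bin boundary."""
    return coord % bin_size == 0


# Rows whose start or stop sits exactly on a bin border.
affected = [
    row for row in expected_histogram
    if on_bin_border(row[0]) or on_bin_border(row[1])
]

for start, stop, count in affected:
    print(start, stop, count)
```

Under this assumption only the two rows that meet at 17798000 are flagged, including the region (17798000 17798017 3) mentioned above.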