DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

DIFFERENCE fixes #8

Closed marcomass closed 7 years ago

marcomass commented 7 years ago

DIFFERENCE:

Please fix the two issues above and use the query provided below to check them.

TEAD4_rep_broad = SELECT(dataType == 'ChipSeq' AND view == 'Peaks' AND setType == 'exp' AND antibody_target == 'TEAD4' AND (cell == 'ECC-1')) HG19_ENCODE_BROAD;
MATERIALIZE TEAD4_rep_broad INTO TEAD4_rep_broad;

HM_TF_rep_broad = SELECT(dataType == 'ChipSeq' AND view == 'Peaks' AND setType == 'exp'; semijoin: cell IN TEAD4_rep_broad) HG19_ENCODE_BROAD;
MATERIALIZE HM_TF_rep_broad INTO HM_TF_rep_broad;

HM_TF_0 = DIFFERENCE() TEAD4_rep_broad HM_TF_rep_broad;
MATERIALIZE HM_TF_0 INTO HM_TF_0;

HM_TF_0_joinby = DIFFERENCE(JOINBY: cell, antibody_target) TEAD4_rep_broad HM_TF_rep_broad;
MATERIALIZE HM_TF_0_joinby INTO HM_TF_0_joinby;

Since the left input dataset of DIFFERENCE includes 2 samples with the same cell and antibody_target, the number of output samples in HM_TF_0 and in HM_TF_0_joinby should be the same.

Furthermore, since the right input dataset here includes the samples of the left input dataset, the output should include no regions, i.e., it should be empty.
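The two expectations above can be illustrated with a minimal Python sketch. It is not the GMQL implementation: the `Sample` class, the `difference` function, the half-open overlap test, the single-valued metadata, and the choice to keep left samples that end up with no regions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for a GMQL sample: metadata plus regions."""
    meta: dict      # attribute -> single value (a simplification)
    regions: list   # list of (chrom, left, right) tuples


def overlaps(a, b):
    # Assumed half-open intervals on the same chromosome.
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def difference(left_ds, right_ds, joinby=()):
    """Sketch of DIFFERENCE: one output sample per left sample."""
    out = []
    for ls in left_ds:
        # Right samples contribute to the mask only if they agree with the
        # left sample on every joinby attribute (all of them if joinby is empty).
        mask = [
            r
            for rs in right_ds
            if all(ls.meta.get(k) == rs.meta.get(k) for k in joinby)
            for r in rs.regions
        ]
        kept = [r for r in ls.regions if not any(overlaps(r, m) for m in mask)]
        out.append(Sample(meta=dict(ls.meta), regions=kept))
    return out


# Two left samples sharing cell and antibody_target, as in the query above.
rep1 = Sample({"cell": "ECC-1", "antibody_target": "TEAD4"}, [("chr1", 100, 200)])
rep2 = Sample({"cell": "ECC-1", "antibody_target": "TEAD4"}, [("chr1", 500, 900)])
other = Sample({"cell": "ECC-1", "antibody_target": "CTCF"}, [("chr2", 10, 50)])

left_ds = [rep1, rep2]
right_ds = [rep1, rep2, other]  # the right dataset contains the left samples

plain = difference(left_ds, right_ds)
joined = difference(left_ds, right_ds, joinby=("cell", "antibody_target"))
```

Under these assumptions, `plain` and `joined` contain the same number of samples, and no sample in either result has any region left, because every left region overlaps its own copy in the right dataset.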

@akaitoua

marcomass commented 7 years ago

Please check the current DIFFERENCE implementation, since it always returns an empty dataset.

akaitoua commented 7 years ago

@marcomass The DIFFERENCE problem is fixed now. The problem was in the IDs of the metadata. The number of samples in the output will now be equal to or less than the number of samples in the reference dataset.

Note: The metadata of all the samples of the experiment dataset is added to every sample of the resulting reference dataset.

marcomass commented 7 years ago

@akaitoua Unfortunately, I need to reopen this issue. The operator now seems to work fine, but the output dataset must include only the metadata of the left input dataset, i.e., each output sample should have exactly the same metadata as the corresponding sample in the left input dataset. You can use the following test query:

test = SELECT() testDiff;
test_2 = PROJECT(region_update: length AS right-left) test;
MATERIALIZE test_2 INTO test_2;
mask = SELECT(region: length > 1000) test_2;
MATERIALIZE mask INTO mask;
test_filt_exact = DIFFERENCE(exact: true) test_2 mask;
test_filt = DIFFERENCE() test_2 mask;
MATERIALIZE test_filt_exact INTO test_filt_exact;
MATERIALIZE test_filt INTO test_filt;
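What the test query is expected to produce can be sketched in Python as follows. This is only an illustration, not GMQL code: the tuple-based sample layout, the `difference` helper, and the reading of `exact: true` as "remove only regions whose coordinates are identical to a mask region" are assumptions.

```python
def length(region):
    # Mirrors the PROJECT step: length AS right - left.
    chrom, left, right = region
    return right - left


def overlaps(a, b):
    # Assumed half-open intervals on the same chromosome.
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def difference(left_sample, mask_regions, exact=False):
    """Return (metadata, regions) for one left sample after DIFFERENCE."""
    meta, regions = left_sample
    if exact:
        # Assumed meaning of exact: drop only identical coordinates.
        banned = set(mask_regions)
        kept = [r for r in regions if r not in banned]
    else:
        kept = [r for r in regions if not any(overlaps(r, m) for m in mask_regions)]
    # The requested behaviour: output metadata is a copy of the left
    # sample's metadata only; nothing from the mask is merged in.
    return (dict(meta), kept)


# Hypothetical contents of one testDiff sample.
sample = (
    {"cell": "ECC-1"},
    [("chr1", 0, 500), ("chr1", 400, 2000), ("chr2", 0, 3000)],
)
mask = [r for r in sample[1] if length(r) > 1000]  # SELECT(region: length > 1000)

exact_out = difference(sample, mask, exact=True)
plain_out = difference(sample, mask)
```

With this toy data, the exact variant keeps only the short region `("chr1", 0, 500)`, while the default variant also drops it because it overlaps a long region. In both cases the output metadata equals the left sample's metadata.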