DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

SemiJoin through LEFT_DISTINCT and RIGHT_DISTINCT and BOTH #53

Closed marcomass closed 6 years ago

marcomass commented 7 years ago

Define and implement SEMIJOIN. In V1, the project_*_distinct modifier could be combined with the left / right modifiers to eliminate the artifact regions that JOIN generates when left or right output is used as a semijoin. These modifiers are not available in V2, so artifact regions cannot be eliminated (the MAP + SELECT workaround works only when distance < 0).
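For illustration, here is a minimal Python sketch of the difference between a plain left-output JOIN and the requested semijoin. It is not GMQL code: the `Region` type, the `distance` helper, and both join functions are hypothetical names, strand handling is ignored, and it assumes that the "artifact regions" are the repeated copies of a left region emitted once per matching right region.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    stop: int
    strand: str


def distance(a: Region, b: Region) -> int:
    """Assumed genometric distance: negative when the two regions overlap."""
    return max(a.start, b.start) - min(a.stop, b.stop)


def matches(a: Region, b: Region, max_dist: int) -> bool:
    """True when both regions are on the same chromosome and closer than max_dist."""
    return a.chrom == b.chrom and distance(a, b) < max_dist


def join_left(left, right, max_dist=0):
    """Assumed 'output: LEFT' behaviour: one copy of the left region per match."""
    return [l for l in left for r in right if matches(l, r, max_dist)]


def join_left_distinct(left, right, max_dist=0):
    """Semijoin: each left region with at least one match is emitted once."""
    return [l for l in left if any(matches(l, r, max_dist) for r in right)]


promoters = [Region("chr1", 100, 500, "+"), Region("chr1", 900, 1200, "+")]
tss = [Region("chr1", 150, 151, "+"), Region("chr1", 300, 301, "+")]

left_out = join_left(promoters, tss)               # first promoter appears twice
distinct_out = join_left_distinct(promoters, tss)  # first promoter appears once
print(len(left_out), len(distinct_out))
```

With `max_dist=0` the condition corresponds to distance < 0, that is, overlapping regions only.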

akaitoua commented 7 years ago

@marcomass, I need an example and data if possible.

akaitoua commented 7 years ago

Implemented: LEFT_DISTINCT and RIGHT_DISTINCT.

marcomass commented 7 years ago

Unfortunately I need to reopen this issue, since several aspects of LEFT_DISTINCT and RIGHT_DISTINCT still need fixing. @pp86

@akaitoua

You can use the following query for testing:

PROM = SELECT(annotation_type == "promoter") HG19_BED_ANNOTATION;
TSS = SELECT(annotation_type == "TSS") HG19_BED_ANNOTATION;

PROM_TSS = JOIN(DL(0); output: LEFT) PROM TSS;
MATERIALIZE PROM_TSS INTO PROM_TSS;
TSS_PROM = JOIN(DL(0); output: RIGHT) PROM TSS;
MATERIALIZE TSS_PROM INTO TSS_PROM;

PROM_TSSd = JOIN(DL(0); output: LEFT_DISTINCT) PROM TSS;
MATERIALIZE PROM_TSSd INTO PROM_TSSd;
TSS_PROMd = JOIN(DL(0); output: RIGHT_DISTINCT) PROM TSS;
MATERIALIZE TSS_PROMd INTO TSS_PROMd;

PROM_TSSboth = JOIN(DL(0); output: BOTH) PROM TSS;
MATERIALIZE PROM_TSSboth INTO PROM_TSSboth;

akaitoua commented 7 years ago

@marcomass, I added this functionality: when LEFT_DISTINCT is selected the output keeps the left metadata, and when RIGHT_DISTINCT is selected it keeps the right metadata.

marcomass commented 7 years ago

@akaitoua Please also remove the prefixes from the output metadata in the case of RIGHT_DISTINCT and LEFT_DISTINCT (the output metadata should be equal to those of the RIGHT/LEFT input dataset sample, without prefixes).

@pp86 Please fix output schema in case of LEFT/RIGHT_DISTINCT or BOTH output option, as specified in the above comment

akaitoua commented 7 years ago

Removed the prefixes from the output metadata in the case of RIGHT_DISTINCT and LEFT_DISTINCT.
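The metadata rule discussed above can be sketched in Python as follows. This is only an illustration, not the GMQL implementation: the function name, the option strings, and the `left.` / `right.` prefixes are placeholders, and the prefixed merge used for the other output options is an assumption about how JOIN metadata is normally labelled.

```python
def join_metadata(left_meta, right_meta, output="BOTH",
                  left_prefix="left.", right_prefix="right."):
    """Return the metadata of an output sample for a given output option."""
    if output == "LEFT_DISTINCT":
        # Same metadata as the left input sample, no prefixes.
        return dict(left_meta)
    if output == "RIGHT_DISTINCT":
        # Same metadata as the right input sample, no prefixes.
        return dict(right_meta)
    # Assumed behaviour for the other options: both sides, with prefixes.
    merged = {left_prefix + key: value for key, value in left_meta.items()}
    merged.update({right_prefix + key: value for key, value in right_meta.items()})
    return merged


prom_meta = {"annotation_type": "promoter"}
tss_meta = {"annotation_type": "TSS"}

print(join_metadata(prom_meta, tss_meta, output="LEFT_DISTINCT"))
print(join_metadata(prom_meta, tss_meta, output="BOTH"))
```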

pp86 commented 7 years ago

@marcomass @akaitoua

With BOTH, in which position are the right dataset coordinates copied: prepended or appended? Is the chr copied as well?

akaitoua commented 7 years ago

@pp86 This is the new schema of the values:

RefArray ++ Array[GValue](expChrom, expStart, expStop, expStrand) ++ expArray
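Read literally, that expression puts the reference (left) values first, then the experiment (right) chromosome, start, stop and strand, then the experiment values. A small Python sketch of the same layout, with a hypothetical function name and made-up sample values:

```python
def both_values(ref_values, exp_region, exp_values):
    """Mirror RefArray ++ Array(expChrom, expStart, expStop, expStrand) ++ expArray."""
    exp_chrom, exp_start, exp_stop, exp_strand = exp_region
    return list(ref_values) + [exp_chrom, exp_start, exp_stop, exp_strand] + list(exp_values)


# Illustrative values only: one left attribute, one right region, one right attribute.
row = both_values(["promoter"], ("chr1", 150, 151, "+"), ["TSS"])
print(row)
```

Under this reading, the right coordinates (chr included) sit after the left values and before the right values.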