marcomass closed this issue 7 years ago
@marcomass, I need an example and data if possible.
Implemented: LEFT_DISTINCT and RIGHT_DISTINCT.
Unfortunately I need to reopen this issue, since several aspects of LEFT_DISTINCT and RIGHT_DISTINCT still need fixing @pp86:
The output schema when LEFT_DISTINCT or RIGHT_DISTINCT is used must be only the schema of the LEFT or RIGHT input dataset, respectively (currently it is a composition of the two, as with LEFT and RIGHT).
The output schema when BOTH is used must be consistent with the content of the generated data sample (currently it is the same as when LEFT is used, whereas it must also include the RIGHT input dataset coordinates).
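The two schema requirements above can be sketched as a small lookup, purely as an illustration and not as the GMQL implementation. The function name `output_schema`, the `RIGHT_COORDS` attribute names, and the placement of the right coordinates between the two attribute lists are all assumptions made for this sketch.

```python
# Hypothetical attribute names for the right-hand coordinates added by BOTH.
RIGHT_COORDS = ["right_chr", "right_start", "right_stop", "right_strand"]


def output_schema(option, left_schema, right_schema):
    """Return the region attribute names expected for a JOIN output option."""
    if option in ("LEFT", "RIGHT"):
        # Composition of the two input schemas.
        return list(left_schema) + list(right_schema)
    if option == "LEFT_DISTINCT":
        # Only the schema of the LEFT input dataset.
        return list(left_schema)
    if option == "RIGHT_DISTINCT":
        # Only the schema of the RIGHT input dataset.
        return list(right_schema)
    if option == "BOTH":
        # As LEFT, plus the RIGHT coordinates (position is an assumption).
        return list(left_schema) + RIGHT_COORDS + list(right_schema)
    raise ValueError(f"unsupported output option: {option}")


left = ["name", "score"]
right = ["gene"]
schemas = {opt: output_schema(opt, left, right)
           for opt in ("LEFT", "LEFT_DISTINCT", "RIGHT_DISTINCT", "BOTH")}
```

A check of this kind could be run against the materialized schema of each test dataset to confirm that it matches the content of the samples.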
@akaitoua
You can use the following query for testing:
PROM = SELECT(annotation_type == "promoter") HG19_BED_ANNOTATION;
TSS = SELECT(annotation_type == "TSS") HG19_BED_ANNOTATION;
PROM_TSS = JOIN(DL(0); output: LEFT) PROM TSS;
MATERIALIZE PROM_TSS INTO PROM_TSS;
TSS_PROM = JOIN(DL(0); output: RIGHT) PROM TSS;
MATERIALIZE TSS_PROM INTO TSS_PROM;
PROM_TSSd = JOIN(DL(0); output: LEFT_DISTINCT) PROM TSS;
MATERIALIZE PROM_TSSd INTO PROM_TSSd;
TSS_PROMd = JOIN(DL(0); output: RIGHT_DISTINCT) PROM TSS;
MATERIALIZE TSS_PROMd INTO TSS_PROMd;
PROM_TSSboth = JOIN(DL(0); output: BOTH) PROM TSS;
MATERIALIZE PROM_TSSboth INTO PROM_TSSboth;
@marcomass, I added the functionality to keep only the left metadata when LEFT_DISTINCT is selected; for RIGHT_DISTINCT we keep only the right metadata.
@akaitoua Please also remove the prefixes from the output metadata for RIGHT_DISTINCT and LEFT_DISTINCT (the output metadata should be identical to those of the RIGHT/LEFT input dataset sample, without prefixes).
@pp86 Please fix the output schema for the LEFT/RIGHT_DISTINCT and BOTH output options, as specified in the comment above.
Removed the prefixes from the output metadata for RIGHT_DISTINCT and LEFT_DISTINCT.
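As a rough illustration of the metadata behaviour discussed above (not the actual GMQL code), the sketch below keeps only the metadata of one side and strips its prefix. It assumes that joined metadata is a list of (attribute, value) pairs and that each side's attributes carry a prefix such as "left." or "right."; both the representation and the prefix strings are hypothetical.

```python
def distinct_metadata(joined_meta, side_prefix):
    """Keep only attributes of one side and drop that side's prefix."""
    result = []
    for name, value in joined_meta:
        if name.startswith(side_prefix):
            # Strip the prefix so the name matches the input sample's metadata.
            result.append((name[len(side_prefix):], value))
    return result


joined = [
    ("left.annotation_type", "promoter"),
    ("left.provider", "UCSC"),
    ("right.annotation_type", "TSS"),
]
left_only = distinct_metadata(joined, "left.")
right_only = distinct_metadata(joined, "right.")
```

With this input, `left_only` holds the two promoter attributes and `right_only` holds the single TSS attribute, each under its original, unprefixed name.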
@marcomass @akaitoua
With BOTH, in which position are the right dataset coordinates copied: prepended or appended? Is the chr copied as well?
@pp86 This is the new schema of the values: RefArray ++ Array[GValue](expChrom, expStart, expStop, expStrand) ++ expArray
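The value layout given above can be mirrored in a few lines of Python, as a sketch only. The function name `both_values` and the sample values are hypothetical; the ordering follows the expression in the comment (reference values, then the four experiment coordinates, then the experiment values).

```python
def both_values(ref_values, exp_chrom, exp_start, exp_stop, exp_strand, exp_values):
    """Concatenate values in the order: reference, experiment coordinates, experiment."""
    return list(ref_values) + [exp_chrom, exp_start, exp_stop, exp_strand] + list(exp_values)


# Hypothetical sample: one reference value and two experiment values.
row = both_values(["promoter"], "chr1", 1000, 1200, "+", ["TSS", 0.9])
```

Under this layout the right coordinates sit between the two value arrays, so they are neither prepended nor appended to the whole row, and the chromosome is copied along with start, stop, and strand.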
Define and implement SEMIJOIN. [In V1, the modifier project_*_distinct could be combined with the left / right modifiers to eliminate the artifact regions that JOIN generates when it is used with left or right as a semijoin. These modifiers are not available in V2, where artifact regions cannot be eliminated (the MAP + SELECT workaround works only when distance < 0).]
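To make the request concrete, here is a minimal sketch of the difference between a left-output join and a semijoin. It is not GMQL code and rests on several assumptions: regions are (chr, start, stop) tuples, the distance between two regions on the same chromosome is max(starts) - min(stops) (negative when they overlap), the predicate is "distance strictly less than a threshold", and "artifact regions" is read here as the repeated copies of a left region, one per matching right region.

```python
def distance(a, b):
    """Gap between two regions on the same chromosome; negative if they overlap."""
    return max(a[1], b[1]) - min(a[2], b[2])


def matches(a, b, max_dist):
    """True if both regions share a chromosome and lie closer than max_dist."""
    return a[0] == b[0] and distance(a, b) < max_dist


def join_left(left, right, max_dist):
    """Emit the left region once per matching right region (duplicates possible)."""
    return [l for l in left for r in right if matches(l, r, max_dist)]


def semijoin(left, right, max_dist):
    """Emit each left region at most once if any right region matches."""
    return [l for l in left if any(matches(l, r, max_dist) for r in right)]


left = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
right = [("chr1", 150, 250), ("chr1", 180, 190), ("chr2", 300, 400)]

joined = join_left(left, right, 0)    # first left region appears twice
semi = semijoin(left, right, 0)       # first left region appears once
semi_far = semijoin(left, right, 101)
```

In this toy data the first left region overlaps two right regions, so `join_left` returns it twice while `semijoin` returns it once. Unlike a workaround limited to overlapping regions, the same function also works with a positive threshold, as `semi_far` shows.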