DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0
18 stars 11 forks source link

SELECT behavior #97

Open pp86 opened 6 years ago

pp86 commented 6 years ago

Please consider this query:


DATA_SET_VAR = SELECT(provider == "Campaner") HG19_BED_ANNOTATION;

MATERIALIZE DATA_SET_VAR into test_select;

It returns two samples. Btw, no sample with provider = "Campaner" exists. The two output samples have the metadata original_provider = "Campaner".

What happens is that the SELECT checks is the attribute name ends with "provider".

I remember we decided to have a SQL-like approach, i.e. the names are splitted on "." (similar to the "mariage"/"age" example we discussed). So the results would be correct if the metadata was "original.provider", but not "original_provider".

Is anything changed in the specification, or should this be fixed? @marcomass @sunbrn

sunbrn commented 6 years ago

According to what is written in the documentation, in the SELECT predicate on metadata only metadata exactly equal to what is specified should be matched, e.g., only metadata which are exactly "provider" (without prefix of any kind).

I think the matter with the separator (".") is only relevant when we talk about the default/EXACT/FULL options for semijoin/joinby_/_groupby predicates. I mean that, when using SELECT, we should concern about the separator if we are writing something like SELECT(...; semijoin: FULL(provider) IN DS)

Therefore, @pp86 I would fix the metadata predicate to only match the indicated metadata. @marcomass could you please confirm? Am I missing something?

marcomass commented 6 years ago

I confirm; it must be fixed. Thanks @pp86 for spotting it

marcomass commented 6 years ago

Can this be fixed by @OlgaGorlova ?

pp86 commented 6 years ago

In the case we have a sample with: original.provider = "Campaner" should the query provider == "Campaner" select the sample or not?

According to @sunbrn it should not, right?

@marcomass @OlgaGorlova I am taking care of it.

marcomass commented 6 years ago

The definition in the doc is the following: • metadata_attribute_name: it matches all attributes that are equal to OR end with the specified name; • EXACT(metadata_attribute_name): it matches all attributes that are equal to the specified name (without any prefixes); • FULL(metadata_attribute_name): it matches two attributes if they end with the specified name AND their full names are equal.

So, the query original.provider = "Campaner" should match only attributes original.provider, or *.original.provider not just provider

sunbrn commented 6 years ago

@marcomass but these three options should only be defined for the semijoin predicate of the SELECT operator. @pp86's example is concerning the metadata predicate of the SELECT operator. So, if I understand correctly the documentation, in the case he is mentioning, in which we have a sample with: original.provider = "Campaner" the query provider == "Campaner" should not select it. Or are you saying that the def/EXACT/FULL options should be available also in this part?

Erlaad commented 6 years ago

I do believe that the issue at hand is more simply resolved by changing the way that "by metadata_attribute_name" selection (point 1) works in the above definition by @marcomass.

In particular, checking that the metadata value ends in a certain way poses ambiguity cases such as the one found by @pp86 (cf. example below). What was probably intended is that the last token of the metadata name after splitting according to the '.' separator should match the metadata attribute mentioned in the SELECT clause.

Some examples to clarify. Let the following SELECT be of interest: SELECT(data == 'NICE') SOME_DS (1)

Let S1 have the following metadata: data NICE and S2 have instead wrong_data NICE finally, let S3 have OLD.data NICE for S1, S2 and S3 samples in SOME_DS.

If I use the "ending with" definition as above, all samples will be valid (as the criteria is equivalent to looking for strings in the form "data", where is a wildcard). Instead, if I tokenize metadata with the '.' separator and match only the last element, only S1 and S3 will be chosen.

Note that it is entirely possible that a sample S4 having the reallystupid.data NICE metadata would be selected by (1), so some ambiguity would remain. Perhaps a requirement could be made that dataset do not contain metadata with '.', another separator could be defined or a small subroutine could be added to the uploader that changes all '.' to some other separator.

EDIT: some confusion between dataset and sample

marcomass commented 6 years ago

Thanks @sunbrn for the clarification and to @Erlaad for the examples.

We introduced the: • metadata_attribute_name (default) • EXACT(metadata_attribute_name) • FULL(metadata_attribute_name) to avoid the issue that who write the query should know how previous GMQL operator in the query change the metadata attribute names with prefixing (as some operators do use prefixes with "." separator).
So, already now the option "metadata_attribute_name (default)" is to be intended with regards to prefixes of metadata: attribute_name (i.e., metadata_attribute_name: it matches all attributes that are equal to OR end with the specified name (regardless additional metadata_attribute_name prefixes not explicitly specified)). If it is not working in this way, it must be fixed in this way.
Thus, in @Erlaad example only S1 and S3 must be chosen. Please note that the "LIKE" option, as described in issue #55 is not (yet) implemented.

As @sunbrn clarified, the three options: • metadata_attribute_name (default) • EXACT(metadata_attribute_name) • FULL(metadata_attribute_name) have been defined only for the semijoin predicate of the SELECT operator (and joinby and groupby predicates of the other GMQL operators). This is because we defined such options in the view I described above, when we considered SELECT to be generally the initial statement of a GMQL query, so considering cases where the metadata attribute names have no "." separated prefixes. Yet, SELECT can be for sure useful also applied within a GMQL query, particularly after a MAP operation (which introduces "." separated prefixes). So, to avoid the issue mentioned above, I think it is better to extend the application of these 3 options also to the main predicate of the SELECT, not only to its semijoin part.

Please note that in doing so, the issue pointed out by @Erlaad would not hold anymore, since in the provided example case it would be possible to disambiguate between S1 and S3 using the one of the three options (default, EXACT, FULL), according to the desired behavior.

In summary, according to the issue pointed out initially in this thread by @pp86, SELECT behavior should be fixed as here above mentioned.

Please comment if you see anything unclear/wrong in the above.

pp86 commented 6 years ago

I can do that, but keep in mind that this is a major change (Compiler, Dag, Executor, API, Python, R have to be touched). This will require a deep re-testing of the SELECT .

Since in select we can use AND, OR, NOT to compose complex conditions, EXACT and FULL my not be strictly necessary (although surely helpful).

Just fixing the behaviour to split on "." should be much easier (only the implementation will be touched). I would go with this lighter option if you all agree.

marcomass commented 6 years ago

Yes, the fix of splitting on "." must be done for sure. It must be fixed in a way to act as the default option • metadata_attribute_name: it matches all attributes that are equal to OR end with the specified name; i.e.: in the case we have a sample with: DS1.provider = "Campaner" the query provider == "Campaner" should select the sample, which is how it is working now with regards to the application of a SELECT after a MAP. Please note that this would not avoid current ambiguities without implementing the use also of EXACT and FULL in main SELECT predicate.

I understand the other change (introduction also of EXACT and FULL in metadata main predicate) requires more effort. My opinion is that it would be much better to do it now when we are closing the API implementation and contemporary we are also actively working on closing RGMQL and fixing bugs on pyGMQL. Postponing it would then require much more effort to retaken also the other projects (RGMQL, pyGMQL) once they will be closed. Anyway, I agree this is somehow a lower priority aspect with respect to other issues currently still open, still keeping in mind the ambiguity here above mentioned that it leaves.

pp86 commented 6 years ago

The problem of attribute name splitting has been solved by my last commit.

I still have to work on the EXACT/FULL options.

pp86 commented 6 years ago

I started thinking on EXACT/FULL. I think that in the select we can only have the FULL option. The exact makes no sense, isn't it?

marcomass commented 6 years ago

@pp86 Why do you think so? Can you make examples? Usually EXACT is useful to consider attributes that in the matadata do not have prefixes, discriminating them from other attributes with same name, but different prefixes. For instance this can occur when the user creates attributes during the processing (extend, project) with same name (e.g. regionNum), but in different points of the query flow.

pp86 commented 6 years ago

I am missing the difference between the two. Should the result of FULL(antibody) be different form the result of: EXACT(antibody) ?

marcomass commented 6 years ago

I copy here below the current definition in the manual for metajoin conditions (we will extend it also to select, when implemented).

● In a metajoin condition (i.e., for all operators that include such condition: SELECT has semijoin, DIFFERENCE, MAP, and JOIN have joinby, GROUP, MERGE, and COVER have groupby), different matching options can be used: ○ metadata_attribute_name: it matches all attributes that are equal to OR end with the dot-separated suffix specified name (regardless additional metadata_attribute_name dot-separated prefixes not explicitly specified); ○ EXACT(metadata_attribute_name): it matches all attributes that are equal to the specified name (without any prefixes); ○ FULL(metadata_attribute_name): it matches two attributes if they end with the specified name AND their full names are equal; For instance, if we consider the following attributes:

  1. pref1.pref2.att
  2. pref1.att
  3. att
  4. pref1.att Then: • att matches all of the above attributes; • EXACT(att) matches only attribute 3. (i.e., att); • FULL(att) matches attributes 2. and 4. (i.e., pref1.att).
Erlaad commented 6 years ago

Just to understand: let

I’d like to understand a bit more of the example. Let D be the following dataset:

Sample_ID attribute value S1 att 1 S2 other_ds.att 1 S3 other_ds.yet_another_ds.att 1 S4 other_ds.att 2 S5 att 1 S6 other_ds.att 1 S7 other_ds.yet_another_ds.att 1 S8 other_ds.yet_another_ds.att 2

then, using GROUPBY as an example, the groups formed are as follows:

Is this correct? What happens to leftover samples that are not grouped in EXACT(att)? Are samples in subgroups of FULL(att) then also split in their own groups by attribute, like in my example?

Regards, Stefano Perna, MSc PhD Student Dipartimento di Elettronica, Informazione e Bioingegneria - Politecnico di Milano Skype Contact: stefano.e.perna

On 10 Jan 2018, at 15:22, marcomass notifications@github.com<mailto:notifications@github.com> wrote:

I copy here below the current definition in the manual for metajoin conditions (we will extend it also to select, when implemented).

● In a metajoin condition (i.e., for all operators that include such condition: SELECT has semijoin, DIFFERENCE, MAP, and JOIN have joinby, GROUP, MERGE, and COVER have groupby), different matching options can be used: ○ metadata_attribute_name: it matches all attributes that are equal to OR end with the dot-separated suffix specified name (regardless additional metadata_attribute_name dot-separated prefixes not explicitly specified); ○ EXACT(metadata_attribute_name): it matches all attributes that are equal to the specified name (without any prefixes); ○ FULL(metadata_attribute_name): it matches two attributes if they end with the specified name AND their full names are equal; For instance, if we consider the following attributes:

  1. pref1.pref2.att
  2. pref1.att
  3. att
  4. pref1.att Then: • att matches all of the above attributes; • EXACT(att) matches only attribute 3. (i.e., att); • FULL(att) matches attributes 2. and 4. (i.e., pref1.att).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/DEIB-GECO/GMQL/issues/97#issuecomment-356616338, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AQfxmxuRhxt8jBGlGk6ERsxG8rNu7HaRks5tJMe9gaJpZM4RDnnM.

pp86 commented 6 years ago

You did not understand my point. Definition and example describe metajoin or groupby, where we compare two attribute names in order to tell if they are in the same group. Here we are comparing 1 attribute name and 1 fixed string, so (contrary to what I said previously) only EXACT may help. Am I right?

If not please provide a counter example

sunbrn commented 6 years ago

I think @pp86 is correct. So, if I understand correctly, I think we should be able to handle these cases:

  1. using the default option SELECT =="xxxx" you are matching all samples which have the keys some_prefixes. or just ;
  2. using the option SELECT EXACT(=="xxxx") you are matching all sample which have key ;
  3. using the option SELECT EXACT(some_prefixes.) you are matching all sample which have keys some_prefixes..

Therefore default and EXACT should be enough to handle the requested cases. I see no other case that a potential "FULL" option could cover.

marcomass commented 6 years ago

@sunbrn I agree