DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Manage Null values in numeric region attributes. #14

Closed: marcomass closed this issue 7 years ago

marcomass commented 7 years ago

Manage presence of null values in numeric region attributes.

o Region fields (both string and numeric) may be NULL. NULL values are not considered by aggregate or tuple functions; Boolean predicates on NULL fields are always false.
o Implementation: GMQL has the GNull data type; the implementation of nodes where the computation may have to deal with NULL values should be changed (e.g., by adding a pre-filtering step).
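The semantics stated above can be illustrated with a small sketch. This is not GMQL's implementation (which uses the GNull data type); it is a Python illustration in which `None` stands in for NULL, and the function names `avg_ignoring_null` and `predicate_on_nullable` are hypothetical. Returning `None` when every input is NULL is an assumption, since the issue does not say what an aggregate over only NULL values should yield.

```python
from typing import Callable, Iterable, Optional


def avg_ignoring_null(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average the non-NULL values; NULLs (None here) are pre-filtered out."""
    present = [v for v in values if v is not None]
    if not present:
        # Assumption: an aggregate over only NULL values is itself NULL.
        return None
    return sum(present) / len(present)


def predicate_on_nullable(value: Optional[float],
                          test: Callable[[float], bool]) -> bool:
    """Evaluate a Boolean predicate; any predicate on NULL is false."""
    return value is not None and test(value)
```

Note that under this rule both a predicate and its negation are false for a NULL field, so `predicate_on_nullable(None, lambda v: v > 0)` and `predicate_on_nullable(None, lambda v: v <= 0)` both return `False`.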

marcomass commented 7 years ago

In particular, test that "null" values, e.g. those introduced by the UNION() operation when applied to two datasets with different schemas, are correctly managed.

Verify that the presence of null values (i.e. the value "null" or an empty string, "") does not alter the calculation of aggregate functions (e.g. AVERAGE) on numeric region attributes that include such values. Provide here the GMQL queries used for such testing.
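As a rough illustration of the check requested here, the sketch below parses raw attribute tokens, maps "null" and the empty string to a missing value, and averages only what remains. It is written in Python rather than GMQL, the helper names `parse_numeric` and `average` are hypothetical, and matching "null" case-insensitively is an assumption not stated in the issue.

```python
from typing import Iterable, Optional


def parse_numeric(token: str) -> Optional[float]:
    """Turn a raw region-attribute token into a number, or None for a null."""
    stripped = token.strip()
    # Assumption: "null" is matched case-insensitively; "" is also a null.
    if stripped == "" or stripped.lower() == "null":
        return None
    return float(stripped)


def average(tokens: Iterable[str]) -> Optional[float]:
    """Average the numeric tokens, leaving null tokens out of sum and count."""
    values = [v for v in map(parse_numeric, tokens) if v is not None]
    return sum(values) / len(values) if values else None
```

With the tokens `["2", "null", "", "4"]`, only 2 and 4 contribute, so the average is 3.0. Treating the two nulls as zeros would instead give 1.5, which is the kind of distortion the test is meant to rule out.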

akaitoua commented 7 years ago

@OlgaGorlova, please merge your branch, label this issue as test, and close it.

marcomass commented 7 years ago

@OlgaGorlova Hi Olya, is the average issue regarding null values fixed now? If yes, please confirm; if not, please reopen this issue.

OlgaGorlova commented 7 years ago

Hi @marcomass , Yes, it is fixed now.

Erlaad commented 7 years ago

Tested and successfully fixed.

marcomass commented 7 years ago

Tested by Stefano P (Erlaad) with the following query:

RAW = SELECT(clinical_follow_uptumor_status == 'with tumor' AND manually_curateddataType == 'dnamethylation27' AND clinical_follow_up__new_tumor_event_type == 'distant metastasis') HG19_TCGA_dnamethylation;
TEST = COVER(1,ANY; aggregate: new_beta_value AS AVG(beta_value)) RAW;
MATERIALIZE RAW into raw;
MATERIALIZE TEST into test;

Erlaad commented 7 years ago

Errata corrige: the above query doesn't work because of an issue related to #61. However, I have a new query, which I tested and which works:

A = SELECT(assay == "ChIP-seq" AND biosample_term_name == "HepG2" AND experiment_target == "CEBPZ-human") HG19_ENCODE_NARROW_AUG_2017;
A1 = PROJECT(region_update: new_field AS null(INTEGER)) A;
A2 = PROJECT(region_update: new_field AS 2) A;
B = UNION() A1 A2;
B1 = COVER(1,ANY; aggregate: new_field_avg AS AVG(new_field)) B;
MATERIALIZE B into raw;
MATERIALIZE B1 into test;