DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0
18 stars 11 forks source link

BAG - ordered items #26

Closed marcomass closed 7 years ago

marcomass commented 7 years ago

The BAG aggregation operator should include only distinct items (not all as now) from the aggregated elements, and list such items ordered.

akaitoua commented 7 years ago

This modification will result in lower performance. The BAG aggregation function will be changed from being associative function to nonAssociative function. Associative functions can be implemented as a map reduce procedures while the nonAssociative functions needs all the bad in an input list. Thus, for associative, there is no memory needed to hold a list of items of the bag and sort them. @marcomass if this is really needed, the implementation will change accordingly.

akaitoua commented 7 years ago

@marcomass , I had a simple fix for having distinct, without changing the function to nonAssociative. This will not always give exact distinct but will relax the output.

marcomass commented 7 years ago

@akaitoua Is this already committed?

marcomass commented 7 years ago

I tested the fix and unfortunately I found the issue not solved yet (thus I had to reopen the issue). Here is a query to test it: BROAD = SELECT(assay == 'ChIP-seq' AND output_type == 'peaks' AND biosample_term_name == 'K562' AND experiment_target == 'CTCF-human' AND experiment_accession == 'ENCSR000AKO') HG19_ENCODE_BROAD_NOV_2016; PROM = SELECT(annotation_type == 'promoter' AND original_provider == 'Campaner') HG19_BED_ANNOTATION; RES_0 = MAP() PROM BROAD; RES_1 = SELECT(region: count_PROM_BROAD > 0) RES_0; RES = EXTEND(EntrezGeneID AS BAG(name)) RES_1; MATERIALIZE RES INTO RES;

marcomass commented 7 years ago

Additionally, values in bag output should be comma separated, and bag does not collect null values as it should. The following query can be used for testing these aspects: DS1 = SELECT(assay == "ChIP-seq" AND file_accession == "ENCFF001USF") HG19_ENCODE_NARROW_NOV_2016; DS2 = SELECT(assay == "ChIP-seq" AND file_accession == "ENCFF001SST") HG19_ENCODE_BROAD_NOV_2016; DSU = UNION() DS1 DS2; DSUC = COVER(1, ANY; aggregate: bagsig AS BAG(signal), bagpeakname AS BAG(peak)) DSU; MATERIALIZE DSU INTO DSU; MATERIALIZE DSUC INTO DSUC;

marcomass commented 7 years ago

@akaitoua I tested the BAG on null values using the above provided query and I got the following error: 2017-09-07 00:31:49,839 ERROR [Executor] Exception in task 7.0 in stage 79.0 (TID 5517) java.lang.ClassCastException: it.polimi.genomics.core.GNull cannot be cast to it.polimi.genomics.core.GDouble at it.polimi.genomics.core.GValue$class.compare(DataTypes.scala:80) at it.polimi.genomics.core.GNull.compare(DataTypes.scala:177)

It seams there is some issue with the null values ... can you please check?
The other aspects of this issue seam fixed; should we reopen it?

akaitoua commented 7 years ago

@marcomass, fixed.

marcomass commented 7 years ago

@akaitoua Unfortunately running the same query (above) the same error is still present (see below). Please fix. I reopen the issue.

2017-09-07 19:13:26,470 ERROR [Executor] Exception in task 3.0 in stage 79.0 (TID 5513) java.lang.ClassCastException: it.polimi.genomics.core.GNull cannot be cast to it.polimi.genomics.core.GDouble at it.polimi.genomics.core.GValue$class.compare(DataTypes.scala:80) at it.polimi.genomics.core.GNull.compare(DataTypes.scala:177) at it.polimi.genomics.core.GNull.compare(DataTypes.scala:177) at scala.math.Ordered$class.compareTo(Ordered.scala:92) at it.polimi.genomics.core.GNull.compareTo(DataTypes.scala:177) at scala.math.LowPriorityOrderingImplicits$$anon$6.compare(Ordering.scala:150) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:351) at java.util.TimSort.sort(TimSort.java:216) at java.util.Arrays.sort(Arrays.java:1438) at scala.collection.SeqLike$class.sorted(SeqLike.scala:648) at scala.collection.AbstractSeq.sorted(Seq.scala:41) at it.polimi.genomics.GMQLServer.DefaultRegionsToRegionFactory$$anon$7$$anonfun$20.apply(DefaultRegionsToRegionFactory.scala:135)

marcomass commented 7 years ago

@akaitoua @acanakoglu Arif mentioned that in some cases (e.g., when BAG is used in MAP) it would be useful to have all values reported by BAG, not only the distinct ones (which is useful in other cases, e.g., when BAG is used in EXTEND). Could you provide two versions of bag, exactly with the same functionality, but one named BAGD returning distinct values, and another named BAG returning all values?

akaitoua commented 7 years ago

@marcomass @acanakoglu , Such BAG function can be added but it needs compiler changes.