DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Project bug when a coordinate is modified #48

Closed lucananni93 closed 7 years ago

lucananni93 commented 7 years ago

Example query:

DATA_SET_VAR = SELECT() HG19_ENCODE_NARROW;
PROJECTED = PROJECT(pvalue; region_update: start AS start + 1) DATA_SET_VAR;
MATERIALIZE PROJECTED INTO RESULT_DS;

The query fails with the following exception:

2017-06-21 15:07:32,809 INFO [SelectIMDWithNoIndex$] hg_narrowPeaks Selected: 4
2017-06-21 15:07:32,816 INFO [ProjectRD$] ----------------ProjectRD executing..
2017-06-21 15:07:32,818 INFO [SelectIRD$] ----------------SelectIRD 
2017-06-21 15:07:33,692 WARN [TaskSetManager] Stage 1 contains a task of very large size (444 KB). The maximum recommended task size is 100 KB.
2017-06-21 15:07:35,261 WARN [TaskSetManager] Stage 3 contains a task of very large size (444 KB). The maximum recommended task size is 100 KB.
2017-06-21 15:07:40,301 WARN [TaskSetManager] Lost task 0.0 in stage 7.0 (TID 6, genomic.elet.polimi.it, executor 2): java.lang.ArrayIndexOutOfBoundsException: 6
at it.polimi.genomics.spark.implementation.RegionsOperators.ProjectRD$$anonfun$apply$2$$anonfun$apply$3.apply(ProjectRD.scala:49)
at it.polimi.genomics.spark.implementation.RegionsOperators.ProjectRD$$anonfun$apply$2$$anonfun$apply$3.apply(ProjectRD.scala:49)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at it.polimi.genomics.spark.implementation.RegionsOperators.ProjectRD$$anonfun$apply$2.apply(ProjectRD.scala:49)
at it.polimi.genomics.spark.implementation.RegionsOperators.ProjectRD$$anonfun$apply$2.apply(ProjectRD.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:150)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

This happens because, in it.polimi.genomics.core.DataStructures.IRVariable.PROJECT, in the following part

val all_proj_values: Option[List[Int]] =
  if (new_projected_values.isDefined) {
    val list = new_projected_values.get
    val new_list =
      if (extended_values.isDefined) {
        list ++ ((this.schema.size) to (this.schema.size + extended_values.get.size - 1)).toList
      } else {
        list
      }
    Some(new_list)
  } else {
    None
  }

the list variable is extended with the wrong number of schema fields when only a coordinate is modified. We must also handle the case in which extended_values contains coordinates such as start, stop, strand, etc., which update existing fields instead of adding new ones.
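A minimal sketch of the intended fix (standalone, not the actual GMQL code: the object name, the helper signature, and the `coordinateNames` set are assumptions for illustration). The idea is that only extended values which add a genuinely new region attribute should contribute new schema indices; coordinate updates should not grow the index list.

```scala
object ProjectIndexFix {
  // Coordinate attributes modify existing region fields rather than
  // appending new columns to the schema (hypothetical name set).
  val coordinateNames: Set[String] = Set("chr", "start", "stop", "strand")

  // Given the current schema size, the optionally projected indices, and the
  // names of the extended values, return the corrected index list: only
  // non-coordinate extensions occupy new positions at the end of the schema.
  def allProjValues(schemaSize: Int,
                    projected: Option[List[Int]],
                    extendedNames: Option[List[String]]): Option[List[Int]] =
    projected.map { list =>
      // Count only extensions that truly add a new schema field.
      val added = extendedNames
        .map(_.count(n => !coordinateNames.contains(n.toLowerCase)))
        .getOrElse(0)
      list ++ (schemaSize until schemaSize + added).toList
    }
}
```

With this counting rule, a coordinate-only update such as `start AS start + 1` adds zero new indices, so the downstream code no longer reads past the end of the region array.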