DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

MERGE error when merging several samples #62

Closed marcomass closed 7 years ago

marcomass commented 7 years ago

In the following query, MERGE is applied to a dataset of 89 samples and gives the error below (with a lower number of samples it works):

Data = SELECT(Assay == "ChIP-seq" AND Biosample_term_name == "H1-hESC" AND Output_type == "peaks") HG19_ENCODE_NARROW_MAY_2017;
Merged = MERGE() Data;
MATERIALIZE Merged INTO Merged;

logjob_test_merge_guest_new671_20170801_180015 ....
2017-08-01 18:42:43,584 ERROR [GMQLSparkExecutor] Job aborted due to stage failure: ResultStage 14 (saveAsHadoopDataset at writeMultiOutputFiles.scala:86) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 7 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convert
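For context, a MetadataFetchFailedException during a shuffle usually means the executor that held the map output was lost, typically after running out of memory. Below is a minimal sketch of the Spark memory and partitioning settings commonly tuned in this situation, assuming the job is configured programmatically; the values and app name are illustrative, not the actual cluster configuration:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Illustrative values only: a larger executor heap, extra YARN overhead and
// more shuffle partitions all reduce how much data each shuffle task
// must hold in memory at once.
val conf = new SparkConf()
  .setAppName("gmql-merge-debug")                    // hypothetical app name
  .set("spark.executor.memory", "8g")                // assumption: bigger heap per executor
  .set("spark.yarn.executor.memoryOverhead", "2048") // off-heap room for shuffle buffers (MB)
  .set("spark.default.parallelism", "400")           // more, smaller shuffle partitions
val sc = new SparkContext(conf)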

marcomass commented 7 years ago

The same issue applies to the following query with a COVER over 89 samples (to be used for further testing):

Data = SELECT(Assay == "ChIP-seq" AND Biosample_term_name == "H1-hESC" AND Output_type == "peaks") HG19_ENCODE_NARROW_MAY_2017;
Covered = COVER(1,ANY; groupby: Experiment_target) Data;
MATERIALIZE Covered INTO Covered;

Without the groupby option, it runs and completes correctly, but with the groupby option it produces the error below.

2017-08-02 15:06:19,189 WARN [TaskSetManager] Lost task 24.0 in stage 29.0 (TID 292, genomic.elet.polimi.it, executor 1): java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:189)

2017-08-02 15:06:20,543 ERROR [YarnScheduler] Lost executor 1 on genomic.elet.polimi.it: Container marked as failed: container_1497541929260_0599_01_000002 on host: genomic.elet.polimi.it. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

2017-08-02 15:29:03,539 ERROR [TransportResponseHandler] Still have 1 requests outstanding when connection from /131.175.120.18:42427 is closed

2017-08-02 15:29:03,549 ERROR [ContextCleaner] Error cleaning broadcast 16 org.apache.spark.SparkException: Exception thrown in awaitResult

2017-08-02 15:43:57,970 ERROR [GMQLSparkExecutor] Job aborted due to stage failure: ResultStage 29 (saveAsHadoopDataset at writeMultiOutputFiles.scala:86) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: /home/tmp/hadoop-hduser/nm-local-dir/usercache/tomcat7/appcache/application_1497541929260_0599/blockmgr-25fde149-b6b5-4cab-b9f1-d0f9d5f4c29b/3a/shuffle_16_9_0.index (No such file or directory) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
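The GC overhead error that appears only with the groupby option is consistent with a groupByKey-style shuffle that materializes every sample of an Experiment_target group on a single executor at once. As an illustration of the general RDD pattern only (not the actual COVER implementation), combining values map-side keeps the shuffle small:

import org.apache.spark.rdd.RDD

object GroupingSketch {
  // Hypothetical pairs of (groupKey, regionCount).
  // groupByKey would pull every value of a key into memory before combining:
  //   pairs.groupByKey().mapValues(_.sum)   // memory-hungry on large groups
  // reduceByKey combines partial sums map-side, so far less data is shuffled.
  def countPerGroup(pairs: RDD[(String, Long)]): RDD[(String, Long)] =
    pairs.reduceByKey(_ + _)
}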

marcomass commented 7 years ago

@akaitoua
I tested and unfortunately both example queries still give the same error. Please try to fix the issue as you illustrated, so as to eliminate its cause and avoid all related issues.

akaitoua commented 7 years ago

@marcomass, the sorting at the end of each job kills the job when the files are too big to fit in memory. I took out the fancy sorting of the output samples; this will fix the problem. Currently I can't find a solution to put the sorting back without memory problems.
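If that final ordering was done by pulling whole samples into memory, one option would be to delegate it to Spark's external sorter, which keeps the ordering while spilling to disk when needed. A minimal sketch under that assumption; the record layout is hypothetical, not GMQL's internal schema:

import org.apache.spark.rdd.RDD

object SortSketch {
  // Hypothetical region record: (sampleId, chromosome, start, outputLine).
  // sortBy runs a distributed, disk-spilling sort instead of an in-memory one.
  def sortRegions(regions: RDD[(Long, String, Long, String)]): RDD[(Long, String, Long, String)] =
    regions.sortBy(r => (r._1, r._2, r._3))  // by sample id, chromosome, start
}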

marcomass commented 7 years ago

@akaitoua I noticed that the metadata of each sample are no longer ordered; can you also restore that ordering, which is needed to allow evaluation of the obtained metadata?
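Since the metadata of a sample are only a small set of attribute/value pairs, ordering them should not reintroduce the memory problem seen with the region sort. A possible sketch, again with a hypothetical record layout rather than GMQL's actual one:

import org.apache.spark.rdd.RDD

object MetaSortSketch {
  // Hypothetical metadata record: (sampleId, (attribute, value)).
  // Each sample's metadata is tiny, so sorting it within the group is cheap.
  def orderedMeta(meta: RDD[(Long, (String, String))]): RDD[(Long, Seq[(String, String)])] =
    meta.groupByKey().mapValues(_.toSeq.sorted)  // order pairs by attribute, then value
}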