cupuyc / scala-kaggle

Scala Spark code for Kaggle competitions

Runtime error while running #1

Open dazzyduck opened 8 years ago

dazzyduck commented 8 years ago

Hi,

I am getting the following errors while running your code for the Expedia Hotel Recommendation system.

It seems like there is some error in the submission step. Could you please check?

val test_union = selectWithRowNumber(test_join_1
  .unionAll(test_join_2)
  .unionAll(test_join_3)
  .unionAll(test_remainder), w4, RN_ALL, false)
  .orderBy(ID, RN_ALL)

// test_union.show(5)

val submission = test_union
  .orderBy(ID, RN_ALL)
  .rdd
  .map(x => (x.getInt(0), List(x.getInt(1))))
  // .map(x => (x.id, [x.hotel_cluster,]))
  .reduceByKey((a, b) => a ++ b)
  .mapValues(x => (x ++ top5_bc.value).take(5))
  .mapValues(x => x.take(5).mkString(" "))
  .map(x => Row(x._1, x._2))
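
The NumberFormatException in the log below is thrown while spark-csv casts a CSV field to Int and the value is null. As an illustration only (this is not code from the repository), the failing cast boils down to:

val datum: String = null
// StringOps.toInt ends up in Integer.parseInt(null), which throws
// "java.lang.NumberFormatException: null" exactly as in the executor traces.
datum.toInt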

16/06/09 01:08:06 INFO GenerateUnsafeProjection: Code generated in 154.744507 ms
16/06/09 01:08:06 ERROR Executor: Exception in task 0.0 in stage 18.0 (TID 1418)
java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:454)
    at java.lang.Integer.parseInt(Integer.java:527)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
    at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:61)
    at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
    at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryColumnarTableScan.scala:169)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
16/06/09 01:08:06 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 1419)
java.lang.NumberFormatException: null
    [same stack trace as above]
16/06/09 01:08:06 INFO TaskSetManager: Starting task 2.0 in stage 18.0 (TID 1420, localhost, partition 2,PROCESS_LOCAL, 2130 bytes)
16/06/09 01:08:06 INFO Executor: Running task 2.0 in stage 18.0 (TID 1420)
16/06/09 01:08:06 WARN TaskSetManager: Lost task 1.0 in stage 18.0 (TID 1419, localhost): java.lang.NumberFormatException: null
    [same stack trace as above]

16/06/09 01:08:06 ERROR TaskSetManager: Task 1 in stage 18.0 failed 1 times; aborting job
16/06/09 01:08:06 INFO TaskSetManager: Lost task 0.0 in stage 18.0 (TID 1418) on executor localhost: java.lang.NumberFormatException (null) [duplicate 1]
16/06/09 01:08:06 INFO TaskSchedulerImpl: Cancelling stage 16
16/06/09 01:08:06 INFO CacheManager: Partition rdd_116_2 not found, computing it
16/06/09 01:08:06 INFO HadoopRDD: Input split: file:/home/xyz/expedia/test.csv:67108864+33554432
16/06/09 01:08:06 ERROR Executor: Exception in task 2.0 in stage 18.0 (TID 1420)
java.lang.NumberFormatException: null
    [same stack trace as above]
16/06/09 01:08:06 INFO Executor: Executor is trying to kill task 121.0 in stage 16.0 (TID 1417)
16/06/09 01:08:06 INFO Executor: Executor is trying to kill task 120.0 in stage 16.0 (TID 1416)
16/06/09 01:08:06 INFO TaskSchedulerImpl: Stage 16 was cancelled
16/06/09 01:08:06 INFO DAGScheduler: ShuffleMapStage 16 (persist at expedia.scala:171) failed in 487.259 s
16/06/09 01:08:06 INFO TaskSetManager: Lost task 2.0 in stage 18.0 (TID 1420) on executor localhost: java.lang.NumberFormatException (null) [duplicate 2]
16/06/09 01:08:06 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks have all completed, from pool
16/06/09 01:08:06 WARN TaskMemoryManager: leak 16.5 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@69166d0a
16/06/09 01:08:06 ERROR Executor: Managed memory leak detected; size = 17301504 bytes, TID = 1416
16/06/09 01:08:06 INFO Executor: Executor killed task 120.0 in stage 16.0 (TID 1416)
16/06/09 01:08:06 INFO TaskSchedulerImpl: Cancelling stage 18
16/06/09 01:08:06 INFO DAGScheduler: ShuffleMapStage 18 (rdd at expedia.scala:146) failed in 487.235 s
16/06/09 01:08:06 WARN TaskMemoryManager: leak 16.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap@2d3e13c
16/06/09 01:08:06 ERROR Executor: Managed memory leak detected; size = 17039360 bytes, TID = 1417
16/06/09 01:08:06 INFO Executor: Executor killed task 121.0 in stage 16.0 (TID 1417)
16/06/09 01:08:06 INFO DAGScheduler: Job 8 failed: rdd at expedia.scala:146, took 487.314512 s
Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(id#150 ASC,rn_all#298 ASC,200), None
+- Filter (rn_all#298 <= 5)
   +- Window [id#150,hotel_cluster#23,rn#31], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() windowspecdefinition(id#150,rn#31 ASC,ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS rn_all#298], [id#150], [rn#31 ASC]
  +- Sort [id#150 ASC,rn#31 ASC], false, 0
     +- TungstenExchange hashpartitioning(id#150,200), None
        +- Union
           :- Sort [id#150 ASC,rn#31 ASC], true, 0
           :  +- ConvertToUnsafe
           :     +- Exchange rangepartitioning(id#150 ASC,rn#31 ASC,200), None
           :        +- ConvertToSafe
           :           +- Project [id#150,hotel_cluster#23,rn#31]
           :              +- SortMergeJoin [user_location_city#156,orig_destination_distance#174], [user_location_city#5,orig_destination_distance#28]
           :                 :- Sort [user_location_city#156 ASC,orig_destination_distance#174 ASC], false, 0
           :                 :  +- TungstenExchange hashpartitioning(user_location_city#156,orig_destination_distance#174,200), None
           :                 :     +- InMemoryColumnarTableScan [id#150,orig_destination_distance#174,user_location_city#156], InMemoryRelation [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,(orig_destination_distance#157 * 100000.0) AS orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], None
           :                 +- InMemoryColumnarTableScan [user_location_city#5,rn#31,orig_destination_distance#28,hotel_cluster#23], InMemoryRelation [user_location_city#5,orig_destination_distance#28,hotel_cluster#23,cnt#30L,rn#31], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
           :- Sort [id#150 ASC,rn#295 ASC], true, 0
           :  +- ConvertToUnsafe
           :     +- Exchange rangepartitioning(id#150 ASC,rn#295 ASC,200), None
           :        +- ConvertToSafe
           :           +- Project [id#150,hotel_cluster#23,(rn#63 * 10) AS rn#295]
           :              +- SortMergeJoin [srch_destination_id#167,hotel_country#172,hotel_market#173], [srch_destination_id#16,hotel_country#21,hotel_market#22]
           :                 :- Sort [srch_destination_id#167 ASC,hotel_country#172 ASC,hotel_market#173 ASC], false, 0
           :                 :  +- TungstenExchange hashpartitioning(srch_destination_id#167,hotel_country#172,hotel_market#173,200), None
           :                 :     +- InMemoryColumnarTableScan [id#150,hotel_country#172,srch_destination_id#167,hotel_market#173], InMemoryRelation [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,(orig_destination_distance#157 * 100000.0) AS orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], None
           :                 +- InMemoryColumnarTableScan [hotel_market#22,hotel_country#21,rn#63,hotel_cluster#23,srch_destination_id#16], InMemoryRelation [srch_destination_id#16,hotel_country#21,hotel_market#22,hotel_cluster#23,sum_wb#62L,rn#63], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
           :- Sort [id#150 ASC,rn#296 ASC], true, 0
           :  +- ConvertToUnsafe
           :     +- Exchange rangepartitioning(id#150 ASC,rn#296 ASC,200), None
           :        +- ConvertToSafe
           :           +- Project [id#150,hotel_cluster#23,(rn#100 * 100) AS rn#296]
           :              +- SortMergeJoin [srch_destination_id#167], [srch_destination_id#16]
           :                 :- Sort [srch_destination_id#167 ASC], false, 0
           :                 :  +- TungstenExchange hashpartitioning(srch_destination_id#167,200), None
           :                 :     +- InMemoryColumnarTableScan [id#150,srch_destination_id#167], InMemoryRelation [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,(orig_destination_distance#157 * 100000.0) AS orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], None
           :                 +- InMemoryColumnarTableScan [rn#100,srch_destination_id#16,hotel_cluster#23], InMemoryRelation [srch_destination_id#16,hotel_cluster#23,sum_wb#99L,rn#100], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
           +- Project [id#150,hotel_cluster#23,999 AS rn#297]
              +- BroadcastNestedLoopJoin BuildRight, Inner, None
                 :- Except
                 :  :- Except
                 :  :  :- Except
                 :  :  :  :- ConvertToSafe
                 :  :  :  :  +- InMemoryColumnarTableScan [id#150], InMemoryRelation [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,(orig_destination_distance#157 * 100000.0) AS orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], None
                 :  :  :  +- ConvertToSafe
                 :  :  :     +- TungstenAggregate(key=[id#150], functions=[], output=[id#150])
                 :  :  :        +- TungstenExchange hashpartitioning(id#150,200), None
                 :  :  :           +- TungstenAggregate(key=[id#150], functions=[], output=[id#150])
                 :  :  :              +- Project [id#150]
                 :  :  :                 +- Sort [id#150 ASC,rn#31 ASC], true, 0
                 :  :  :                    +- ConvertToUnsafe
                 :  :  :                       +- Exchange rangepartitioning(id#150 ASC,rn#31 ASC,200), None
                 :  :  :                          +- ConvertToSafe
                 :  :  :                             +- Project [id#150,rn#31]
                 :  :  :                                +- SortMergeJoin [user_location_city#156,orig_destination_distance#174], [user_location_city#5,orig_destination_distance#28]
                 :  :  :                                   :- Sort [user_location_city#156 ASC,orig_destination_distance#174 ASC], false, 0
                 :  :  :                                   :  +- TungstenExchange hashpartitioning(user_location_city#156,orig_destination_distance#174,200), None
                 :  :  :                                   :     +- InMemoryColumnarTableScan [id#150,orig_destination_distance#174,user_location_city#156], InMemoryRelation [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,(orig_destination_distance#157 * 100000.0) AS orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], None
                 :  :  :                                   +- InMemoryColumnarTableScan [user_location_city#5,rn#31,orig_destination_distance#28], InMemoryRelation [user_location_city#5,orig_destination_distance#28,hotel_cluster#23,cnt#30L,rn#31], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
                 :  :  +- ConvertToSafe
                 :  :     +- TungstenAggregate(key=[id#150], functions=[], output=[id#150])
                 :  :        +- TungstenExchange hashpartitioning(id#150,200), None
                 :  :           +- TungstenAggregate(key=[id#150], functions=[], output=[id#150])
                 :  :              +- Project [id#150]
                 :  :                 +- Sort [id#150 ASC,rn#295 ASC], true, 0
                 :  :                    +- ConvertToUnsafe
                 :  :                       +- Exchange rangepartitioning(id#150 ASC,rn#295 ASC,200), None
                 :  :                          +- ConvertToSafe
                 :  :                             +- Project [id#150,(rn#63 * 10) AS rn#295]
                 :  :                                +- SortMergeJoin [srch_destination_id#167,hotel_country#172,hotel_market#173], [srch_destination_id#16,hotel_country#21,hotel_market#22]
                 :  :                                   :- Sort [srch_destination_id#167 ASC,hotel_country#172 ASC,hotel_market#173 ASC], false, 0
                 :  :                                   :  +- TungstenExchange hashpartitioning(srch_destination_id#167,hotel_country#172,hotel_market#173,200), None
                 :  :                                   :     +- InMemoryColumnarTableScan [id#150,hotel_country#172,srch_destination_id#167,hotel_market#173], InMemoryRelation [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,(orig_destination_distance#157 * 100000.0) AS orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], None
                 :  :                                   +- InMemoryColumnarTableScan [hotel_market#22,hotel_country#21,srch_destination_id#16,rn#63], InMemoryRelation [srch_destination_id#16,hotel_country#21,hotel_market#22,hotel_cluster#23,sum_wb#62L,rn#63], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
                 :  +- ConvertToSafe
                 :     +- TungstenAggregate(key=[id#150], functions=[], output=[id#150])
                 :        +- TungstenExchange hashpartitioning(id#150,200), None
                 :           +- TungstenAggregate(key=[id#150], functions=[], output=[id#150])
                 :              +- Project [id#150]
                 :                 +- Sort [id#150 ASC,rn#296 ASC], true, 0
                 :                    +- ConvertToUnsafe
                 :                       +- Exchange rangepartitioning(id#150 ASC,rn#296 ASC,200), None
                 :                          +- ConvertToSafe
                 :                             +- Project [id#150,(rn#100 * 100) AS rn#296]
                 :                                +- SortMergeJoin [srch_destination_id#167], [srch_destination_id#16]
                 :                                   :- Sort [srch_destination_id#167 ASC], false, 0
                 :                                   :  +- TungstenExchange hashpartitioning(srch_destination_id#167,200), None
                 :                                   :     +- InMemoryColumnarTableScan [id#150,srch_destination_id#167], InMemoryRelation [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#150,date_time#151,site_name#152,posa_continent#153,user_location_country#154,user_location_region#155,user_location_city#156,(orig_destination_distance#157 * 100000.0) AS orig_destination_distance#174,user_id#158,is_mobile#159,is_package#160,channel#161,srch_ci#162,srch_co#163,srch_adults_cnt#164,srch_children_cnt#165,srch_rm_cnt#166,srch_destination_id#167,srch_destination_type_id#168,is_booking#169,cnt#170,hotel_continent#171,hotel_country#172,hotel_market#173], None
                 :                                   +- InMemoryColumnarTableScan [rn#100,srch_destination_id#16], InMemoryRelation [srch_destination_id#16,hotel_cluster#23,sum_wb#99L,rn#100], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
                 +- Limit 5
                    +- ConvertToSafe
                       +- InMemoryColumnarTableScan [hotel_cluster#23], InMemoryRelation [hotel_cluster#23,sum_wb#126L], true, 10000, StorageLevel(true, true, false, true, 1), Sort [sum_wb#126L DESC], true, 0, None

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.rdd$lzycompute(DataFrame.scala:1637)
at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:1634)
at expedia$.init_data(expedia.scala:146)
at expedia$delayedInit$body.apply(expedia.scala:40)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at expedia$.main(expedia.scala:17)
at expedia.main(expedia.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenExchange hashpartitioning(id#150,200), None
+- Union
   [same Union subtree as in the physical plan above]

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Window.doExecute(Window.scala:245)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:164)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 40 more

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(id#150 ASC,rn#31 ASC,200), None
+- ConvertToSafe
   +- Project [id#150,hotel_cluster#23,rn#31]
      +- SortMergeJoin [user_location_city#156,orig_destination_distance#174], [user_location_city#5,orig_destination_distance#28]
         [same SortMergeJoin children as in the physical plan above]

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
at org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.execution.Union.doExecute(basicOperators.scala:144)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:164)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 64 more

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 18.0 failed 1 times, most recent failure: Lost task 1.0 in stage 18.0 (TID 1419, localhost): java.lang.NumberFormatException: null
    [same stack trace as above]

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
    at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:179)
    at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
    at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
    ... 95 more
Caused by: java.lang.NumberFormatException: null
    [same stack trace as above]
16/06/09 01:08:06 WARN TaskSetManager: Lost task 120.0 in stage 16.0 (TID 1416, localhost): TaskKilled (killed intentionally)
16/06/09 01:08:07 WARN TaskSetManager: Lost task 121.0 in stage 16.0 (TID 1417, localhost): TaskKilled (killed intentionally)
16/06/09 01:08:07 INFO TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
16/06/09 01:08:07 INFO SparkContext: Invoking stop() from shutdown hook
16/06/09 01:08:07 INFO SparkUI: Stopped Spark web UI at http://192.168.1.151:4040
16/06/09 01:08:07 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/09 01:08:08 INFO MemoryStore: MemoryStore cleared
16/06/09 01:08:08 INFO BlockManager: BlockManager stopped
16/06/09 01:08:08 INFO BlockManagerMaster: BlockManagerMaster stopped
16/06/09 01:08:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/09 01:08:08 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/06/09 01:08:08 INFO SparkContext: Successfully stopped SparkContext
16/06/09 01:08:08 INFO ShutdownHookManager: Shutdown hook called
16/06/09 01:08:08 INFO ShutdownHookManager: Deleting directory /tmp/spark-f6a9a447-2c75-426d-b0ad-889bce7b1765
16/06/09 01:08:08 INFO ShutdownHookManager: Deleting directory /tmp/spark-d24d1363-0eaf-4093-b9b5-2ab56613778c
16/06/09 01:08:08 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

Process finished with exit code 1

cupuyc commented 8 years ago

Hello, thank you for the feedback. You are the first person to file an issue with me.

Unfortunately I can't solve this issue quickly, as I'm not sure why it happened. For me it works both when submitted to spark-1.6.1-bin-hadoop2.6 and when launched as the main class in the IDEA IDE.

I saw a similar strange issue with a data parsing exception. I then added dummy code for checking CSV values before importing. I have no idea why, but that helped. I also noticed that it works when the amount of input data is small; there is a way to test on a smaller dataset by switching val IS_TEST_RUN = true.
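
A minimal sketch of what such a pre-import check could look like, assuming Spark 1.6 with the com.databricks:spark-csv reader and a nullable id column in test.csv (the helper name and column handling are illustrative, not the repository's actual code):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

// Load every column as a string first, then cast defensively: cast("int")
// turns null/empty/non-numeric values into null instead of throwing
// NumberFormatException inside spark-csv's TypeCast.
def loadTestCsv(sqlContext: SQLContext, path: String): DataFrame = {
  val raw = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load(path)

  raw.withColumn("id_checked", col("id").cast("int"))
    .filter(col("id_checked").isNotNull)
}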

How do you launch the program? What OS do you have, and how much memory?

Stan

dazzyduck commented 8 years ago

Hello Stan,

Thanks for the response. My environment is the same as yours: spark-1.6.1-bin-hadoop2.6 with spark-assembly-1.6.0-hadoop2.6.0. I tried the IS_TEST_RUN = true option too, but the same issue happens.

I'm using the IntelliJ IDEA IDE to launch the program. The OS is Ubuntu 14.04 running in VirtualBox, with 12 GB of memory allocated.

Thanks, Dazzy.


cupuyc commented 8 years ago

I tried on Ubuntu 14 in Vagrant via sbt "run-main io.github.stanreshetnyk.ExpediaSpark", but found other errors. One of them is:

java.io.IOException: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.HashMap$SerializationProxy to field org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type scala.collection.immutable.Map in instance of org.apache.spark.executor.TaskMetrics

At the moment I don't know how to fight that. I suspect it works differently on macOS and Ubuntu. (This kind of ClassCastException is commonly reported when Spark runs inside sbt's own JVM without forking, which matches the fork in run := true fix mentioned below.)

cupuyc commented 8 years ago

Dazzy,

I tried one more time and was able to execute sbt clean run successfully on Ubuntu (or sbt "run-main io.github.stanreshetnyk.expedia.ExpediaSpark"). I've never seen an error like yours, but I saw other ones. Changes that helped for me (see the build.sbt sketch below):

  • configuring Oracle Java 8 as the default for Ubuntu, even though I set target 1.7 in sbt;
  • adding fork in run := true to build.sbt
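
A minimal build.sbt sketch of that change (only the fork setting comes from this thread; the name and Scala version are placeholders):

// build.sbt — sketch; name and version values are placeholders
name := "scala-kaggle"

scalaVersion := "2.10.6"

// Run the Spark job in a forked JVM instead of inside sbt's own JVM,
// which avoids classloader clashes like the TaskMetrics ClassCastException above.
fork in run := true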

Stan

dazzyduck commented 8 years ago

Hi Stan,

Do you know how to change / check these sbt configurations in IntelliJ?

Thanks, Dazzy


cupuyc commented 8 years ago

It's possible to change the JDK version under the project settings. I'm not sure about the other settings. IDEA seems to launch execution in a totally different way.

Stan