Open · gowtham-kanagodu opened this issue 7 years ago
Thank you for reporting the issue. By any chance, do you have a more detailed stack trace? I am trying to figure out which part of the code ran into the Int.MaxValue limit. What value did you specify for numPartitions?
What action were you performing on the Spark side against the Netezza table? If you can share a repro, that would be very helpful.
I did df.count(). I was using 10 partitions. The stack trace is below; a repro sketch follows it.
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515)
at org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.
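For context, a minimal sketch of the kind of job that hits this, assuming the spark-netezza data source name and read options from the project README; the URL, credentials, and table name here are placeholders:

import org.apache.spark.sql.SQLContext

// Minimal repro sketch (placeholders throughout): read a large Netezza
// table through the connector with 10 partitions and run count().
val sqlContext = new SQLContext(sc)  // sc: the spark-shell SparkContext

val df = sqlContext.read
  .format("com.ibm.spark.netezza")
  .option("url", "jdbc:netezza://host:5480/database")  // placeholder URL
  .option("user", "user")                              // placeholder
  .option("password", "password")                      // placeholder
  .option("dbtable", "BIG_TABLE")                      // ~2664275514 rows
  .option("numPartitions", "10")
  .load()

df.count()  // fails with the IllegalArgumentException shown in the trace above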
Thank you for sharing the repro. The problem seems to be a bug in the optimization we did for count(). A sketch of the suspected failure mode follows.
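To illustrate the suspected mechanism (a hypothetical sketch based on the stack trace, not the connector's actual code): if the count() shortcut materializes the row count as a parallelized Long range rather than scanning the table, computing the RDD's partitions forces the range's length, and scala.collection.immutable.NumericRange refuses to report a length above Int.MaxValue:

// Hypothetical sketch of the suspected failure mode, an assumption
// based on the stack trace, not the connector's actual code.
val rowCount = 2664275514L
val rdd = sc.parallelize(1L to rowCount, numSlices = 10)

// Computing partitions calls ParallelCollectionRDD.slice, which takes the
// range's length; NumericRange.count then throws, matching the Caused-by:
//   java.lang.IllegalArgumentException: 1 to 2664275514 by 1:
//   seqs cannot contain more than Int.MaxValue elements.
rdd.count()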
Good to know that you were able to identify the issue. Keep us posted on the fix.
Is this issue resolved?
Detailed error log below. The table I am trying to access contains 2664275514 rows.
Caused by: java.lang.IllegalArgumentException: 1 to 2664275514 by 1: seqs cannot contain more than Int.MaxValue elements.
at scala.collection.immutable.NumericRange$.count(NumericRange.scala:249)
at scala.collection.immutable.NumericRange.numRangeElements$lzycompute(NumericRange.scala:53)
at scala.collection.immutable.NumericRange.numRangeElements(NumericRange.scala:52)
at scala.collection.immutable.NumericRange.length(NumericRange.scala:55)
at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:146)
at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
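The Caused-by can be reproduced in a plain Scala REPL, without Spark or the connector, because the limit lives in scala.collection.immutable.NumericRange itself:

// A Long range is built lazily, but forcing its length throws once the
// element count exceeds Int.MaxValue.
val range = 1L to 2664275514L  // fine, nothing is counted yet
range.length                   // java.lang.IllegalArgumentException:
                               // seqs cannot contain more than Int.MaxValue elements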