problems with collect - GitHub issues

piccolbo commented 9 years ago

dim(collect(flights_SparkSQL))
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  : 
  Unable to retrieve JDBC result set for SELECT  year ,  month ,  day ,  dep_time ,  dep_delay ,  arr_time ,  arr_delay ,  carrier ,  tailnum ,  flight ,  origin ,  dest ,  air_time ,  distance ,  hour ,  minute 
FROM  flights  (org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 65.0 failed 1 times, most recent failure: Lost task 1.0 in stage 65.0 (TID 1042, localhost): java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Output.require(Output.java:142)
    at com.esotericsoftware.kryo.io.Output.writeInt(Output.java:242)
    at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:95)
    at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:81)
    at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.

piccolbo commented 9 years ago

I looked into the alternative of writing to a file first. It looks like there are two pull requests for it, tracked here:

https://issues.apache.org/jira/browse/SPARK-4131

It is targeted for Spark 1.5 and still unresolved.

piccolbo commented 9 years ago

Some thriftserver options and some Spark configuration properties are relevant here.

Options: --driver-memory 1G --executor-memory 2G

Property: spark.kryoserializer.buffer.max.mb 128
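
For illustration, here is one way these settings might be passed when starting the thriftserver. This is a sketch under assumptions: the thread does not say how the server was launched, so the SPARK_HOME location and the use of the stock sbin/start-thriftserver.sh script are guesses, and passing the property via --conf is just one option (it could equally go in spark-defaults.conf).

```shell
# Sketch only: SPARK_HOME and the launcher script are assumptions, not from the thread.
# start-thriftserver.sh accepts the usual spark-submit options,
# including --driver-memory, --executor-memory and --conf.
"$SPARK_HOME"/sbin/start-thriftserver.sh \
  --driver-memory 1G \
  --executor-memory 2G \
  --conf spark.kryoserializer.buffer.max.mb=128
```

The values are the ones quoted above; they were enough to collect part of the table, not all of it.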

With these settings I managed to collect up to half of the flights table, at the breakneck speed of 150K data points per second.

piccolbo commented 9 years ago

Not always, but the thriftserver can terminate when collect fails this way.

piccolbo commented 9 years ago

Given this and the problems with #20, it may be preferable to use insert overwrite local directory and read from there (not supported yet; targeted for 1.5). That adds one write and one read, but it beats the totally broken fetch any time.

RevolutionAnalytics / dplyr-spark

problems with collect #16