bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord #1240

Closed · Fei-Guang closed this issue 7 years ago

Fei-Guang commented 7 years ago

I ran the following code in IntelliJ IDEA:

import org.apache.spark.sql.SparkSession
import org.bdgenomics.adam.rdd.ADAMContext._ // brings sc.loadAlignments into scope

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("Anno BDG")
  .getOrCreate()

//set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.conf.set("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
val sc = spark.sparkContext
val reads = sc.loadAlignments("/data/sample.rmdup.bam")
val lines = sc.textFile("/data/win_100k.use_50mer")
// conceptually, we could just do reads.rdd.zip(lines), but!
// we aren't guaranteed that both RDDs have the same number
// of records in each partition, so zipWithIndex followed by join
// is (slower, but) safer
val zippedLinesAndReads = reads.rdd
  .zipWithIndex
  .map(_.swap)
  .join(lines.zipWithIndex.map(_.swap))

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (_, (read, line)) = kv

  // get the range from the rdd2.kmer file
  val columns = line.split("\t") // I assume this is tab delimited?
  val start = columns(4).toLong
  val end = columns(5).toLong

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= read.getStart && read.getStart < end) {
    Some((read.getContigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)
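
One note on the last step: reduceByKeyLocally returns the reduced pairs as a plain Scala Map on the driver rather than another RDD, so the per-chromosome counts can be inspected directly. A minimal sketch, reusing countsByChromosome from the snippet above:

// countsByChromosome is a local scala.collection.Map keyed by contig name,
// so no further Spark action is needed to look at the counts.
countsByChromosome.foreach { case (contig, count) =>
  println(s"$contig\t$count")
}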

My environment:

spark-2.0.1-bin-hadoop2.6
adam-distribution-spark2_2.11-0.20.0
scala-2.11.8

It reports the following error:

java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord
Serialization stack:

Process finished with exit code 1

Fei-Guang commented 7 years ago

> Hi @Fei-Guang! Where are you running that? Are you running that in spark-shell? ADAM relies on a custom Kryo serializer Registrator for serialization. If you use ./bin/adam-shell, this starts a Spark shell where the serialization config (and classpath) are set up.

Hello @fnothaft, I ran it in IntelliJ IDEA. How do I register a Kryo serializer when running from IDEA?
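
For anyone reading along: the serializer settings have to be in place before the SparkContext is created; calling spark.conf.set(...) after getOrCreate() does not change the serializer. A minimal sketch of supplying them on the SparkSession builder instead (assuming ADAM 0.20.0 and Spark 2.x as in the environment above; adjust the import if your ADAM version differs):

import org.apache.spark.sql.SparkSession
import org.bdgenomics.adam.rdd.ADAMContext._

// Kryo settings are read when the SparkContext starts, so pass them to the
// builder (or a SparkConf) instead of setting them after getOrCreate().
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("Anno BDG")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
  .getOrCreate()

val sc = spark.sparkContext
val reads = sc.loadAlignments("/data/sample.rmdup.bam")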

Fei-Guang commented 7 years ago

"$SPARK_SHELL" \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --conf spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \

spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.conf.set("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
Fei-Guang commented 7 years ago

It's a Spark bug.

avkonst commented 7 years ago

"it's spark bug" is there a link to it? what version of spark where it is solved?