bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord #1240

Closed · Fei-Guang closed this issue 7 years ago

Fei-Guang commented 7 years ago

I ran the following code in IntelliJ IDEA:

import org.apache.spark.sql.SparkSession
import org.bdgenomics.adam.rdd.ADAMContext._ // brings sc.loadAlignments into scope

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("Anno BDG")
  .getOrCreate()

//set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.conf.set("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
val sc = spark.sparkContext
val reads = sc.loadAlignments("/data/sample.rmdup.bam")
val lines = sc.textFile("/data/win_100k.use_50mer")
// conceptually, we could just do reads.rdd.zip(lines), but!
// we aren't guaranteed that both RDDs have the same number
// of records in each partition, so zipWithIndex followed by join
// is (slower, but) safer
val zippedLinesAndReads = reads.rdd
  .zipWithIndex
  .map(_.swap)
  .join(lines.zipWithIndex.map(_.swap))

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (_, (read, line)) = kv

  // get the range from the rdd2.kmer file
  val columns = line.split("\t") // I assume this is tab delimited?
  val start = columns(4).toLong
  val end = columns(5).toLong

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= read.getStart && read.getStart < end) {
    Some((read.getContigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)
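
One note on the last step: reduceByKeyLocally returns the reduced pairs as a plain Scala Map on the driver rather than another RDD, so the per-chromosome counts can be inspected directly. A minimal sketch, reusing countsByChromosome from the snippet above:

// countsByChromosome is a local scala.collection.Map keyed by contig name,
// so no further Spark action is needed to look at the counts.
countsByChromosome.foreach { case (contig, count) =>
  println(s"$contig\t$count")
}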

My environment:

spark-2.0.1-bin-hadoop2.6
adam-distribution-spark2_2.11-0.20.0
scala-2.11.8

It reports the following error:

java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord
Serialization stack:

Process finished with exit code 1

Fei-Guang commented 7 years ago

> Hi @Fei-Guang! Where are you running that? Are you running that in spark-shell? ADAM relies on a custom Kryo serializer Registrator for serialization. If you use ./bin/adam-shell, this starts a Spark shell where the serialization config (and classpath) are set up.

Hello @fnothaft, I ran it in IntelliJ IDEA. How do I register a Kryo serializer when running from IDEA?
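
For anyone reading along: the serializer settings have to be in place before the SparkContext is created; calling spark.conf.set(...) after getOrCreate() does not change the serializer. A minimal sketch of supplying them on the SparkSession builder instead (assuming ADAM 0.20.0 and Spark 2.x as in the environment above; adjust the import if your ADAM version differs):

import org.apache.spark.sql.SparkSession
import org.bdgenomics.adam.rdd.ADAMContext._

// Kryo settings are read when the SparkContext starts, so pass them to the
// builder (or a SparkConf) instead of setting them after getOrCreate().
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("Anno BDG")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
  .getOrCreate()

val sc = spark.sparkContext
val reads = sc.loadAlignments("/data/sample.rmdup.bam")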

Fei-Guang commented 7 years ago

"$SPARK_SHELL" \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --conf spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \

spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.conf.set("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
Fei-Guang commented 7 years ago

It's a Spark bug.

avkonst commented 7 years ago

"it's spark bug" is there a link to it? what version of spark where it is solved?