broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

PrintVariantsSpark crashes with serialization issues #3840

Open lbergelson opened 7 years ago

lbergelson commented 7 years ago

PrintVariantsSpark crashes on Dataproc with serialization issues.

Running:
    gcloud dataproc jobs submit spark --cluster gatk-test-8875b999-b609-4a3f-86ea-973b929fe662 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600 --jar gs://hellbender-test-logs/test/staging/lb_staging/gatk-package-4.beta.6-37-g0a135f8-SNAPSHOT-spark_7002d0551e84ddef0d74adf95dfee104.jar -- PrintVariantsSpark --V gs://hellbender/test/resources/large/gvcfs/gatk3.7_30_ga4f720357.24_sample.21.expected.vcf --output gs://hellbender-test-logs/test/staging/lb_staging/756f43e6-4663-49ce-8a8c-bf717b07a8c7.vcf --sparkMaster yarn
Job [dfac787d-19aa-4296-8078-c033cd9f440d] submitted.
Waiting for job output...
19:43:09.678 WARN  SparkContextFactory - Environment variables HELLBENDER_TEST_PROJECT and HELLBENDER_JSON_SERVICE_ACCOUNT_KEY must be set or the GCS hadoop connector will not be configured properly
19:43:09.837 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/tmp/dfac787d-19aa-4296-8078-c033cd9f440d/gatk-package-4.beta.6-37-g0a135f8-SNAPSHOT-spark_7002d0551e84ddef0d74adf95dfee104.jar!/com/intel/gkl/native/libgkl_compression.so
[November 15, 2017 7:43:09 PM UTC] PrintVariantsSpark  --output gs://hellbender-test-logs/test/staging/lb_staging/756f43e6-4663-49ce-8a8c-bf717b07a8c7.vcf --variant gs://hellbender/test/resources/large/gvcfs/gatk3.7_30_ga4f720357.24_sample.21.expected.vcf --sparkMaster yarn  --variantShardSize 10000 --variantShardPadding 1000 --shuffle false --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --bamPartitionSize 0 --disableSequenceDictionaryValidation false --shardedOutput false --numReducers 0 --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[November 15, 2017 7:43:09 PM UTC] Executing as root@gatk-test-8875b999-b609-4a3f-86ea-973b929fe662-m on Linux 3.16.0-4-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-1~bpo8+1-b11; Version: 4.beta.6-37-g0a135f8-SNAPSHOT
19:43:09.992 INFO  PrintVariantsSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 1
19:43:09.992 INFO  PrintVariantsSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:43:09.993 INFO  PrintVariantsSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
19:43:09.993 INFO  PrintVariantsSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:43:09.993 INFO  PrintVariantsSpark - Deflater: IntelDeflater
19:43:09.993 INFO  PrintVariantsSpark - Inflater: IntelInflater
19:43:09.993 INFO  PrintVariantsSpark - GCS max retries/reopens: 20
19:43:09.993 INFO  PrintVariantsSpark - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
19:43:09.993 INFO  PrintVariantsSpark - Initializing engine
19:43:09.993 INFO  PrintVariantsSpark - Done initializing engine
17/11/15 19:43:11 INFO org.spark_project.jetty.util.log: Logging initialized @4976ms
17/11/15 19:43:11 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
17/11/15 19:43:11 INFO org.spark_project.jetty.server.Server: Started @5092ms
17/11/15 19:43:11 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@5917b44d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
17/11/15 19:43:12 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/11/15 19:43:13 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at gatk-test-8875b999-b609-4a3f-86ea-973b929fe662-m/10.240.0.18:8032
17/11/15 19:43:17 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1510774921124_0001
17/11/15 19:43:28 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
17/11/15 19:43:35 ERROR org.apache.spark.scheduler.TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
Serialization trace:
genotypes (org.seqdoop.hadoop_bam.VariantContextWithHeader)
interval (org.broadinstitute.hellbender.engine.spark.SparkSharder$PartitionLocatable)
    at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:65)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:396)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:307)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:330)
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: htsjdk.variant.variantcontext.LazyGenotypesContext
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:677)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:63)
    ... 20 more
17/11/15 19:43:35 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@5917b44d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
17/11/15 19:43:35 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
19:43:35.858 INFO  PrintVariantsSpark - Shutting down engine
[November 15, 2017 7:43:35 PM UTC] org.broadinstitute.hellbender.tools.spark.pipelines.PrintVariantsSpark done. Elapsed time: 0.43 minutes.
Runtime.totalMemory()=823132160
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
Serialization trace:
genotypes (org.seqdoop.hadoop_bam.VariantContextWithHeader)
interval (org.broadinstitute.hellbender.engine.spark.SparkSharder$PartitionLocatable)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
    at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.computePartitionReadExtents(SparkSharder.java:274)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.joinOverlapping(SparkSharder.java:163)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.joinOverlapping(SparkSharder.java:128)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.shard(SparkSharder.java:101)
    at org.broadinstitute.hellbender.engine.spark.VariantWalkerSpark.getVariants(VariantWalkerSpark.java:129)
    at org.broadinstitute.hellbender.engine.spark.VariantWalkerSpark.runTool(VariantWalkerSpark.java:160)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:362)
    at org.broadinstitute.hellbender.engine.spark.VariantWalkerSpark.runPipeline(VariantWalkerSpark.java:57)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
    at org.broadinstitute.hellbender.Main.main(Main.java:239)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [dfac787d-19aa-4296-8078-c033cd9f440d] entered state [ERROR] while waiting for [DONE].
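
One plausible reading of this trace: Kryo has fallen back to its JavaSerializer for the genotypes field, and java.io.ObjectInputStream resolves classes through the default caller classloader. With spark.driver.userClassPathFirst=true, htsjdk is loaded only by Spark's child-first user classloader, so the lookup for LazyGenotypesContext fails with the ClassNotFoundException above. A minimal sketch of a classloader-aware stream that would avoid that lookup problem (the class name is ours, purely illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectStreamClass;

    // Hypothetical sketch (not GATK code): an ObjectInputStream that resolves
    // classes against the thread context classloader. Under
    // spark.driver.userClassPathFirst=true that loader is the child-first one
    // that can actually see htsjdk.variant.variantcontext.LazyGenotypesContext,
    // whereas the default lookup (ObjectInputStream.resolveClass in the trace)
    // goes through the system AppClassLoader and fails.
    class ContextClassLoaderObjectInputStream extends ObjectInputStream {
        ContextClassLoaderObjectInputStream(InputStream in) throws IOException {
            super(in);
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            try {
                // Prefer the context classloader, which sees the user classpath first.
                return Class.forName(desc.getName(), false,
                        Thread.currentThread().getContextClassLoader());
            } catch (ClassNotFoundException e) {
                // Fall back to the default lookup for JDK and Spark classes.
                return super.resolveClass(desc);
            }
        }
    }
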
droazen commented 6 years ago

@tomwhite When you get a chance, could you have a look at this?

droazen commented 6 years ago

Note that there is a disabled test in DataprocIntegrationTest that should be re-enabled once this is fixed. (The test was introduced in https://github.com/broadinstitute/gatk/pull/3767).

droazen commented 5 years ago

@tomwhite Is this still an issue? If not, can we re-enable the disabled test in DataprocIntegrationTest and close this?

tomwhite commented 5 years ago

Running:
    /home/travis/gcloud/google-cloud-sdk/bin/gcloud dataproc jobs submit spark --cluster gatk-test-2495f43b-04fc-49e7-aa0a-7108cc876246 --properties spark.driver.userClassPathFirst=false,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600 --jar gs://hellbender-test-logs/staging/gatk-package-4.1.0.0-24-g18a95c7-SNAPSHOT-spark_3e9078b7e67707952fa12a0c5c4d2b71.jar -- PrintVariantsSpark --V gs://hellbender/test/resources/large/gvcfs/gatk3.7_30_ga4f720357.24_sample.21.expected.vcf --output gs://hellbender-test-logs/staging/12dc38b0-0b40-49d5-a98e-fe83ca658003.vcf --spark-master yarn
Job [654b5b8e01de4c60bd87d941d4ec8831] submitted.
Waiting for job output...
19/02/18 16:58:03 WARN org.apache.spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
16:58:09.526 WARN  SparkContextFactory - Environment variables HELLBENDER_TEST_PROJECT and HELLBENDER_JSON_SERVICE_ACCOUNT_KEY must be set or the GCS hadoop connector will not be configured properly
16:58:09.705 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/tmp/654b5b8e01de4c60bd87d941d4ec8831/gatk-package-4.1.0.0-24-g18a95c7-SNAPSHOT-spark_3e9078b7e67707952fa12a0c5c4d2b71.jar!/com/intel/gkl/native/libgkl_compression.so
16:58:10.112 INFO  PrintVariantsSpark - ------------------------------------------------------------
16:58:10.113 INFO  PrintVariantsSpark - The Genome Analysis Toolkit (GATK) v4.1.0.0-24-g18a95c7-SNAPSHOT
16:58:10.113 INFO  PrintVariantsSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
16:58:10.113 INFO  PrintVariantsSpark - Executing as root@gatk-test-2495f43b-04fc-49e7-aa0a-7108cc876246-m on Linux v4.9.0-8-amd64 amd64
16:58:10.114 INFO  PrintVariantsSpark - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_181-8u181-b13-2~deb9u1-b13
16:58:10.114 INFO  PrintVariantsSpark - Start Date/Time: February 18, 2019 4:58:09 PM UTC
16:58:10.114 INFO  PrintVariantsSpark - ------------------------------------------------------------
16:58:10.114 INFO  PrintVariantsSpark - ------------------------------------------------------------
16:58:10.115 INFO  PrintVariantsSpark - HTSJDK Version: 2.18.2
16:58:10.115 INFO  PrintVariantsSpark - Picard Version: 2.18.25
16:58:10.115 INFO  PrintVariantsSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:58:10.116 INFO  PrintVariantsSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:58:10.116 INFO  PrintVariantsSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
16:58:10.116 INFO  PrintVariantsSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:58:10.116 INFO  PrintVariantsSpark - Deflater: IntelDeflater
16:58:10.116 INFO  PrintVariantsSpark - Inflater: IntelInflater
16:58:10.116 INFO  PrintVariantsSpark - GCS max retries/reopens: 20
16:58:10.116 INFO  PrintVariantsSpark - Requester pays: disabled
16:58:10.116 WARN  PrintVariantsSpark - 

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

   Warning: PrintVariantsSpark is a BETA tool and is not yet ready for use in production

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

16:58:10.116 INFO  PrintVariantsSpark - Initializing engine
16:58:10.116 INFO  PrintVariantsSpark - Done initializing engine
19/02/18 16:58:10 WARN org.apache.spark.SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
19/02/18 16:58:10 INFO org.spark_project.jetty.util.log: Logging initialized @8431ms
19/02/18 16:58:11 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
19/02/18 16:58:11 INFO org.spark_project.jetty.server.Server: Started @8536ms
19/02/18 16:58:11 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@45c90a05{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/02/18 16:58:11 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
19/02/18 16:58:12 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at gatk-test-2495f43b-04fc-49e7-aa0a-7108cc876246-m/10.240.0.11:8032
19/02/18 16:58:13 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at gatk-test-2495f43b-04fc-49e7-aa0a-7108cc876246-m/10.240.0.11:10200
19/02/18 16:58:15 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1550508751046_0004
WARNING 2019-02-18 16:58:23 AsciiLineReader Creating an indexable source for an AsciiFeatureCodec using a stream that is neither a PositionalBufferedStream nor a BlockCompressedInputStream
WARNING 2019-02-18 16:58:23 AsciiLineReader Creating an indexable source for an AsciiFeatureCodec using a stream that is neither a PositionalBufferedStream nor a BlockCompressedInputStream
19/02/18 16:58:25 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input files to process : 1
19/02/18 16:58:29 ERROR org.apache.spark.scheduler.TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
genotypes (htsjdk.variant.variantcontext.VariantContext)
interval (org.broadinstitute.hellbender.engine.spark.SparkSharder$PartitionLocatable)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:396)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:307)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
    at htsjdk.variant.variantcontext.LazyGenotypesContext.invalidateSampleOrdering(LazyGenotypesContext.java:205)
    at htsjdk.variant.variantcontext.GenotypesContext.add(GenotypesContext.java:353)
    at htsjdk.variant.variantcontext.GenotypesContext.add(GenotypesContext.java:46)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    ... 18 more
19/02/18 16:58:29 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@45c90a05{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
16:58:29.970 INFO  PrintVariantsSpark - Shutting down engine
[February 18, 2019 4:58:29 PM UTC] org.broadinstitute.hellbender.tools.spark.pipelines.PrintVariantsSpark done. Elapsed time: 0.34 minutes.
Runtime.totalMemory()=1106771968
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
genotypes (htsjdk.variant.variantcontext.VariantContext)
interval (org.broadinstitute.hellbender.engine.spark.SparkSharder$PartitionLocatable)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
    at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.computePartitionReadExtents(SparkSharder.java:388)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.joinOverlapping(SparkSharder.java:212)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.joinOverlapping(SparkSharder.java:205)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.joinOverlapping(SparkSharder.java:153)
    at org.broadinstitute.hellbender.engine.spark.SparkSharder.shard(SparkSharder.java:108)
    at org.broadinstitute.hellbender.engine.spark.VariantWalkerSpark.getVariants(VariantWalkerSpark.java:127)
    at org.broadinstitute.hellbender.engine.spark.VariantWalkerSpark.runTool(VariantWalkerSpark.java:154)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:528)
    at org.broadinstitute.hellbender.engine.spark.VariantWalkerSpark.runPipeline(VariantWalkerSpark.java:56)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [654b5b8e01de4c60bd87d941d4ec8831] entered state [ERROR] while waiting for [DONE].
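
One plausible reading of this second failure (note that spark.driver.userClassPathFirst is now false, so the earlier ClassNotFoundException is gone): Kryo's CollectionSerializer rebuilds the genotypes container by calling GenotypesContext.add() element by element, which on a LazyGenotypesContext triggers invalidateSampleOrdering() and then decode(); FieldSerializer never restores the transient lazy-parser state, so decode() dereferences null. A toy demonstration of that general failure mode, with invented names (LazyBox is a stand-in, not an htsjdk class):

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    // Toy demonstration (invented stand-in, not htsjdk code) of the failure mode
    // in the trace above: Kryo's FieldSerializer skips transient fields, so an
    // object that decodes itself through a transient helper comes back with that
    // helper null.
    public class TransientFieldDemo {
        static class LazyBox {
            transient Runnable parser;      // stands in for LazyGenotypesContext's LazyParser
            String data = "unparsed";

            LazyBox(Runnable parser) { this.parser = parser; }
            LazyBox() {}                    // no-arg constructor used by Kryo; leaves parser null

            void decode() { parser.run(); } // NPEs after a round trip, like decode() in the trace
        }

        public static void main(String[] args) {
            Kryo kryo = new Kryo();
            kryo.register(LazyBox.class);

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Output out = new Output(bytes)) {
                kryo.writeObject(out, new LazyBox(() -> {}));
            }
            try (Input in = new Input(new ByteArrayInputStream(bytes.toByteArray()))) {
                LazyBox copy = kryo.readObject(in, LazyBox.class);
                copy.decode();              // throws NullPointerException: parser was transient
            }
        }
    }
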
tomwhite commented 5 years ago

I reproduced this on a cluster. I tried the following change, but then hit https://github.com/EsotericSoftware/kryo/issues/489:

diff --git a/src/main/java/org/broadinstitute/hellbender/engine/spark/GATKRegistrator.java b/src/main/java/org/broadinstitute/hellbender/engine/spark/GATKRegistrator.java
index bff76a367..cabdc7332 100644
--- a/src/main/java/org/broadinstitute/hellbender/engine/spark/GATKRegistrator.java
+++ b/src/main/java/org/broadinstitute/hellbender/engine/spark/GATKRegistrator.java
@@ -2,10 +2,12 @@ package org.broadinstitute.hellbender.engine.spark;

 import com.esotericsoftware.kryo.Kryo;
 import com.esotericsoftware.kryo.serializers.FieldSerializer;
+import com.esotericsoftware.kryo.serializers.JavaSerializer;
 import com.google.common.collect.ImmutableMap;
 import de.javakaffee.kryoserializers.UnmodifiableCollectionsSerializer;
 import de.javakaffee.kryoserializers.guava.ImmutableMapSerializer;
 import htsjdk.samtools.*;
+import htsjdk.variant.variantcontext.LazyGenotypesContext;
 import org.apache.spark.serializer.KryoRegistrator;
 import org.bdgenomics.adam.serialization.ADAMKryoRegistrator;
 import org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSparkUtils;
@@ -92,5 +94,8 @@ public class GATKRegistrator implements KryoRegistrator {
         kryo.register(MarkDuplicatesSparkUtils.IndexPair.class, new FieldSerializer(kryo, MarkDuplicatesSparkUtils.IndexPair.class));
         kryo.register(ReadsKey.class, new FieldSerializer(kryo, ReadsKey.class));
         kryo.register(ReadsKey.KeyForFragment.class, new FieldSerializer(kryo, ReadsKey.KeyForFragment.class));
-        kryo.register(ReadsKey.KeyForPair.class, new FieldSerializer(kryo, ReadsKey.KeyForPair.class)); }
+        kryo.register(ReadsKey.KeyForPair.class, new FieldSerializer(kryo, ReadsKey.KeyForPair.class));
+        // Use JavaSerializer for LazyGenotypesContext to handle transient fields correctly
+        kryo.register(LazyGenotypesContext.class, new JavaSerializer());
+    }
 }

The workaround suggested in the Kryo issue referenced above, creating a local copy of JavaSerializer, works.
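
For reference, a minimal sketch of what that local copy might look like (the class name is ours and the actual patch may differ); it is a simplified JavaSerializer whose only behavioral change is resolving classes via the thread context classloader:

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.KryoException;
    import com.esotericsoftware.kryo.Serializer;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.ObjectStreamClass;

    // Sketch of the kryo#489 workaround (illustrative, not the exact GATK patch):
    // a JavaSerializer clone that resolves classes via the thread context
    // classloader so it also works under spark.driver.userClassPathFirst=true.
    public class ClassLoaderAwareJavaSerializer extends Serializer<Object> {
        @Override
        public void write(Kryo kryo, Output output, Object object) {
            try {
                ObjectOutputStream objectStream = new ObjectOutputStream(output);
                objectStream.writeObject(object);
                objectStream.flush();
            } catch (Exception e) {
                throw new KryoException("Error during Java serialization.", e);
            }
        }

        @Override
        public Object read(Kryo kryo, Input input, Class<Object> type) {
            try {
                ObjectInputStream objectStream = new ObjectInputStream(input) {
                    @Override
                    protected Class<?> resolveClass(ObjectStreamClass desc)
                            throws IOException, ClassNotFoundException {
                        // The one behavioral change: look classes up on the
                        // user classpath (context classloader) first.
                        return Class.forName(desc.getName(), false,
                                Thread.currentThread().getContextClassLoader());
                    }
                };
                return objectStream.readObject();
            } catch (Exception e) {
                throw new KryoException("Error during Java deserialization.", e);
            }
        }
    }

It would be registered just as in the diff above, e.g. kryo.register(LazyGenotypesContext.class, new ClassLoaderAwareJavaSerializer());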

It also looks like https://github.com/EsotericSoftware/kryo/commit/19a6b5edee7125fbaf54c64084a8d0e13509920b may fix the problem, but I haven't tried it, and it isn't available until Spark 2.4.0 (I think).