bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

adam2vcf -sort_on_save flag broken #940

Closed andrewmchen closed 8 years ago

andrewmchen commented 8 years ago

Hi all. I tried to run adam2vcf with the -sort_on_save flag and got this error:

16/02/11 20:28:01 WARN TaskSetManager: Lost task 10.0 in stage 9.0 (TID 202, amp-bdg-57.amp): com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
attributes (htsjdk.variant.variantcontext.CommonInfo)
commonInfo (htsjdk.variant.variantcontext.VariantContext)
vc (org.seqdoop.hadoop_bam.VariantContextWritable)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:102)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at scala.collection.convert.Wrappers$MutableMapWrapper.put(Wrappers.scala:217)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    ... 31 more

adam2vcf worked without the flag, so I suspect the failure only occurs when sorting. I attached the full log as well: log.txt
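[Editor's note] One plausible mechanism for the NullPointerException in the `Caused by` frames: Kryo's MapSerializer rebuilds a map field by calling put() on it, and Kryo can instantiate objects without running their normal constructors, so a map field that the constructor would have initialized can come back null. A minimal sketch of that failure mode (class names are hypothetical; Kryo itself is not involved):

```java
import java.util.Map;

// Hypothetical stand-in for htsjdk's CommonInfo: an object whose map
// field is only initialized by its constructor. If deserialization
// bypasses the constructor, the field stays null.
class CommonInfoLike {
    Map<String, Object> attributes; // never initialized -> null
}

public class NullMapDemo {
    public static void main(String[] args) {
        CommonInfoLike info = new CommonInfoLike();
        try {
            // Mirrors a map serializer repopulating the target map:
            // put() on a null map field throws NullPointerException,
            // which then surfaces wrapped in a KryoException.
            info.attributes.put("DP", 30);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException: attributes map was null");
        }
    }
}
```

Registering a serializer that reconstructs such objects through a proper constructor (rather than field-by-field) avoids this class of failure.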

fnothaft commented 8 years ago

I think I know a fix (and the fix should be related to #933). Do you have a VCF on the cluster that reproduces this?

andrewmchen commented 8 years ago

Yup, it's in HDFS in my home directory. It's called hdfs:///user/amchen/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam.unified.raw.SNP.gatk.vcf.adam.filtered

andrewmchen commented 8 years ago

I pulled down your change in #933 and it still doesn't seem to work, at least for this ADAM file. To reproduce, you can just run bash /home/eecs/amchen/scripts/adamToVCF.sh

Here's the log: log2.txt

fnothaft commented 8 years ago

@massie is looking at this

massie commented 8 years ago

@andrewmchen I'm looking at this now. Thanks for sending the script and files for your job.

When running the script, I get the following error from Parquet:

Feb 16, 2016 4:00:58 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)

Looking at the metadata for the file, I see

creator:                   parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) 

which corresponds to Parquet 1.7.0.

Whereas the other ADAM file in the directory (minus the "filtered" suffix) has the following creator:

creator:                   parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf) 

We switched from Parquet 1.7.0 to 1.8.1 in July of last year.

How hard would it be to regenerate that adam file using a newer version of ADAM? It might be worth a try as I debug the root cause of the exception.
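[Editor's note] On the PARQUET-251 warning above: readers discard statistics when they cannot parse a version out of the file's created_by string, and the 1.7.0-era string here ("parquet-mr (build <hash>)") carries no version token at all, while the 1.8.1 string does. A simplified illustration of such a check (this regex is an assumption for illustration, not parquet-mr's actual parser):

```java
import java.util.regex.Pattern;

public class CreatedByCheck {
    // Simplified created_by pattern: "<app> version <ver> (build <hash>)".
    // Strings written without a version token fail to match, so a reader
    // taking this branch would ignore the file's column statistics.
    static final Pattern CREATED_BY =
        Pattern.compile("(.+) version (.+?)\\s*\\(build (.+)\\)");

    static boolean parses(String createdBy) {
        return CREATED_BY.matcher(createdBy).matches();
    }

    public static void main(String[] args) {
        // No version token (Parquet 1.7.0 style) -> unparseable
        System.out.println(parses(
            "parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)"));
        // Version token present (Parquet 1.8.1 style) -> parseable
        System.out.println(parses(
            "parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)"));
    }
}
```

This is why the warning is cosmetic for correctness but means the reader falls back to scanning data rather than trusting the stored statistics.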

massie commented 8 years ago

@andrewmchen I just checked the Avro and Parquet schemas and they are identical, so there's likely little use in recreating that file (unless it's trivial to do).

andrewmchen commented 8 years ago

By "the file," do you mean .filtered? I can recreate it without any hassle; I'll do it when I get a chance.

It seems very peculiar that they'd have different Parquet versions, because I built the .filtered file about a month ago. Could it be because Avocado is on a different version of ADAM/Parquet?

massie commented 8 years ago

Sorry. I can see why that wasn't clear. Yes, the "*.filtered" file was created using Parquet 1.7.x.

That's odd. As long as Avocado is using ADAM version 0.17.1 or newer, it should be writing Parquet 1.8.x files. Avocado started using ADAM 0.17.1 in August of last year. As long as you have a relatively recent version of Avocado, you should be fine.

massie commented 8 years ago

@andrewmchen Can you verify the version of Avocado that you're using? If it's less than six months old, it shouldn't be saving in Parquet 1.7.x format, as far as I can tell from looking at the pom files.

andrewmchen commented 8 years ago

That makes a ton of sense. I should probably rebase my Avocado. The commit hash I branched off of was 2e6504f01004cd13c22f36198e6aea490bb94130.

massie commented 8 years ago

@andrewmchen I just submitted a pull request, #949, that fixes this issue. When you have a moment, can you verify that it fixes your problem? I've run your test case, but it's always good to have more than one set of eyes.

andrewmchen commented 8 years ago

Sure. I'll do it later tonight. Thanks for resolving this issue so quickly!

andrewmchen commented 8 years ago

This seems to have solved it. Just curious, how did this line work in the past anyway? https://github.com/bigdatagenomics/adam/pull/949/files#diff-514d6d86034c4dd8aa9ee737c8637a7eL130

heuermh commented 8 years ago

Fixed by commit 0975e303f91ac9590d84711189438f84d1966348