bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Issue with transformVariant // Adam to vcf #1782

Closed Rokshan2016 closed 6 years ago

Rokshan2016 commented 6 years ago

Hi, I am trying to convert an ADAM file to VCF, but I am getting this error. Is there any other way I can convert the .adam file to a .vcf file?

Command:

./adam-submit transformVariants hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam/ hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/100G_omni1.vcf -coalesce 1

or

./adam-submit transformVariants hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam/part-r-00000.gz.parquet hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/100G_omni1.vcf -coalesce 1

Error:

java.io.FileNotFoundException: Couldn't find any files matching hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam/part-r-00000.gz.parquet 17/10/24 19:06:48 INFO cli.TransformVariants: Overall Duration: 10.08 secs Exception in thread "main" java.io.FileNotFoundException: Couldn't find any files matching hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam/part-r-00000.gz.parquet at org.bdgenomics.adam.rdd.ADAMContext.getFsAndFilesWithFilter(ADAMContext.scala:1354) at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequenceDictionary(ADAMContext.scala:1164) at org.bdgenomics.adam.rdd.ADAMContext.loadParquetVariants(ADAMContext.scala:2124) at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:2779) at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:2774) at scala.Option.fold(Option.scala:157) at org.apache.spark.rdd.Timer.time(Timer.scala:48) at org.bdgenomics.adam.rdd.ADAMContext.loadVariants(ADAMContext.scala:2772) at org.bdgenomics.adam.cli.TransformVariants.run(TransformVariants.scala:120) at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55) at org.bdgenomics.adam.cli.TransformVariants.run(TransformVariants.scala:74) at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:126) at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:65) at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 17/10/24 19:06:48 INFO spark.SparkContext: Invoking stop() from shutdown hook 17/10/24 19:06:48 INFO ui.SparkUI: Stopped Spark web UI at http://10.48.3.64:4040 17/10/24 19:06:48 INFO cluster.Ya

heuermh commented 6 years ago

What does hadoop fs -ls hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam show?
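For reference, a variants directory written by transformVariants normally contains Parquet part files plus metadata sidecar files, along these lines (an illustrative listing, not your actual output):

$ hadoop fs -ls hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam
.../1000G_omni.adam/_SUCCESS
.../1000G_omni.adam/_common_metadata
.../1000G_omni.adam/_header
.../1000G_omni.adam/_metadata
.../1000G_omni.adam/_seqdict.avro
.../1000G_omni.adam/part-r-00000.gz.parquet

If the part-r-*.gz.parquet files are missing from the directory, a FileNotFoundException like the one above would be expected.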

Rokshan2016 commented 6 years ago

First, I converted 100G_omni1.vcf to 1000G_omni.adam, and that works fine. But when I try to convert 1000G_omni.adam back to VCF, it gives an error.

1000G_omni.adam contains the following VCF header (the angle-bracketed contents of the header lines were stripped in this paste; only the line types, their counts, and one free-text line survive):

fileformat=VCFv4.2

[8 FILTER= lines and 14 FORMAT= lines, contents stripped]

FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/humgen/1kg/reference/human_g1k_v37.fasta rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false disable_experimental_low_memory_sharding=false logging_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=./0.451323408008651.sorted.vcf) out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub filter_mismatching_base_and_quals=false"

[17 INFO= lines and 13 contig= lines, contents stripped]

fnothaft commented 6 years ago

Hi @Rokshan2016

What happens if you run the command:

./adam-submit transformVariants hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/100G_omni1.vcf -coalesce 1

Specifically, this changes the input file name from hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam/part-r-00000.gz.parquet to hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam.

Rokshan2016 commented 6 years ago

Yes, I tried that as well. Same error.

Rokshan2016 commented 6 years ago

I tried with the latest ADAM version. Command: ./adam-submit --driver-memory 3g --executor-memory 3g -- adam2vcf hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974A.adam hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974P.vcf

Error : : 22365 length: 22365 hosts: []} SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 17/10/25 12:43:11 INFO ZlibFactory: Successfully loaded & initialized native-zlib library 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO Executor: Finished task 13.0 in stage 0.0 (TID 13). 1264 bytes result sent to driver Oct 25, 2017 12:43:10 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 200 Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance 
of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 0 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. 
Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 36 ms. row count = 2 Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 34 ms. row count = 3 Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in17/10/25 12:43:11 INFO TaskSetManager: Starting task 16.0 in stage 0.0 (TID 16, localhost, executor driver, partition 16, ANY, 6157 bytes) 17/10/25 12:43:11 INFO Executor: Running task 16.0 in stage 0.0 (TID 16) 17/10/25 12:43:11 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974A.adam/part-r-00016.gz.parquet start: 0 end: 22365 length: 22365 hosts: []} 17/10/25 12:43:11 INFO Executor: Finished task 16.0 in stage 0.0 (TID 16). 
1191 bytes result sent to driver 17/10/25 12:43:11 INFO TaskSetManager: Starting task 17.0 in stage 0.0 (TID 17, localhost, executor driver, partition 17, ANY, 6158 bytes) 17/10/25 12:43:11 INFO Executor: Running task 17.0 in stage 0.0 (TID 17) 17/10/25 12:43:11 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974A.adam/part-r-00017.gz.parquet start: 0 end: 32533 length: 32533 hosts: []} 17/10/25 12:43:11 INFO TaskSetManager: Finished task 13.0 in stage 0.0 (TID 13) in 671 ms on localhost (executor driver) (1/200) 17/10/25 12:43:11 INFO TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 52 ms on localhost (executor driver) (2/200) 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:12 ERROR Executor: Exception in task 10.0 in stage 0.0 (TID 10) java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Float at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:80) at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68) at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:52) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:41) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:207) at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/10/25 12:43:12 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8) java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Float at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:80) at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68) at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:52) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:41) at 
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:207) at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/10/25 12:43:12 INFO TaskSetManager: Starting task 18.0 in stage 0.0 (TID 18, localhost, executor driver, partition 18, ANY, 6158 bytes) 17/10/25 12:43:12 INFO Executor: Running task 18.0 in stage 0.0 (TID 18) 17/10/25 12:43:12 INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID 19, localhost, executor driver, partition 19, ANY, 6156 bytes) 17/10/25 12:43:12 INFO Executor: Running task 19.0 in stage 0.0 (TID 19) 17/10/25 12:43:12 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, localhost, executor driver): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Float at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:80) at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68) at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:52) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:41) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:207) at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at

Rokshan2016 commented 6 years ago

And this time it just printed the header, like this (head of SRR1517974P.vcf):

fileformat=VCFv4.2

[15 FILTER= lines, 13 FORMAT= lines, 14 INFO= lines, and 24 contig= lines, contents stripped in this paste]

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT $SRR1517974


Rokshan2016 commented 6 years ago

SRR1517974.fastq -> bwa alignment -> SRR1517974.sam -> sort + mark duplicates + base recalibration -> SRR1517974.adam -> variant calling with Avocado -> SRR1517974A.adam

heuermh commented 6 years ago

Note -coalesce 1 is not the same as -single. When transforming from Variants in Parquet+Avro format to VCF, if you want the VCF in a single file, you need the -single argument.

$ adam-submit transformVariants in.vcf in.variants.adam

$ ls -ls in.variants.adam/
total 152
 0 -rw-r--r--  1       0 Oct 25 11:49 _SUCCESS
32 -rw-r--r--  1   13652 Oct 25 11:49 _common_metadata
24 -rw-r--r--  1    9419 Oct 25 11:49 _header
40 -rw-r--r--  1   18408 Oct 25 11:49 _metadata
 8 -rw-r--r--  1    1398 Oct 25 11:49 _seqdict.avro
48 -rw-r--r--  1   20904 Oct 25 11:49 part-r-00000.gz.parquet

All the output partitions are merged into a single VCF file

$ adam-submit transformVariants -single in.variants.adam out.vcf 

$ ls -ls out.vcf 
24 -rw-r--r--  1   10552 Oct 25 11:51 out.vcf

If you leave out -single, then you get a directory of partitions of the VCF file

$ adam-submit transformVariants in.variants.adam out.vcf 

$ ls -ls out.vcf 
total 24
 0 -rw-r--r--  1 heuermh  staff      0 Oct 25 11:52 _SUCCESS
24 -rw-r--r--  1 heuermh  staff  10552 Oct 25 11:52 part-r-00000

(In this small example, there is only enough data to fill one partition. A larger data set would show multiple part-r-* files.)
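Applied to your HDFS paths, that would be something along these lines (untested, using your input directory and output path from above):

$ ./adam-submit transformVariants -single hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/1000G_omni.adam hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/100G_omni1.vcf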

Rokshan2016 commented 6 years ago

OK, got that. I tried with the -single option, but it gives the same error. Any suggestion for the latest ADAM version? I am using Avocado on Spark 2, which is why I am using the latest ADAM version and trying the adam2vcf command.

Rokshan2016 commented 6 years ago

Is there any option in Avocado that I can use to get the output in one single file? I am getting 199 Parquet files.

Rokshan2016 commented 6 years ago

Hi @

I tried this command: ./adam-submit transformVariants -single hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/SRR1518011.adam hdfs://ip-10-48-3-5.ips.local:8020/user/rokshan.jahan/data/SRR1518011.vcf

Now I am getting this issue:

17/10/25 13:21:41 INFO scheduler.TaskSetManager: Starting task 14.0 in stage 0.0 (TID 14, ip-10-48-3-65.ips.local, executor 4, partition 14, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-48-3-65.ips.local, executor 4): org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'variant' not found. at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:128) at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:89) at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:64) at org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34) at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:138) at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:201) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:168) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:133) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

17/10/25 13:21:41 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 15, ip-10-48-3-65.ips.local, executor 4, partition 0, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:41 INFO scheduler.TaskSetManager: Lost task 14.0 in stage 0.0 (TID 14) on ip-10-48-3-65.ips.local, executor 4: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 1] 17/10/25 13:21:41 INFO scheduler.TaskSetManager: Starting task 14.1 in stage 0.0 (TID 16, ip-10-48-3-65.ips.local, executor 4, partition 14, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:41 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 15) on ip-10-48-3-65.ips.local, executor 4: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 2] 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 0.0 (TID 17, ip-10-48-3-65.ips.local, executor 8, partition 0, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on ip-10-48-3-65.ips.local, executor 8: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 3] 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 0.0 (TID 18, ip-10-48-3-65.ips.local, executor 4, partition 1, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Lost task 14.1 in stage 0.0 (TID 16) on ip-10-48-3-65.ips.local, executor 4: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 4] 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Starting task 14.2 in stage 0.0 (TID 19, ip-10-48-3-65.ips.local, executor 4, partition 14, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Lost task 1.1 in stage 0.0 (TID 18) on ip-10-48-3-65.ips.local, executor 4: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 5] 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Starting task 1.2 in stage 0.0 (TID 20, ip-10-48-3-65.ips.local, executor 4, partition 1, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Lost task 14.2 in stage 0.0 (TID 19) on ip-10-48-3-65.ips.local, executor 4: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 6] 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Starting task 14.3 in stage 0.0 (TID 21, ip-10-48-3-65.ips.local, executor 8, partition 14, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 (TID 17) on ip-10-48-3-65.ips.local, executor 8: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 7] 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 0.0 (TID 22, ip-10-48-3-65.ips.local, executor 4, partition 0, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Lost task 1.2 in stage 0.0 (TID 20) on ip-10-48-3-65.ips.local, executor 4: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) 
[duplicate 8] 17/10/25 13:21:42 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-48-3-64.ips.local:38844 (size: 28.8 KB, free: 530.0 MB) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Starting task 1.3 in stage 0.0 (TID 23, ip-10-48-3-65.ips.local, executor 8, partition 1, NODE_LOCAL, 2310 bytes) 17/10/25 13:21:42 INFO scheduler.TaskSetManager: Lost task 14.3 in stage 0.0 (TID 21) on ip-10-48-3-65.ips.local, executor 8: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'variant' not found.) [duplicate 9] 17/10/25 13:21:42 ERROR scheduler.TaskSetManager: Task 14 in stage 0.0 failed 4 times; aborting job 17/10/25 13:21:42 INFO cluster.YarnScheduler: Cancelling stage 0 17/10/25 13:21:42 INFO cluster.YarnScheduler: Stage 0 was cancelled 17/10/

fnothaft commented 6 years ago

Hi @Rokshan2016! Do you know what versions of ADAM and Avocado you are running? It looks like you are running incompatible versions of the two tools. If you call either tool with --version, it will print the Git commit hashes your JAR was built from.

BTW, you can't use transformVariants with the output of Avocado; you need to use transformGenotypes. Additionally, https://github.com/bigdatagenomics/avocado/pull/266 added code that directly exports Avocado's genotype output to VCF.
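As a sketch, against the paths from your earlier command (the -single flag works the same way as for transformVariants if you want one merged VCF file):

$ ./adam-submit transformGenotypes -single hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974A.adam hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974P.vcf

(If transformGenotypes doesn't show up in adam-submit's command listing, your build is too old and you'll need a newer ADAM.)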

Rokshan2016 commented 6 years ago

Hi @fnothaft, I am using the following versions:

ADAM version: 0.22.0
Built for: Apache Spark 2.1.0, Scala 2.11.8, and Hadoop 2.7.3

Avocado:

Avocado version: 0.0.3-SNAPSHOT
Commit: ${git.commit.id}   Build: ${timestamp}
Built for: Scala 2.11.8 and Hadoop 2.6.0

** In the new ADAM version, I do not find a transformGenotypes option, but I found the following options:

ADAM ACTIONS
  countKmers : Counts the k-mers/q-mers from a read dataset.
  countContigKmers : Counts the k-mers/q-mers from a read dataset.
  transform : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
  transformFeatures : Convert a file with sequence features into corresponding ADAM format and vice versa
  mergeShards : Merges the shards of a file
  reads2coverage : Calculate the coverage from a given ADAM file

CONVERSION OPERATIONS
  vcf2adam : Convert a VCF file to the corresponding ADAM format
  adam2vcf : Convert an ADAM variant to the VCF ADAM format
  fasta2adam : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences.
  adam2fasta : Convert ADAM nucleotide contig fragments to FASTA files
  adam2fastq : Convert BAM to FASTQ files
  fragments2reads : Convert alignment records into fragment records.
  reads2fragments : Convert alignment records into fragment records.

PRINT
  print : Print an ADAM formatted file
  flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat)
  view : View certain reads from an alignment-record file.

I tried adam2vcf but it is giving this error:

I tried with the latest ADAM version. Command: ./adam-submit --driver-memory 3g --executor-memory 3g -- adam2vcf hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974A.adam hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974P.vcf

Error : : 22365 length: 22365 hosts: []} SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 17/10/25 12:43:11 INFO ZlibFactory: Successfully loaded & initialized native-zlib library 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:11 INFO Executor: Finished task 13.0 in stage 0.0 (TID 13). 1264 bytes result sent to driver Oct 25, 2017 12:43:10 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 200 Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance 
of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 0 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. 
Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 36 ms. row count = 2 Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 34 ms. row count = 3 Oct 25, 2017 12:43:11 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in17/10/25 12:43:11 INFO TaskSetManager: Starting task 16.0 in stage 0.0 (TID 16, localhost, executor driver, partition 16, ANY, 6157 bytes) 17/10/25 12:43:11 INFO Executor: Running task 16.0 in stage 0.0 (TID 16) 17/10/25 12:43:11 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974A.adam/part-r-00016.gz.parquet start: 0 end: 22365 length: 22365 hosts: []} 17/10/25 12:43:11 INFO Executor: Finished task 16.0 in stage 0.0 (TID 16). 
1191 bytes result sent to driver 17/10/25 12:43:11 INFO TaskSetManager: Starting task 17.0 in stage 0.0 (TID 17, localhost, executor driver, partition 17, ANY, 6158 bytes) 17/10/25 12:43:11 INFO Executor: Running task 17.0 in stage 0.0 (TID 17) 17/10/25 12:43:11 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: hdfs://ipsawdvpvfhnn03.ips.local:8020/user/rokshan.jahan/data/SRR1517974A.adam/part-r-00017.gz.parquet start: 0 end: 32533 length: 32533 hosts: []} 17/10/25 12:43:11 INFO TaskSetManager: Finished task 13.0 in stage 0.0 (TID 13) in 671 ms on localhost (executor driver) (1/200) 17/10/25 12:43:11 INFO TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 52 ms on localhost (executor driver) (2/200) 17/10/25 12:43:11 INFO CodecPool: Got brand-new decompressor [.gz] 17/10/25 12:43:12 ERROR Executor: Exception in task 10.0 in stage 0.0 (TID 10) java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Float at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:80) at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68) at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:52) at org.bdgenomics.adam.serialization.AvroSerializer.write(ADAMKryoRegistrator.scala:41) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:207) at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo

** I am hoping there can be some tool that converts the Avocado output (.adam) to a .vcf file.

Rokshan2016 commented 6 years ago

Hi, TransformGenotypes works fine.

heuermh commented 6 years ago

Thank you, @Rokshan2016