Open jerryivanhoe opened 8 years ago
I would guess this has to do with how ADAM handles multiallelic variants. You'll notice that the ADAM Variant
schema only supports a single alternate allele:
If we encounter variants with multiple alternates, we emit a Variant
object for each one. Is this right, @fnothaft?
@laserson that is correct!
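To make the effect of that splitting concrete, here is a minimal sketch on a toy VCF. The data, the temp file, and the variable names are all hypothetical. It counts genotypes two ways: one per sample per record, and one per sample per alternate allele, which is what emitting a Variant per alternate amounts to. It assumes the HTSJDK script counts one genotype per sample per record, which cannot be confirmed because the script is not shown in this thread.

```shell
#!/bin/sh
# Toy VCF (hypothetical data): 3 samples, 3 records, the second one multiallelic.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf '1\t100\t.\tA\tC\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1\n'
  printf '1\t200\t.\tA\tG,T\t.\tPASS\t.\tGT\t0|1\t1|2\t0|2\n'
  printf '1\t300\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|0\t0|1\n'
} > "$vcf"

# One genotype per sample per record: sample columns start after the 9 fixed columns.
per_record=$(awk -F'\t' '!/^#/ { n += NF - 9 } END { print n + 0 }' "$vcf")

# One genotype per sample per alternate allele: split() returns the number of
# comma-separated entries in the ALT column (column 5).
per_alt=$(awk -F'\t' '!/^#/ { n += (NF - 9) * split($5, alts, ",") } END { print n + 0 }' "$vcf")

echo "per record:           $per_record"
echo "per alternate allele: $per_alt"
echo "difference:           $((per_alt - per_record))"
rm -f "$vcf"
```

On the toy file this prints 9, 12 and 3: the single extra alternate allele adds one genotype per sample. The two counts reported in this thread fit the same pattern. Their difference, 16277357168 - 16196107376 = 81249792, is exactly 2504 x 32448, where 2504 is the number of samples in the 1000 Genomes phase 3 release. That would correspond to 32448 additional alternate alleles at multiallelic sites on chr1. The two awk commands can be pointed at the real (uncompressed) VCF to check this.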
Hello,
we are using ADAM to count the total number of genotypes in the input file "ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf" from the 1000 Genomes Project, stored on S3.
With ADAM we run these two steps:
Step 1: convert to ADAM format with vcf2adam:
adam-submit --master yarn-client --driver-memory 8g \
  --num-executors $TOTAL_EXECUTORS \
  --executor-cores $CORES_PER_EXECUTOR \
  --executor-memory $MEMORY_PER_EXECUTOR \
  -- \
  vcf2adam \
  -parquet_compression_codec SNAPPY \
  hdfs:///user/ec2-user/1kg/chr1.vcf \
  hdfs:///user/ec2-user/1kg/chr1.adam
Step 2: start adam-shell and count the genotypes:
adam-shell --master yarn-client --driver-memory 8g \
  --num-executors $TOTAL_EXECUTORS \
  --executor-cores $CORES_PER_EXECUTOR \
  --executor-memory $MEMORY_PER_EXECUTOR

scala> import org.bdgenomics.adam.rdd.ADAMContext
scala> import org.bdgenomics.formats.avro._
scala> val ac = new ADAMContext(sc)
scala> val genotypes = ac.loadGenotypes("/user/ec2-user/1kg/chr1.adam")
scala> genotypes.count
The result: 16277357168
As a comparison, we also counted the genotypes with HTSJDK, using this script:
And this shows a different result:
16196107376
Any ideas? Are we using the wrong commands?
greetings -Jerry