bigdatagenomics / eggo

Ready-to-go Parquet-formatted public 'omics datasets
Apache License 2.0
30 stars 8 forks source link

results of "genotype count" are different - depending on used programs #140

Open jerryivanhoe opened 8 years ago

jerryivanhoe commented 8 years ago

Hello,

we are using Adam to count the total number of Genotypes for the input file “ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf” from 1000 Genome project stored on s3

With Adam we run these 2 steps:

Step 1: convert to Adam with vcf2adam :

adam-submit --master yarn-client --driver-memory 8g \ --num-executors $TOTAL_EXECUTORS \ --executor-cores $CORES_PER_EXECUTOR \ --executor-memory $MEMORY_PER_EXECUTOR \ -- \ vcf2adam \ -parquet_compression_codec SNAPPY \ hdfs:///user/ec2-user/1kg/chr1.vcf \

hdfs:///user/ec2-user/1kg/chr1.adam

then we start Step2: Adam-Shell

adam-shell --master yarn-client --driver-memory 8g \ --num-executors $TOTAL_EXECUTORS \ --executor-cores $CORES_PER_EXECUTOR \ --executor-memory $MEMORY_PEREXECUTOR scala> import org.bdgenomics.adam.rdd.ADAMContext scala> import org.bdgenomics.formats.avro. scala> val ac = new ADAMContext(sc) scala> val genotypes = ac.loadGenotypes("/user/ec2-user/1kg/chr1.adam") scala> genotypes.count

the result : 16277357168

As a comparision we are also using the HTSJDK with this script:

public static void main(String[] args) throws InterruptedException,
                    SQLException, IOException {

             int position;
             String uuid, chromosome;
             VariantContext ctx;
             //HashMap<String,Boolean> gc = new HashMap<String, Boolean>();
             long count = 0;

             final VCFCodec codec = new VCFCodec();
             final FeatureReader<VariantContext> vcfr = AbstractFeatureReader
                           .getFeatureReader("c:/tmp/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", codec, false);

             final CloseableTribbleIterator<VariantContext> i = vcfr.iterator();

             while (i.hasNext()) {
                    ctx = i.next();

                    GenotypesContext g = ctx.getGenotypes();

                    count += g.size();
             }

       }

And this shows a different result:

16196107376

Any idea ? Are we using the wrong commands ?

greetings -Jerry

laserson commented 8 years ago

I would guess this has to do with how ADAM handles multiallelic variants. You'll notice that the ADAM Variant schema only supports a single alternate allele:

https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L430-L438

If we encounter variants with multiple alternates, we emit a Variant object for each one. Is this right, @fnothaft?

fnothaft commented 8 years ago

@laserson that is correct!