A use for a Bloom filter!? So exciting! I've been trying to find one.
For a quick analysis, I made serialized versions of dbSNP (13572728 variants from dbsnp_135.b37.excluding_sites_after_129.vcf); the VCF is 2175071049 bytes (2.0G) on disk.
(All false positive probs below are predicted, though it'd be easy to measure them too.) Map keys are contig names; a sketch of how such a map might be built follows the numbers below.
Map of String->BloomFilter with 0.001 false positive prob = 27320529 bytes (26M)
Map of String->BloomFilter with 0.0004 false positive prob = 30943001 bytes (30M)
Map of String->BloomFilter with 0.0001 false positive prob = 36423625 bytes (35M)
Map of String->BloomFilter with 0.00004 false positive prob = 40046089 bytes (38M)
Map of String->BloomFilter with 0.00001 false positive prob = 45526681 bytes (43M)
Map of String->BloomFilter with 0.000001 false positive prob = 54629745 bytes (52M)
Map of String->int[] of positions = 60790452 bytes (58M)
List<GATKVariant> made just like the one in Spark BQSR = 366463957 bytes (349M)
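For reference, here is a minimal sketch of how such a per-contig map could be built, assuming Guava's BloomFilter and that GATKVariant exposes getContig()/getStart(); EXPECTED_PER_CONTIG is a placeholder, and this is not necessarily what the branch linked below does:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one BloomFilter of variant start positions per contig.
static Map<String, BloomFilter<Integer>> buildFilters(Iterable<GATKVariant> variants) {
    final int EXPECTED_PER_CONTIG = 1_000_000; // placeholder sizing, would be tuned per contig
    final double FPP = 0.001;                  // one of the probs measured above
    Map<String, BloomFilter<Integer>> filters = new HashMap<>();
    for (GATKVariant v : variants) {
        // assuming getStart() returns an int position
        filters.computeIfAbsent(v.getContig(),
                c -> BloomFilter.create(Funnels.integerFunnel(), EXPECTED_PER_CONTIG, FPP))
               .put(v.getStart());
    }
    return filters;
}
```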
Variants from dbSNP cover 0.004 of the genome (15195436 bases of 3101804739), so if we want reasonable precision (the fraction of reported hits that are true variants), say 0.9 precision (of 10 hits only 1 can be false), we need a false positive prob of (1-0.9) x 0.004 = 0.0004. For 0.99 precision (of 100 hits only 1 can be false) we need a (1-0.99) x 0.004 = 0.00004 false positive prob. These are approximations of course.
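To spell out that arithmetic, here's a tiny sketch (the helper name is mine, not from the branch):

```java
// Given the fraction of genome bases covered by variants and a target
// precision (true hits / all reported hits), derive the Bloom filter
// false positive prob we can afford. Assumes lookups are spread roughly
// uniformly over the genome, so false hits per query ~= (1 - covered) * fpp ~= fpp.
static double affordableFpp(double coveredFraction, double targetPrecision) {
    // precision = covered / (covered + fpp)  =>  fpp = covered * (1 - precision) / precision
    return coveredFraction * (1 - targetPrecision) / targetPrecision;
}
// affordableFpp(0.004, 0.9)  ~= 0.00044 (the 0.0004 above drops the /0.9)
// affordableFpp(0.004, 0.99) ~= 0.00004
```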
Given these numbers, I conclude that, for now, exploring BloomFilters does not seem to make sense (too little saving and too many complications from using a probabilistic data structure - e.g. we'd need to use it for the walker BQSR too). It does make sense, however, to explore alternatives to the list of GATKVariants, because it's very big when serialized (maybe Kryo does a better job, but it's still a big object). A simple alternative like a sorted int[] per contig may be sufficient and has attractive properties (trivial to implement and understand, O(log n) lookups, 0% false positives, small size when serialized); see the sketch below.
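As an illustration of the sorted-int[] idea (the names here are invented for the sketch, not the GATK API):

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical index: sorted variant start positions per contig.
final class VariantPositionIndex {
    private final Map<String, int[]> sortedStartsByContig; // each array pre-sorted

    VariantPositionIndex(Map<String, int[]> sortedStartsByContig) {
        this.sortedStartsByContig = sortedStartsByContig;
    }

    // O(log n) lookup, no false positives, and int[] serializes compactly.
    boolean isVariantSite(String contig, int position) {
        int[] starts = sortedStartsByContig.get(contig);
        return starts != null && Arrays.binarySearch(starts, position) >= 0;
    }
}
```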
Code for the analysis is in the branch https://github.com/broadinstitute/gatk/compare/ak_EncodeDBSNPasBloomFilters
Very interesting, @akiezun, but aren't you forgetting something? We haven't yet shown that the cost of broadcasting the variants is anything more than a tiny fraction of the total runtime, and we haven't shown how that cost scales with the number of nodes and the size of the vcfs. It's pointless to spend tons of time optimizing something and adding extra complexity if it's not a substantial fraction of runtime. I've opened https://github.com/broadinstitute/gatk/issues/1675, and proposed it as a pre-requisite to this ticket.
Sure, obviously determining that is part of this ticket.
I'll unmark the milestone until we decide that this is the way to do it.
We support vcf.gz now.
Using full (even sites-only) VCFs for BQSR is heavy, especially in Spark, where we broadcast them. They can exceed 10GB in size. Really, it's just a few million positions, so we could compress it hugely: say we have 4 million variants to consider (common sites) - that's just 4M * 32 bits = 16 MB.
This would require creating a special format (or finding an existing one that works for this case). Note that we also need to represent indel positions (with start and end, which complicates things), but they are much less common (about 1 in 10 compared to SNPs).
Or maybe using a BloomFilter is the way to go. For a 10^-3 false positive probability and 4 million entries, we only need ~7MB with 10 hash functions; see the sketch below.
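For scale, a rough sketch with Guava's BloomFilter sized that way (packing contig index and position into a long is just one illustrative encoding, not a settled format):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomSizingSketch {
    public static void main(String[] args) {
        // ~4M expected entries at 10^-3 fpp works out to roughly 7 MB of bits.
        BloomFilter<Long> sites = BloomFilter.create(Funnels.longFunnel(), 4_000_000, 0.001);
        long packed = (1L << 32) | 1_234_567L; // illustrative: contig index 1, position 1234567
        sites.put(packed);
        // May return true for a position we never inserted (false positive),
        // but never false for one we did insert.
        boolean maybeVariant = sites.mightContain(packed);
        System.out.println(maybeVariant);
    }
}
```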