dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

Performance issue with std::set used as a range filter when exome bed targets are used #171

Closed · xunjieli closed this 5 years ago

xunjieli commented 5 years ago

An exome BED file can contain close to 300k records, and a std::set<range> of this size slows the import stage to a crawl.

Local profiling shows that most of the time (~86% in this case) is spent iterating the red-black tree that backs the std::set<range>.

Can we improve this part? One option would be a std::vector instead of the std::set, with a binary search for overlapping ranges. Or, as a better solution (suggested in your comment below), it would be great if you could consider building a tabix index when one doesn't already exist and querying it via htslib/tbx.h, so we could drop the std::set here and use a flat container.

https://github.com/dnanexus-rnd/GLnexus/blob/master/src/BCFKeyValueData.cc#L933

static Status bulk_insert_gvcf_key_values(BCFBucketRange& rangeHelper,
                                          MetadataCache& metadata,
                                          KeyValue::DB* db,
                                          const string& dataset,
                                          const string& filename,
                                          const set<range>& range_filter,
                                          const bcf_hdr_t *hdr,
                                          vcfFile *vcf,
                                          BCFKeyValueData::import_result& rslt) {
...
    for(c = bcf_read(vcf, hdr, vt.get());
        c == 0 && vt->errcode == 0;
        c = bcf_read(vcf, hdr, vt.get())) {
        range vt_rng(vt.get());
        last_range = vt_rng;
        if (!range_filter.empty()) {
            // Test range filter if applicable. It would be nice to use a
            // tabix index instead.
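            // NOTE: the all_of below is a linear scan of range_filter for
            // every gVCF record, i.e. O(records x targets) overall.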
            if (all_of(range_filter.begin(), range_filter.end(),
                       [&vt_rng](const range& r) { return !r.overlaps(vt_rng); })) {
                continue;
            }
        }
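For concreteness, here is a minimal sketch of the std::vector + binary-search idea. It assumes the targets are sorted by (rid, beg) and merged so they don't overlap one another (as a sorted, merged BED would be); the standalone range struct is a stand-in mirroring GLnexus's (rid, beg, end) range, not the real type.

#include <algorithm>
#include <vector>

// Stand-in for GLnexus's range: chromosome id + half-open [beg, end).
struct range {
    int rid, beg, end;
    bool overlaps(const range& r) const {
        return rid == r.rid && beg < r.end && r.beg < end;
    }
};

// targets must be sorted by (rid, beg) and pairwise non-overlapping.
bool overlaps_any(const std::vector<range>& targets, const range& q) {
    // Find the first target whose (rid, beg) is strictly greater than q's;
    // given the sorted, merged targets, only this element and its immediate
    // predecessor can possibly overlap q.
    auto it = std::upper_bound(
        targets.begin(), targets.end(), q,
        [](const range& a, const range& b) {
            return a.rid < b.rid || (a.rid == b.rid && a.beg < b.beg);
        });
    if (it != targets.end() && it->overlaps(q)) return true;
    if (it != targets.begin() && std::prev(it)->overlaps(q)) return true;
    return false;
}

With ~300k targets, this replaces the 300k-element scan per gVCF record with two O(log n) probes.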
mlin commented 5 years ago

Hi, thanks for reporting! I think this is unintended use, stemming from a vestigial feature and poor documentation on our part...

The all_of cited above does a linear scan of the set<range>, which is bad for large range sets. glnexus_cli is hardcoded to supply an empty set for this argument; IIRC the filter is there for another use case that involved importing the gVCFs one or a few chromosomes at a time. Sorry this was a trap.

Suggestions:

  1. It might be adequate to supply import_gvcf with an empty set (as glnexus_cli does) or a set covering just one chromosome. Importing "extra" records into the internal database shouldn't cost much or hurt anything, as long as the exact target regions are used later for allele discovery and so on.
  2. A better way to do the chromosome filtering would be either (i) to use tabix to split up the gVCF files by chromosome beforehand, or (ii) to have import_gvcf use a tabix index file to select the chromosome without scanning everything else (see the sketch after this list).
  3. If we felt strongly that we need to filter the database import with the full list of target regions, there are many options for a more appropriate data structure. In fact, there's a fairly new one I'd filed an issue to look into (#166), and this would be a solid use case for it.
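For reference, a minimal sketch of what 2(ii) might look like with htslib's tabix API (htslib/tbx.h), assuming the gVCF is bgzip-compressed: load the .tbi index, building it first if it doesn't exist, then iterate only the requested chromosome. The function name, region string, and minimal error handling here are illustrative, not GLnexus code.

#include <cstdio>
#include <cstdlib>
#include <htslib/hts.h>
#include <htslib/kstring.h>
#include <htslib/tbx.h>

int scan_one_chromosome(const char* gvcf_path, const char* region) {
    // Load the tabix index; build a .tbi (min_shift = 0) if it's missing.
    tbx_t* tbx = tbx_index_load(gvcf_path);
    if (!tbx) {
        if (tbx_index_build(gvcf_path, 0, &tbx_conf_vcf) != 0) return -1;
        tbx = tbx_index_load(gvcf_path);
        if (!tbx) return -1;
    }

    htsFile* fp = hts_open(gvcf_path, "r");
    if (!fp) { tbx_destroy(tbx); return -1; }

    // region can be a chromosome name ("chr1") or "chr1:beg-end".
    hts_itr_t* itr = tbx_itr_querys(tbx, region);
    if (!itr) { hts_close(fp); tbx_destroy(tbx); return -1; }

    kstring_t line = {0, 0, nullptr};
    while (tbx_itr_next(fp, tbx, itr, &line) >= 0) {
        // One raw gVCF record within the region; parse/import it here.
        printf("%s\n", line.s);
    }

    free(line.s);
    tbx_itr_destroy(itr);
    tbx_destroy(tbx);
    hts_close(fp);
    return 0;
}

import_gvcf could then run one such scan per requested chromosome instead of testing every record against the full BED.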
xunjieli commented 5 years ago

Thanks for the suggestions. We went with suggestion 1, which worked reasonably well for us. Please feel free to close this.

mlin commented 5 years ago

Great, thanks for the feedback. I would like to do 2(ii) when time permits :smiley: