Closed — xunjieli closed this issue 5 years ago
Hi, thanks for reporting! I think this is unintended use due to vestigiality and poor documentation on our part... The `all_of` cited above performs a linear scan of the `set<range>`, which is bad for large range sets. The `glnexus_cli` is hardcoded to supply an empty set for this argument. IIRC it's there for another use case that involved importing the gVCFs one or a few chromosomes at a time. Sorry this was a trap.
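For illustration, a minimal sketch of why the `all_of` pattern crawls: it visits every element of the `set<range>` per query, whereas `set::lower_bound` can jump straight to the neighborhood of the query position. The `range` struct here is a hypothetical stand-in for GLnexus's actual type, and the bounded version assumes the stored ranges don't overlap one another:

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Hypothetical stand-in for GLnexus's range type: half-open [beg, end) on one contig.
struct range {
    int rid, beg, end;
    bool operator<(const range& r) const {
        if (rid != r.rid) return rid < r.rid;
        if (beg != r.beg) return beg < r.beg;
        return end < r.end;
    }
    bool overlaps(const range& r) const {
        return rid == r.rid && beg < r.end && r.beg < end;
    }
};

// Linear scan: touches every element of the set, O(n) per query.
bool any_overlap_linear(const std::set<range>& s, const range& q) {
    return !std::all_of(s.begin(), s.end(),
                        [&](const range& r) { return !r.overlaps(q); });
}

// Bounded scan: jump to the first candidate with lower_bound (O(log n)),
// then walk only ranges that could still overlap. Assumes the stored
// ranges are non-overlapping, so checking one predecessor suffices.
bool any_overlap_bounded(const std::set<range>& s, const range& q) {
    auto it = s.lower_bound(range{q.rid, q.beg, 0});
    if (it != s.begin() && std::prev(it)->overlaps(q)) return true;
    for (; it != s.end() && it->rid == q.rid && it->beg < q.end; ++it)
        if (it->overlaps(q)) return true;
    return false;
}
```

Both return the same answer; only the number of elements inspected differs, which is what the profile above is measuring.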
Suggestions:

1. Call `import_gvcf` with an empty range set. Importing "extra" records into the internal database shouldn't cost much or hurt anything, as long as the exact target regions are used later for allele discovery and so on.
2. `import_gvcf` could be able to use a tabix index file to select the chromosome without scanning everything else.

Thanks for the suggestions. We used #1, which worked reasonably well for us. Please feel free to close this.
Great, thanks for the feedback. I would like to do 2(ii) when time permits :smiley:
An exome BED file can contain close to 300k records. A `std::set<range>` of this size will slow the import stage to a crawl. Local profiling shows that most of the time (~86% in this case) is spent iterating the red-black tree that backs the `std::set<range>`. Can we improve this part? Maybe use a `std::vector` instead of `std::set`, and do a binary search for overlapping ranges? Or, as a better solution you suggested in the comment below, it would be great if you could consider implementing the tabix build (if an index doesn't already exist) and query using htslib's `tbx.h`, so we could do without the `std::set` here and use a flat container.
https://github.com/dnanexus-rnd/GLnexus/blob/master/src/BCFKeyValueData.cc#L933