dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

parallelize allele unifier across chromosomes #148

Closed: mlin closed this issue 4 years ago

mlin commented 5 years ago

Here the glnexus_cli driver runs the allele unifier by looping over the chromosomes serially: https://github.com/dnanexus-rnd/GLnexus/blob/74f427950c325ef58a283dc8ce2008c8b46458c6/cli/glnexus_cli.cc#L112-L125

This causes an embarrassing single-threaded stage, which is quite noticeable on WGS datasets (less of an issue for WES). It should be easy to parallelize.
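As a rough illustration of the kind of change being proposed, here is a minimal sketch that fans the per-chromosome work out with `std::async` and collects the results in the original contig order. It is not the GLnexus code: `UnifiedSites`, `unify_all_contigs`, and the `unify_one` callback are hypothetical stand-ins for whatever the driver actually calls per chromosome, and the sketch assumes that callback is safe to run concurrently.

```cpp
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for the unifier's per-chromosome output.
struct UnifiedSites {
    std::string contig;
    std::vector<int> positions;
};

// Run `unify_one` (a placeholder for the real per-chromosome unifier call)
// on every contig concurrently and return the results in contig order.
// Assumption: `unify_one` is thread-safe.
std::vector<UnifiedSites> unify_all_contigs(
    const std::vector<std::string>& contigs,
    const std::function<UnifiedSites(const std::string&)>& unify_one) {
    std::vector<std::future<UnifiedSites>> pending;
    pending.reserve(contigs.size());
    for (const auto& contig : contigs) {
        // std::launch::async requests a separate thread for each contig.
        pending.push_back(std::async(
            std::launch::async,
            [&unify_one, contig] { return unify_one(contig); }));
    }

    std::vector<UnifiedSites> results;
    results.reserve(pending.size());
    for (auto& f : pending) {
        // get() blocks until the task finishes and rethrows any exception.
        results.push_back(f.get());
    }
    return results;
}
```

One thread per chromosome is the simplest choice for a couple dozen contigs; a reference with many small contigs would probably want a bounded thread pool instead.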

xunjieli commented 5 years ago

@mlin One question on unifying sites. Your manuscript mentions that GLnexus parallelizes each stage: loading by file, unification by chromosome, and genotyping by site.

For unification to work, do we need to pass unified_sites() the discovered alleles for the entire chromosome across the entire cohort? Chr1 can be rather long. If we chop a long chromosome into several non-overlapping ranges and call discover_alleles() and then unified_sites() for each of those ranges, is the result still valid? More specifically, does GLnexus handle danglers correctly in unified_sites()? Thanks.

mlin commented 5 years ago

If you were to chop the chromosome at some arbitrary points (e.g. every 10Mbp) then the unifier is not guaranteed to give the same results as when it sees everything at once, though it would take quite bad luck for this to throw off downstream analyses in any severe way. (The results would not be wrong, just different.)

An alternative strategy is to cut the chromosomes more strategically, at positions where the underlying sequencing data aren't really informative anyway: low-complexity regions of a kilobase or longer, and around assembly gaps. Any edge discrepancies there usually get filtered out downstream, on purpose.
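To make the idea concrete, here is a minimal sketch of one way to pick such cut points. It is an illustration only, not something GLnexus provides: the `Range` struct, the `cut_at_uninformative` function, and the `min_chunk` parameter are all hypothetical, and the list of uninformative intervals (assembly gaps, long low-complexity regions) is assumed to come from an external annotation, sorted and non-overlapping.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical half-open interval [beg, end) on a single contig.
struct Range {
    int64_t beg;
    int64_t end;
};

// Split [0, contig_len) into chunks whose boundaries fall at the midpoints
// of "uninformative" intervals. A cut is made only once the current chunk
// has reached at least `min_chunk` bases, so chunks are never cut shorter
// than that (except possibly the final remainder).
// Assumption: `uninformative` is sorted and non-overlapping.
std::vector<Range> cut_at_uninformative(int64_t contig_len,
                                        const std::vector<Range>& uninformative,
                                        int64_t min_chunk) {
    std::vector<Range> chunks;
    int64_t chunk_beg = 0;
    for (const auto& u : uninformative) {
        const int64_t cut = u.beg + (u.end - u.beg) / 2;  // interval midpoint
        if (cut < contig_len && cut - chunk_beg >= min_chunk) {
            chunks.push_back({chunk_beg, cut});
            chunk_beg = cut;
        }
    }
    // Whatever remains after the last cut forms the final chunk.
    if (chunk_beg < contig_len) {
        chunks.push_back({chunk_beg, contig_len});
    }
    return chunks;
}
```

Each resulting range could then be passed through discover_alleles() and unified_sites() independently, with the caveat above that results near the cut points may differ from a whole-chromosome run.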

That said, the allele unifier is by far the least expensive stage, since it operates on aggregates of the dataset that grow roughly with sqrt(N). So to date we haven't found such fine-grained parallelization necessary there. The chromosome-level parallelization was simply left out when we threw together the open-source driver program.

mlin commented 4 years ago

Done in d701cfb