mlin closed this issue 4 years ago
@mlin One question on unifying sites. Your manuscript mentions that GLnexus parallelizes each stage: loading by file, unification by chromosome, and genotyping by site.
For unification to work, do we need to pass unified_sites() the discovered alleles for the entire chromosome of the entire cohort? Chr1 can be rather long. If we chop the long chromosome into several non-overlapping ranges, and call discover_alleles() and then unified_sites() for each of those ranges, is the result still valid? More specifically, does GLnexus handle danglers correctly in unified_sites()? Thanks.
If you were to chop the chromosome at some arbitrary points (e.g. every 10Mbp) then the unifier is not guaranteed to give the same results as when it sees everything at once, though it would take quite bad luck for this to throw off downstream analyses in any severe way. (The results would not be wrong, just different.)
An alternate strategy is to cut the chromosomes more strategically at positions where the underlying sequencing data aren't really informative anyway, namely low-complexity regions of a kilobase or longer, and around assembly gaps. Then any edge discrepancies usually get filtered out downstream anyway, on purpose.
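As a rough illustration of that cutting strategy, here is a minimal sketch that splits one chromosome only at the midpoints of known uninformative intervals. It is not GLnexus code: the `Interval` struct, the `cut_at_breaks` function, and the `min_chunk` parameter are all hypothetical names, and the list of gaps or low-complexity regions is assumed to come from elsewhere (for example a BED file for your reference assembly).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open interval [beg, end) on a single chromosome (hypothetical type).
struct Interval {
    int64_t beg;
    int64_t end;
};

// Split [0, chrom_len) into contiguous, non-overlapping ranges, cutting only
// at the midpoints of the supplied "uninformative" intervals (assembly gaps or
// long low-complexity regions), and only once the current range has reached
// min_chunk bases. `breaks` must be sorted by position and non-overlapping.
std::vector<Interval> cut_at_breaks(int64_t chrom_len,
                                    const std::vector<Interval>& breaks,
                                    int64_t min_chunk) {
    assert(chrom_len > 0 && min_chunk > 0);
    std::vector<Interval> ranges;
    int64_t start = 0;
    for (const Interval& b : breaks) {
        assert(b.beg < b.end && b.end <= chrom_len);
        const int64_t cut = b.beg + (b.end - b.beg) / 2;  // middle of the gap
        if (cut - start >= min_chunk && cut < chrom_len) {
            ranges.push_back({start, cut});
            start = cut;
        }
    }
    ranges.push_back({start, chrom_len});  // final range runs to the end
    return ranges;
}
```

Because every cut falls inside a region that is expected to be filtered anyway, alleles that straddle a boundary should be rare; the sketch makes no attempt to handle them.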
That said, the allele unifier is the least expensive stage by far, since it operates on aggregates of the dataset that grow roughly with sqrt(N). So we haven't found such fine-grained parallelization necessary there to date. Chromosome-level parallelization was simply left out when we threw together the open-source driver program.
Done in d701cfb
Here the glnexus_cli driver runs the allele unifier by looping over the chromosomes serially: https://github.com/dnanexus-rnd/GLnexus/blob/74f427950c325ef58a283dc8ce2008c8b46458c6/cli/glnexus_cli.cc#L112-L125
This causes an embarrassing single-threaded stage, which is quite noticeable on WGS datasets (less of an issue for WES). It should be easy to parallelize.
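One way the serial loop could be parallelized is sketched below, using only the C++ standard library. This is an assumption-laden illustration, not the change that was actually made in GLnexus: `for_each_contig_parallel` and the `per_contig` callback are hypothetical names, with `per_contig` standing in for whatever the loop body does for one chromosome.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Run `per_contig` once per contig, each on its own asynchronous task, and
// return the results in the same order as `contigs`.
// `per_contig` is a hypothetical stand-in for the per-chromosome unifier body.
template <typename Result>
std::vector<Result> for_each_contig_parallel(
    const std::vector<std::string>& contigs,
    const std::function<Result(const std::string&)>& per_contig) {
    assert(per_contig && "per_contig must be callable");
    std::vector<std::future<Result>> pending;
    pending.reserve(contigs.size());
    for (const std::string& contig : contigs) {
        // std::launch::async forces a separate thread of execution per task.
        pending.push_back(std::async(
            std::launch::async,
            [&per_contig, &contig] { return per_contig(contig); }));
    }
    std::vector<Result> results;
    results.reserve(contigs.size());
    for (auto& f : pending) {
        results.push_back(f.get());  // blocks; rethrows any task exception
    }
    return results;
}
```

A call would look like `for_each_contig_parallel<std::size_t>(contigs, callback)`, with the result type given explicitly because it cannot be deduced from a lambda. The sketch launches one task per contig, so for references with many small contigs a bounded thread pool would likely be the better choice; on some platforms the program also needs to be linked with `-pthread`.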