dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

N+1 calling #233

Open danchubb opened 4 years ago

danchubb commented 4 years ago

Hi,

We're attempting to use GLnexus to joint call the UK Biobank gVCFs alongside batches of in-house exomes. We will be regularly adding to this calling set. From the GLnexus paper it looks like the RocksDB database can be updated with new samples for subsequent joint calling, but I can't figure out how to do this using the Docker container. I can get a single joint call working, but if I use the same DB with additional samples, I get a "DB already present" error. Is there a way to get this working? Or is it reserved for the DNAnexus version?

Thanks for your time,

Dan

mlin commented 4 years ago

Hi, yea the open-source driver program starts over from the gVCF files each time. The machinery for incremental import is actually in the codebase here, the driver program just doesn't expose a subcommand to call the routines that'd open & add to an existing database...

danchubb commented 4 years ago

That's great. Would you be able to give me some pointers on what I would need to change?

JakeHagen commented 3 years ago

I have modified the code locally to load into a preexisting database. Is there anything in the allele discovery and genotyping process that can be reused when adding new samples to the db, or will this always need to be redone when adding new samples?

mlin commented 3 years ago

Nice! The allele discovery is an aggregation with the merge_discovered_alleles operator, so one could save the intermediates and later merge new ones in; the de/serialization routines are there, just similarly not used by glnexus_cli right now. That said, it isn't the costliest step to begin with, so just by not repeating all the sorting & indexing of the gVCF data you'd be a big part of the way there.
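
To illustrate why an aggregation can be resumed, here is a minimal Python sketch. It is not GLnexus code: `discover`, `merge_summaries`, and the `(contig, pos, ref, alt)` key with a simple observation count are hypothetical stand-ins for the real C++ structures behind merge_discovered_alleles, whose details are not shown here. The only point is that if the merge is associative, a saved summary of the old samples plus a summary of the new batch should give the same result as rediscovering alleles over everything.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical allele key: (contig, position, ref, alt).
Allele = Tuple[str, int, str, str]
# Hypothetical per-batch summary: allele -> number of observations.
AlleleSummary = Dict[Allele, int]


def discover(batch: List[List[Allele]]) -> AlleleSummary:
    """Summarize one batch of samples by counting each observed allele."""
    summary: Counter = Counter()
    for sample_alleles in batch:
        summary.update(sample_alleles)
    return dict(summary)


def merge_summaries(a: AlleleSummary, b: AlleleSummary) -> AlleleSummary:
    """Combine two summaries by adding counts (an associative merge)."""
    merged: Counter = Counter(a)
    merged.update(b)
    return dict(merged)


# Toy data: each sample is a list of alleles it carries.
batch1 = [
    [("chr1", 100, "A", "G")],
    [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
]
batch2 = [
    [("chr1", 250, "C", "T"), ("chr2", 40, "G", "GA")],
]

saved = discover(batch1)                                 # intermediate kept from the first run
incremental = merge_summaries(saved, discover(batch2))   # N+1: only the new batch is scanned
from_scratch = discover(batch1 + batch2)                 # what a full rerun would compute

assert incremental == from_scratch
```

In this toy version the saved intermediate is just a dict; in practice it would have to be serialized between runs, which is the role of the de/serialization routines mentioned above.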

JakeHagen commented 3 years ago

Interesting, ok thank you. I will probably just leave it as is for now. Once I sanity check the results a bit more, I will do a pull request.

aardes commented 3 years ago

Hi,

Is there any compiled/tested N+1 version?

I went through the code and made some modifications (mostly in cli_utils.cc), but I still have some issues.

@mlin, I would appreciate it if you could share the solution.

thanks in advance

mlin commented 3 years ago

To clarify, there's a DNAnexus-native version of GLnexus available to their customers, which handles incremental calling and distributed operations. That's not included in this open-source repo, and in any case it's built on the DNAnexus APIs for data management and batch computing. What I was able to point out above is that some of the subroutines supporting those functions are in this repo, just not used by the open-source glnexus_cli driver program. That's as far as I can take it for now!

aardes commented 3 years ago

Thanks for the info.

JosephLalli commented 1 year ago

Hi there,

I was wondering - is this a valid way to improve variant calling? In theory, can you take a large cohort dataset such as 1000G or even UK Biobank and merge it with your own private dataset called on a functionally equivalent pipeline?

Thanks!