broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Merge existing GenomicsDBs #6629

Open zhpn1024 opened 4 years ago

zhpn1024 commented 4 years ago

For GVCFs, is there a way to merge two or more existing GenomicsDBs, so that we do not need to keep a backup in case the workspace becomes corrupted?

mlathara commented 4 years ago

Sounds like you're asking about merging GenomicsDBs with different samples, as an alternative route to incremental import (as opposed to merging different workspaces with the same samples but different genomic intervals).

No, this is not supported.

zhpn1024 commented 4 years ago

Yes, merging GenomicsDBs with different samples in the same region. I think it may be more efficient to process large numbers of samples in parallel this way. Would it be possible to add this feature?

mlathara commented 4 years ago

Can you elaborate on your use case, where it makes more sense to parallelize over groups of samples rather than over genomic intervals? If you're worried about memory usage or open file handles during GenomicsDBImport due to a large number of samples, you can alleviate this by using the --batch-size parameter. But I'm guessing you're asking about the query step instead?
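A minimal sketch of the --batch-size suggestion above, in Python. It only assembles the argument list for a GenomicsDBImport call and estimates how many batches the import would run in; it does not invoke GATK. The helper names, workspace path, sample map file, and interval are hypothetical placeholders, and the batch count assumes that a batch size of 0 means all samples are read in a single batch.

```python
import math
from typing import List


def build_import_command(workspace: str, sample_map: str,
                         interval: str, batch_size: int) -> List[str]:
    """Assemble a GenomicsDBImport command line (hypothetical helper).

    batch_size limits how many samples have readers open at once;
    0 is assumed to mean "no batching".
    """
    if batch_size < 0:
        raise ValueError("batch_size must be >= 0")
    return [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "--sample-name-map", sample_map,
        "-L", interval,
        "--batch-size", str(batch_size),
    ]


def number_of_batches(n_samples: int, batch_size: int) -> int:
    """Number of import batches for n_samples at the given batch size."""
    if batch_size == 0:
        return 1  # assumption: 0 means all samples in one batch
    return math.ceil(n_samples / batch_size)


if __name__ == "__main__":
    # Placeholder paths and interval, for illustration only.
    cmd = build_import_command("my_workspace", "samples.map", "chr20", 50)
    print(" ".join(cmd))
    print(number_of_batches(1200, 50), "batches")
```

A smaller batch size keeps fewer GVCF readers open at a time, at the cost of more import rounds.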

Fwiw, there's been discussion about parallelizing over genomic intervals here and here, and we're contemplating how best to enable that further beyond what can be done currently. See here and here for instance.

zhpn1024 commented 4 years ago

I'll try the --batch-size option. I also split by genomic intervals, but only at the large N gaps, since arbitrary intervals may cause problems at interval junctions. That still leaves large intervals of ~100 Mb, so parallelizing over groups of samples becomes a good choice. I suppose importing GVCF data into GenomicsDB takes much more time than merging two or more DBs, which may only need copy operations. In addition, if GenotypeGVCFs allowed more than one GenomicsDB input, the merging step could be omitted.
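For illustration, splitting a contig at large N gaps as described above could look roughly like the following Python sketch. It is not GATK code: the function name and the `min_gap` threshold are hypothetical, and it assumes the reference sequence for the contig is already loaded as a string. Runs of N shorter than `min_gap` stay inside an interval.

```python
import re
from typing import Iterator, Tuple


def split_on_n_gaps(contig: str, sequence: str,
                    min_gap: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (contig, start, end) intervals, 1-based and inclusive,
    covering the sequence between runs of at least min_gap Ns."""
    if min_gap < 1:
        raise ValueError("min_gap must be >= 1")
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0  # 0-based start of the current non-gap stretch
    for gap in gap_pattern.finditer(sequence):
        if gap.start() > start:
            yield (contig, start + 1, gap.start())
        start = gap.end()
    if start < len(sequence):
        yield (contig, start + 1, len(sequence))


if __name__ == "__main__":
    # Toy sequence: one long gap (5 Ns) and one short gap (2 Ns).
    seq = "ACGT" + "N" * 5 + "ACG" + "NN" + "TT"
    for name, begin, end in split_on_n_gaps("chrT", seq, min_gap=3):
        print(f"{name}:{begin}-{end}")
```

Each printed `contig:start-end` string has the form accepted by the -L argument, so one import job could be launched per interval.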

fgvieira commented 1 year ago

And what about merging several GenomicsDBs with the same samples but over different non-overlapping regions (one per interval)?