Open zhpn1024 opened 4 years ago
Sounds like you're asking about merging GenomicsDBs with different samples, as a different route to incremental import (as opposed to merging different workspaces with the same samples but different genomic intervals).
No, this is not supported.
Yes, merging GenomicsDBs with different samples in the same region. I think it may be more efficient with parallel processing for large numbers of samples. Is it possible to add this feature?
Can you elaborate on your use case where it makes more sense to parallelize over groups of samples rather than genomic intervals? If you're worried about memory usage or open file handles during GenomicsDBImport due to a large number of samples, you can alleviate this by using the --batch-size parameter. But I'm guessing you're asking about the query step instead?
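As a rough illustration of the --batch-size suggestion, here is a minimal sketch that only builds a GenomicsDBImport command line (it does not run it). The function name, the workspace path, the sample map file name, and the batch size of 50 are all hypothetical placeholders chosen for the example, not values recommended in this thread.

```python
from typing import List


def genomicsdb_import_cmd(workspace: str, sample_map: str,
                          interval: str, batch_size: int = 50) -> List[str]:
    """Build (but do not execute) a GATK GenomicsDBImport command.

    All argument values are placeholders supplied by the caller.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be >= 0")
    return [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "--sample-name-map", sample_map,   # tab-separated: sample name, GVCF path
        "-L", interval,
        # Import samples in batches of this size to limit memory use and
        # the number of simultaneously open files.
        "--batch-size", str(batch_size),
    ]


if __name__ == "__main__":
    # Hypothetical paths, printed for inspection only.
    print(" ".join(genomicsdb_import_cmd("my_workspace", "samples.map", "chr1")))
```

The resulting list can be passed to `subprocess.run` once the placeholder paths are replaced with real ones.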
Fwiw, there's been discussion about parallelizing over genomic intervals here and here, and we're contemplating how best to enable that further beyond what can be done currently. See here and here for instance.
I'll try the --batch-size option. I also use genomic intervals, but only ones split at large N gaps, since arbitrary intervals may have problems at interval junctions. There are still ~100M large intervals, so parallelizing over groups of samples becomes a good choice. I suppose importing GVCF data into GenomicsDB takes much more time than merging two or more DBs, which may only need copy operations. In addition, if GenotypeGVCFs allowed more than one GenomicsDB input, the merging step could be omitted.
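For readers unfamiliar with splitting at N gaps, the idea can be sketched as follows. This is an assumption about how such intervals might be derived, not the poster's actual script: the function name `split_at_n_gaps` and the `min_gap` threshold are hypothetical, and a real pipeline would read the reference from a FASTA file rather than a string.

```python
from typing import List, Tuple


def split_at_n_gaps(seq: str, min_gap: int) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals of `seq` that lie
    between runs of N at least `min_gap` bases long.

    Shorter N runs are kept inside the surrounding interval.
    """
    intervals: List[Tuple[int, int]] = []
    start = 0              # start of the interval currently being built
    i, n = 0, len(seq)
    while i < n:
        if seq[i] in "Nn":
            j = i
            while j < n and seq[j] in "Nn":
                j += 1     # advance to the end of this N run
            if j - i >= min_gap:
                if i > start:
                    intervals.append((start, i))
                start = j  # next interval begins after the gap
            i = j
        else:
            i += 1
    if n > start:
        intervals.append((start, n))
    return intervals


if __name__ == "__main__":
    # Toy sequence: one long gap (5 Ns) and one short N that is not a gap.
    print(split_at_n_gaps("ACGTNNNNNACGNT", min_gap=3))
```

Because every boundary falls inside a stretch of N, no variant record should span two intervals, which is the junction problem mentioned above.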
And what about merging several GenomicsDBs with the same samples but over different non-overlapping regions (one per interval)?
For GVCFs, is there a way to merge two or more existing GenomicsDBs, so that we do not need to back up in case the workspace becomes corrupted?