dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

combine gvcf files with different contigs #238

Open hurleyLi opened 3 years ago

hurleyLi commented 3 years ago

Hi @mlin, I want to merge two big sets of gVCF files. One set includes the hs37d5 contig; the other doesn't. When I tried to run glnexus_cli, I got the following error:

[7451] [2020-09-24 17:19:06.680] [GLnexus] [error] /users/hl7/SB969883.SNP.gvcf.gz Invalid: Incompatible gVCF. The reference contigs must match the database configuration exactly. (/users/hl7/SB969883.SNP.gvcf.gz)

The loading runs fine until the program sees a sample with different contigs.
I'm wondering if there is an easy way to specify which contigs need to be loaded, or to somehow bypass this validation step. Previously I just added / removed the differing contig names in the gVCF files, but this project is fairly big, with ~10k samples in each set...
Any suggestions would be appreciated!

This might be related to #236

Thanks! Hurley

mlin commented 3 years ago

I can't think of a way to bypass it currently. Is one contig list at least a prefix of the other? I could see handling that case (as long as you give it the longer ones first). It uses htslib's name<->integer mapping internally, so I'd otherwise be reluctant to introduce the possibility of inconsistent mappings somewhere in the codebase.
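To check the prefix question for two gVCFs before loading them, the contig lists can be read from the headers and compared. This is a minimal Python sketch, not part of GLnexus: the helper names `contig_names`, `read_contigs`, and `is_prefix` are hypothetical, and it assumes the headers declare contigs with standard `##contig=<ID=...>` lines. It only diagnoses the mismatch and does not change how GLnexus validates input.

```python
import gzip
import re

# Matches a VCF header contig line and captures the ID value.
CONTIG_ID = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def contig_names(header_lines):
    """Return contig IDs, in header order, from an iterable of VCF lines."""
    names = []
    for line in header_lines:
        if not line.startswith("#"):
            break  # first record reached, so the header is over
        match = CONTIG_ID.match(line)
        if match:
            names.append(match.group(1))
    return names


def read_contigs(path):
    """Read contig IDs from a gzip/bgzip-compressed (g)VCF file."""
    with gzip.open(path, "rt") as handle:
        return contig_names(handle)


def is_prefix(shorter, longer):
    """True if `shorter` equals the first len(shorter) entries of `longer`."""
    return len(shorter) <= len(longer) and longer[:len(shorter)] == shorter
```

For example, `is_prefix(read_contigs("a.gvcf.gz"), read_contigs("b.gvcf.gz"))` (file names are placeholders) reports whether the first file's contig list is a leading slice of the second's, in the same order.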

hurleyLi commented 3 years ago

I tried both ways (putting the ones with the extra hs37d5 contig either before or after), and neither one works. I think the error message is clear at least:

Invalid: Incompatible gVCF. The reference contigs must match the database configuration exactly.

I guess the only way is to change the header to make the two datasets match?
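One way to script that header change across many samples is to swap each file's `##contig` block for a shared target list. The Python sketch below is an illustration under assumptions, not a GLnexus feature: `harmonize_header` is a hypothetical helper, header lines are passed without trailing newlines, and the target list must be a superset of the contigs the file already declares (dropping a contig that records still reference would make the file invalid, so the sketch refuses to do that).

```python
import re

# Matches a VCF header contig line and captures the ID value.
CONTIG_ID = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def harmonize_header(header_lines, target_contig_lines):
    """Return header_lines with the ##contig block replaced by the target block."""
    target_names = {CONTIG_ID.match(l).group(1) for l in target_contig_lines}
    out = []
    inserted = False
    for line in header_lines:
        match = CONTIG_ID.match(line)
        if match:
            if match.group(1) not in target_names:
                raise ValueError(
                    "contig %r is missing from the target list" % match.group(1)
                )
            if not inserted:
                # Put the whole target block where the first contig line was.
                out.extend(target_contig_lines)
                inserted = True
            continue  # drop the original contig line
        if line.startswith("#CHROM") and not inserted:
            # Header had no contig lines: add the block before the column line.
            out.extend(target_contig_lines)
            inserted = True
        out.append(line)
    return out
```

The rewritten header could then be saved to a text file and applied to each sample with a tool such as `bcftools reheader -h`, which avoids editing the record lines. Whether GLnexus accepts the result still depends on every file ending up with exactly the same contig list in the same order.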

MeHelmy commented 4 months ago

Hi Mlin!

Thank you for your efforts. Were you able to fix this issue? I'm encountering a similar error in my work on targeted genes. For example, in one sample, I have chromosomes chr1, chr2, chr3, while in another sample, chr3 might be missing.

Thanks!