dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

DeepVariant VCF merge fails with docker #241

Closed loipf closed 3 years ago

loipf commented 3 years ago

hi,

similar to #232, I get the same error, even though I am using the latest docker image. the .vcf files were created with DeepVariant 1.0.0 and I am using the image quay.io/mlin/glnexus:v1.2.7. the "getting started" example works perfectly in the image. the only difference is that I am reading the files from separate folders, but this should not matter.

the 3 small example vcf files used (on MT chrom): error_vcf.zip

the (default) command used:

docker run --rm -i  \
    -v /path/error_vcf:"/in"  \
    quay.io/mlin/glnexus:v1.2.7  \
    bash -c "glnexus_cli --config DeepVariant /in/*/*.g.vcf.gz" # > glnexus_out.bcf

please ask if you need additional information, thank you for your help

the error:

[1] [2020-10-15 09:08:30.374] [GLnexus] [info] glnexus_cli release v1.2.7-0-g0e74fc4 Aug 13 2020
[1] [2020-10-15 09:08:30.374] [GLnexus] [info] detected jemalloc 5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756
[1] [2020-10-15 09:08:30.374] [GLnexus] [info] Loading config preset DeepVariant
[1] [2020-10-15 09:08:30.379] [GLnexus] [info] config:
unifier_config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 10
  min_AQ2: 10
  min_GQ: 0
  max_alleles_per_site: 32
  monoallelic_sites_for_lost_alleles: true
  preference: common
genotyper_config:
  revise_genotypes: true
  min_assumed_allele_frequency: 9.99999975e-05
  required_dp: 0
  allow_partial_data: true
  allele_dp_format: AD
  ref_dp_format: MIN_DP
  output_residuals: false
  more_PL: true
  squeeze: false
  trim_uncalled_alleles: true
  output_format: BCF
  liftover_fields:
    - {orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read depth (reads with MQ=255 or with bad mates are filtered)\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [AD], name: AD, description: "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\">", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}
    - {orig_names: [GQ], name: GQ, description: "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
    - {orig_names: [PL], name: PL, description: "##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype Likelihoods\">", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}
[1] [2020-10-15 09:08:30.379] [GLnexus] [info] config CRC32C = 2857227159
[1] [2020-10-15 09:08:30.379] [GLnexus] [info] init database, exemplar_vcf=/in/DRR023395/DRR023395.g.vcf.gz
[1] [2020-10-15 09:08:30.485] [GLnexus] [info] Initialized GLnexus database in GLnexus.DB
[1] [2020-10-15 09:08:30.485] [GLnexus] [info] bucket size: 30000
[1] [2020-10-15 09:08:30.485] [GLnexus] [info] contigs: MT
[1] [2020-10-15 09:08:30.525] [GLnexus] [info] db_get_contigs GLnexus.DB
[1] [2020-10-15 09:08:30.600] [GLnexus] [info] Beginning bulk load with no range filter.
[1] [2020-10-15 09:08:30.609] [GLnexus] [info] Loaded 2 datasets with 1 samples; 126952 bytes in 1431 BCF records (0 duplicate) in 2 buckets. Bucket max 99648 bytes, 1121 records. 0 BCF records skipped due to caller-specific exceptions
[1] [2020-10-15 09:08:30.609] [GLnexus] [info] Created sample set *@2
[1] [2020-10-15 09:08:30.609] [GLnexus] [error] /in/ERR1275204/ERR1275204.g.vcf.gz Exists: sample is currently being added (default (/in/ERR1275204/ERR1275204.g.vcf.gz))
[1] [2020-10-15 09:08:30.672] [GLnexus] [error] Failed to bulk load into DB: Failure: One or more gVCF inputs failed validation or database loading; check log for details.
mlin commented 3 years ago

In the GVCF files, the sample ID needs to be set to a unique identifier. Currently they all seem to be set to "default" (from your zip):

[image: screenshot of a gVCF from the zip; the sample ID "default" is the rightmost field of the third line]
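
A clash like this can be spotted before loading. The following is a minimal sketch, not part of GLnexus: it assumes bcftools is installed, and the function names `list_sample_ids` and `duplicate_ids` are made up for illustration. It relies on `bcftools query -l`, which prints the sample names stored in a VCF/BCF header.

```shell
#!/usr/bin/env bash

# Print every sample ID that occurs more than once on stdin (one ID per line).
duplicate_ids() {
    sort | uniq -d
}

# Print the sample IDs of the given gVCF files, one per line.
# Assumes bcftools is on the PATH.
list_sample_ids() {
    local f
    for f in "$@"; do
        bcftools query -l "$f"
    done
}

# Example, using the input glob from the command in this issue:
#   list_sample_ids /path/error_vcf/*/*.g.vcf.gz | duplicate_ids
```

Any ID printed by the pipeline in the final comment is shared by at least two inputs and would need to be made unique before running glnexus_cli.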

You can rewrite the headers fairly efficiently with bcftools reheader, but it would be better to adjust the upstream pipeline so that it puts the sample IDs there in the first place. DeepVariant might be filling the ID from the SAM/BAM/CRAM file header (RG SM:default?), which in turn would have been supplied as an input to the read aligner (I'm not certain of that, though).
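
As a rough sketch of the reheader route: the helper names below are hypothetical, and the choice to take the new sample ID from the file name (e.g. `DRR023395.g.vcf.gz` becomes `DRR023395`) is an assumption that happens to fit the paths in this issue. `bcftools reheader -s FILE` reads the new sample names from a file, one per line, in the order the samples appear in the VCF.

```shell
#!/usr/bin/env bash

# Derive a sample ID from a gVCF path by stripping the directory and the
# .g.vcf.gz suffix. Assumption: the file name is a suitable unique ID.
sample_id_from_path() {
    basename "$1" .g.vcf.gz
}

# Write a copy of a single-sample gVCF whose sample name is replaced by the
# ID derived from its file name. Assumes bcftools is on the PATH.
reheader_gvcf() {
    local in="$1" out="$2" names
    names="$(mktemp)"
    sample_id_from_path "$in" > "$names"   # one new name per line
    bcftools reheader -s "$names" -o "$out" "$in"
    rm -f "$names"
}

# Example (output directory is hypothetical):
#   mkdir -p /path/renamed
#   for f in /path/error_vcf/*/*.g.vcf.gz; do
#       reheader_gvcf "$f" "/path/renamed/$(basename "$f")"
#   done
```

Writing the renamed files to a separate directory keeps the originals untouched and keeps them out of the glob passed to glnexus_cli.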

loipf commented 3 years ago

this solved the problem! thank you very much for the detailed answer, also with the reheader info

sunbacteria commented 2 years ago

hi,

similar to https://github.com/dnanexus-rnd/GLnexus/issues/241 , I get the same error. However, after checking the gVCF files, my sample IDs appear to be unique, so I wonder whether something else could cause this error.

Part of my error report in log file is shown as follows:

[85597] [2022-03-09 13:57:34.574] [GLnexus] [info] config CRC32C = 3285998180
[85597] [2022-03-09 13:57:34.574] [GLnexus] [info] init database, exemplar_vcf=/input/DD19003942_C_WES.g.vcf.gz
[85597] [2022-03-09 13:57:34.932] [GLnexus] [info] Initialized GLnexus database in GLnexus.DB
[85597] [2022-03-09 13:57:34.932] [GLnexus] [info] bucket size: 30000
[85597] [2022-03-09 13:57:34.932] [GLnexus] [info] contigs: chr1 chr2 chr3 chr4 .....
[85597] [2022-03-09 13:57:34.960] [GLnexus] [info] db_get_contigs GLnexus.DB
[85597] [2022-03-09 13:57:35.195] [GLnexus] [info] Beginning bulk load with no range filter.
[E::hts_open_format] Failed to open file /intput/DD19003942_M_WES.g.vcf.gz
[E::hts_open_format] Failed to open file /intput/DD19003942_F_WES.g.vcf.gz
[85597] [2022-03-09 13:57:55.301] [GLnexus] [info] Loaded 1 datasets with 1 samples; 970692232 bytes in 11262072 BCF records (4 duplicate) in 97672 buckets. Bucket max 365608 bytes, 3915 records. 0 BCF records skipped due to caller-specific exceptions
[85597] [2022-03-09 13:57:55.303] [GLnexus] [info] Created sample set *@1
[85597] [2022-03-09 13:57:55.303] [GLnexus] [error] /intput/DD19003942_F_WES.g.vcf.gz IOError: opening gVCF file (/intput/DD19003942_F_WES.g.vcf.gz)
[85597] [2022-03-09 13:57:55.304] [GLnexus] [error] /intput/DD19003942_M_WES.g.vcf.gz IOError: opening gVCF file (/intput/DD19003942_M_WES.g.vcf.gz)
[85597] [2022-03-09 13:57:59.318] [GLnexus] [error] Failed to bulk load into DB: Failure: One or more gVCF inputs failed validation or database loading; check log for details.
Failed to read from standard input: unknown file type
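
In this log the two files that fail to open are under /intput/, while the exemplar file is under /input/, so the IOError may come from the paths rather than from the sample IDs. A small pre-flight check along these lines can catch that before the load starts. It is only a sketch, and `unreadable_inputs` is a hypothetical helper, not a GLnexus feature.

```shell
#!/usr/bin/env bash

# Print each argument that is not a readable file, one per line.
# Prints nothing when every input can be read.
unreadable_inputs() {
    local f
    for f in "$@"; do
        [ -r "$f" ] || printf '%s\n' "$f"
    done
}

# Example, with the paths taken from the log:
#   unreadable_inputs /input/DD19003942_C_WES.g.vcf.gz \
#       /intput/DD19003942_M_WES.g.vcf.gz /intput/DD19003942_F_WES.g.vcf.gz
```

Running it with the same argument list that is passed to glnexus_cli shows which paths, if any, do not resolve inside the container or on the host.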