Closed berry08 closed 4 years ago
Hi @berry08 the "bcf1_t::errcode = 2" is an error from the htslib VCF parser which generally indicates that some FILTER/INFO/FORMAT field was not defined in the header (BCF_ERR_TAG_UNDEF). Do your input gVCF files have the complete headers defining all those fields? For example, if you're using tabix to take region slices, add "-h" so that it copies the original header too.
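For readers trying to interpret other values of `bcf1_t::errcode`: the field is a bitmask, so several problems can be reported at once. Below is a minimal sketch of a decoder. The flag values are listed as I recall them from htslib's `vcf.h` and should be checked against your htslib version; `decode_errcode` is a hypothetical helper, not part of htslib or GLnexus.

```python
# Bit flags as recalled from htslib's vcf.h (an assumption; verify against
# the header shipped with your htslib version).
BCF_ERR_FLAGS = {
    1: "BCF_ERR_CTG_UNDEF",    # contig not defined in the header
    2: "BCF_ERR_TAG_UNDEF",    # FILTER/INFO/FORMAT tag not defined in the header
    4: "BCF_ERR_NCOLS",        # wrong number of columns
    8: "BCF_ERR_LIMITS",       # a value exceeds a format limit
    16: "BCF_ERR_CHAR",        # invalid character
    32: "BCF_ERR_CTG_INVALID", # invalid contig name
    64: "BCF_ERR_TAG_INVALID", # invalid tag name
}


def decode_errcode(errcode):
    """Return the names of all flags set in a bcf1_t::errcode value."""
    return [name for bit, name in sorted(BCF_ERR_FLAGS.items()) if errcode & bit]


if __name__ == "__main__":
    # The value reported in this issue.
    print(decode_errcode(2))
```

With the table above, the value 2 from this issue maps to the single flag BCF_ERR_TAG_UNDEF.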
Thanks @mlin, here is an example of my input gVCF: sample1000000.g.vcf.txt
@berry08 it seems like "ReadPosRankSum" is the problematic INFO field which is not defined in the header.
$ bcftools view sample1000000.g.vcf.txt > /dev/null
[W::vcf_parse] INFO 'ReadPosRankSum' is not defined in the header, assuming Type=String
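If bcftools is not at hand, the same check can be approximated in a few lines of Python. This is only a sketch under simple assumptions: the input is uncompressed, tab-delimited VCF text with the header before the records, and `undefined_info_keys` is a hypothetical helper name. It reports every INFO key that appears in a record without a matching `##INFO` header line.

```python
import re


def undefined_info_keys(vcf_lines):
    """Return sorted INFO keys used in records but not declared in the header."""
    defined = set()
    undefined = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO="):
            # Header declaration, e.g. ##INFO=<ID=END,Number=1,...>
            match = re.match(r"##INFO=<ID=([^,>]+)", line)
            if match:
                defined.add(match.group(1))
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # Column 8 (index 7) is INFO; "." means no INFO entries.
            if len(fields) < 8 or fields[7] == ".":
                continue
            for entry in fields[7].split(";"):
                key = entry.split("=", 1)[0]  # flags such as DB have no "="
                if key and key not in defined:
                    undefined.add(key)
    return sorted(undefined)
```

For a real file, pass an open file handle, for example `undefined_info_keys(open("sample1000000.g.vcf.txt"))`; for a compressed file, `gzip.open(path, "rt")` should work the same way.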
@berry08 can we help you further here?
Hello, I came across a similar error. Can you please tell me how to add this information to the header?
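One possible way to do this, sketched below, is to insert the missing `##INFO` line just before the `#CHROM` line. The cleaner fix is usually to keep the complete header produced by the variant caller (see the tabix "-h" advice above), because only the caller knows the correct definition. The `Number`, `Type`, and `Description` values in the usage note are assumptions for illustration and should be copied from a complete header written by the same caller; `add_info_header` is a hypothetical helper.

```python
def add_info_header(vcf_lines, info_id, number, info_type, description):
    """Return VCF lines with an ##INFO definition added before #CHROM.

    Lines are expected without trailing newlines. If the ID is already
    defined, the input is returned unchanged (as a new list).
    """
    new_line = (
        f"##INFO=<ID={info_id},Number={number},"
        f'Type={info_type},Description="{description}">'
    )
    prefix = f"##INFO=<ID={info_id},"
    lines = list(vcf_lines)
    already_defined = any(line.startswith(prefix) for line in lines)
    out = []
    for line in lines:
        if line.startswith("#CHROM") and not already_defined:
            out.append(new_line)  # meta-information lines must precede #CHROM
        out.append(line)
    return out
```

A call might look like `add_info_header(lines, "ReadPosRankSum", 1, "Float", "Read position rank sum test")`, where every argument after the ID is an assumed value. Remember to re-compress and re-index the result if the tool consuming it expects a `.g.vcf.gz`.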
I get a similar error when I give GLnexus two g.vcf files produced by DeepVariant in a Snakemake workflow. With a single input file GLnexus works fine, but with both files I get this error:
[GLnexus] [error] deepgvcfssnake/c.g.vcf Exists: sample is currently being added (20 (deepgvcfssnake/c.g.vcf))
[GLnexus] [error] Failed to bulk load into DB: Failure: One or more gVCF inputs failed validation or database loading; check log for details.
(c.g.vcf is the second sample loaded; the first one loads with no problems.)
This happens even though, as I said, each file works on its own. However, if I use the example gVCFs downloaded from https://github.com/dnanexus-rnd/GLnexus/wiki/Getting-Started I have no problems.
@urano95 That's probably because the sample names in the two gVCF files collide with each other. GLnexus needs them to have unique names. See this issue for more info: https://github.com/dnanexus-rnd/GLnexus/issues/241
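A quick way to check for such a collision before running GLnexus is to compare the sample columns of the `#CHROM` header line across the input files. The sketch below assumes uncompressed, tab-delimited VCF text; `sample_names` and `find_collisions` are hypothetical helper names, not GLnexus functions.

```python
from collections import defaultdict


def sample_names(vcf_lines):
    """Return the sample names from the #CHROM header line (columns 10+)."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\n").split("\t")[9:]
    return []


def find_collisions(files):
    """Map each sample name found in more than one file to those files.

    `files` maps a file name to an iterable of that file's lines.
    """
    seen = defaultdict(list)
    for file_name, lines in files.items():
        for sample in sample_names(lines):
            seen[sample].append(file_name)
    return {sample: names for sample, names in seen.items() if len(names) > 1}
```

An empty result means every sample name is unique across the inputs. For real files, something like `find_collisions({p: open(p) for p in paths})` should do, or `gzip.open(p, "rt")` for compressed inputs.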
Oh, okay. Thanks for the reply.
@urano95 PS I pushed a change to clarify the lousy error message for this. If you can share where the sample names should be fed into the upstream pipeline, we can probably add a warning to the DeepVariant docs (since at least two users reported hitting this pitfall so far). cc @tedyun @cmclean @AndrewCarroll
Hi @mlin @urano95, by default DeepVariant uses the sample name in the SM tag of the BAM file, and you can also use the --sample_name flag when running DeepVariant to specify the sample name manually. I hope this helps, and we'd be happy to update our docs if you have any suggestions. https://github.com/google/deepvariant/blob/r1.1/scripts/run_deepvariant.py#L93
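To see which sample name a BAM file will contribute, it can help to look at the SM fields of its @RG header lines. The sketch below works on the header as text, for example the output of `samtools view -H file.bam`; `sm_tags` is a hypothetical helper, and it says nothing about how DeepVariant itself reads the tag.

```python
def sm_tags(sam_header_lines):
    """Return the distinct SM values from @RG header lines, in order seen."""
    names = []
    for line in sam_header_lines:
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SM:"):
                sample = field[3:]
                if sample not in names:
                    names.append(sample)
    return names
```

If two BAM files report the same SM value, their gVCFs will carry the same sample name unless it is overridden when running DeepVariant.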
Good morning,
@mlin the Snakemake pipeline I was talking about was just a test of the DeepVariant/GLnexus combination. I had only two paired-end samples, which I renamed a_1/a_2.fastq and c_1/c_2.fastq (just for easy wildcards), producing a.g.vcf and c.g.vcf, meant to be merged into an all.g.vcf file. The names of the two samples were originally different. But as @tedyun said, the name DeepVariant uses is the one in the SM tag of the BAM file. I tried Picard RenameSampleInVcf just to be sure, and GLnexus then worked with both samples at the same time. Maybe the error message could specify where the names need to differ. Thank you all for the replies.
Thanks for the info @urano95
@tedyun That's great that DV has a command-line flag to override the SM tag. Based on the users hitting this I think it'd be worthwhile to add a small-font notice to this section https://github.com/google/deepvariant/blob/r1.1/docs/trio-merge-case-study.md#run-deepvariant-on-trio-to-get-3-single-sample-vcfs perhaps something like (suggestion)
The BAM files should provide unique names for each sample in their SM header tag, which is usually derived from a command-line flag to the read aligner. If your BAM files don't have unique SM tags, and it's not feasible to adjust the alignment pipeline, add the --sample-name=XYZ flag to run_deepvariant to override the sample name written into the gVCF file header.
@mlin Thank you for the suggestion! I've added the text you suggested in our internal DeepVariant code base - it'll be available on GitHub in our next release :)
version: v1.1.5-1-gd7b4307
command: glnexus_cli --bed test2.bed --squeeze sample.g.vcf.gz > chr1.bcf
test2.bed: chr1 1 3000000
sample.g.vcf.gz: there are two files in total; some lines from one file are pasted below:
chr1 1000281 . C <NON_REF> . . END=1000284 GT:DP:GQ:MIN_DP:PL 0/0:36:99:34:0,99,1120
chr1 1000285 . C <NON_REF> . . END=1000290 GT:DP:GQ:MIN_DP:PL 0/0:30:81:29:0,81,1215
chr1 1000291 rs116904365 C G,<NON_REF> 100.77 . BaseQRankSum=-0.555;ClippingRankSum=0.000;DB;DP=17;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=61200.00;ReadPosRankSum=1.236 GT:AD:DP:GQ:PL:SB 0/1:11,6,0:17:99:129,0,283,161,301,463:4,7,4,2
chr1 1000292 . C <NON_REF> . . END=1000292 GT:DP:GQ:MIN_DP:PL 0/0:19:54:19:0,54,810
running errors:
[001957] [2020-01-14 21:02:50.887] [GLnexus] [error] sample1000000.g.vcf.gz IOError: reading from gVCF file (sample1000000.g.vcf.gz bcf1_t::errcode = 2; after chr1:1000285-1000290)
[001957] [2020-01-14 21:02:50.887] [GLnexus] [error] sample999001.g.vcf.gz IOError: reading from gVCF file (sample999001.g.vcf.gz bcf1_t::errcode = 2; after chr1:1000075-1000078)
[001957] [2020-01-14 21:02:50.897] [GLnexus] [error] Failed to bulk load into DB: Failure: One or more gVCF inputs failed validation or database loading; check log for details.