dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

An error appears when using glnexus_cli #207

Closed berry08 closed 4 years ago

berry08 commented 4 years ago

version: v1.1.5-1-gd7b4307
command: glnexus_cli --bed test2.bed --squeeze sample.g.vcf.gz >chr1.bcf
test2.bed: chr1 1 3000000
sample.g.vcf.gz: there are two files in total; some lines from one file are pasted below:
chr1 1000281 . C . . END=1000284 GT:DP:GQ:MIN_DP:PL 0/0:36:99:34:0,99,1120
chr1 1000285 . C . . END=1000290 GT:DP:GQ:MIN_DP:PL 0/0:30:81:29:0,81,1215
chr1 1000291 rs116904365 C G, 100.77 . BaseQRankSum=-0.555;ClippingRankSum=0.000;DB;DP=17;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=61200.00;ReadPosRankSum=1.236 GT:AD:DP:GQ:PL:SB 0/1:11,6,0:17:99:129,0,283,161,301,463:4,7,4,2
chr1 1000292 . C . . END=1000292 GT:DP:GQ:MIN_DP:PL 0/0:19:54:19:0,54,810

running errors:
[001957] [2020-01-14 21:02:50.887] [GLnexus] [error] sample1000000.g.vcf.gz IOError: reading from gVCF file (sample1000000.g.vcf.gz bcf1_t::errcode = 2; after chr1:1000285-1000290)
[001957] [2020-01-14 21:02:50.887] [GLnexus] [error] sample999001.g.vcf.gz IOError: reading from gVCF file (sample999001.g.vcf.gz bcf1_t::errcode = 2; after chr1:1000075-1000078)
[001957] [2020-01-14 21:02:50.897] [GLnexus] [error] Failed to bulk load into DB: Failure: One or more gVCF inputs failed validation or database loading; check log for details.

mlin commented 4 years ago

Hi @berry08 the "bcf1_t::errcode = 2" is an error from the htslib VCF parser which generally indicates that some FILTER/INFO/FORMAT field was not defined in the header (BCF_ERR_TAG_UNDEF). Do your input gVCF files have the complete headers defining all those fields? For example, if you're using tabix to take region slices, add "-h" so that it copies the original header too.
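To make the "field not defined in the header" condition concrete, here is a minimal pure-Python sketch that lists the FILTER/INFO/FORMAT names used in records but never declared in the header. It is only a rough approximation of the check, not how htslib implements it; the function name `find_undefined_tags` and the example lines are made up for illustration.

```python
import re

# Matches header definitions such as ##INFO=<ID=DP,Number=1,...>
HEADER_DEF = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def find_undefined_tags(lines):
    """Return {(kind, tag)} for tags used in records but not defined in the header."""
    # PASS is always a valid FILTER value, so treat it as defined.
    defined = {"INFO": set(), "FORMAT": set(), "FILTER": {"PASS"}}
    undefined = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            match = HEADER_DEF.match(line)
            if match:
                defined[match.group(1)].add(match.group(2))
            continue
        if line.startswith("#") or not line:
            continue  # the #CHROM column line, or a blank line
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        if cols[6] != ".":
            for name in cols[6].split(";"):
                if name not in defined["FILTER"]:
                    undefined.add(("FILTER", name))
        if cols[7] != ".":
            for entry in cols[7].split(";"):
                key = entry.split("=", 1)[0]
                if key not in defined["INFO"]:
                    undefined.add(("INFO", key))
        if len(cols) > 8:
            for key in cols[8].split(":"):
                if key not in defined["FORMAT"]:
                    undefined.add(("FORMAT", key))
    return undefined


# Hypothetical, heavily shortened example input (tab-separated).
example = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1',
    'chr1\t1000291\t.\tC\tG\t100.77\t.\tDP=17;ReadPosRankSum=1.236\tGT\t0/1',
]
print(find_undefined_tags(example))  # {('INFO', 'ReadPosRankSum')}
```

For a real file, the lines could come from `gzip.open(path, "rt")`, since bgzip-compressed files are readable as ordinary gzip.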

berry08 commented 4 years ago

Thanks, here is an example of my input gVCF, @mlin: sample1000000.g.vcf.txt

mlin commented 4 years ago

@berry08 it seems like "ReadPosRankSum" is the problematic INFO field which is not defined in the header.

$ bcftools view sample1000000.g.vcf.txt > /dev/null
[W::vcf_parse] INFO 'ReadPosRankSum' is not defined in the header, assuming Type=String
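One way to repair such a file is to put the missing definition back into the header. The sketch below does that on a list of VCF lines in plain Python; the helper `add_header_line` is hypothetical, and the `Number`, `Type`, and `Description` values shown for `ReadPosRankSum` are an assumption, so it is safer to copy the exact line from the original header written by the variant caller.

```python
# ASSUMPTION: these attribute values are illustrative; copy the real
# definition from the header originally produced by the variant caller.
READ_POS_RANK_SUM = (
    '##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,'
    'Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">'
)


def add_header_line(lines, definition):
    """Return the VCF lines with `definition` inserted just before #CHROM.

    If the exact definition line is already present, nothing is added.
    """
    lines = [line.rstrip("\n") for line in lines]
    if definition in lines:
        return lines
    out = []
    for line in lines:
        if line.startswith("#CHROM"):
            out.append(definition)  # meta-information lines precede #CHROM
        out.append(line)
    return out


# Hypothetical minimal input with the definition missing.
vcf_lines = [
    '##fileformat=VCFv4.2',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1',
    'chr1\t1000291\t.\tC\tG\t100.77\t.\tReadPosRankSum=1.236\tGT\t0/1',
]
for line in add_header_line(vcf_lines, READ_POS_RANK_SUM):
    print(line)
```

In practice the output would need to be recompressed with bgzip before use; `bcftools annotate` with its `--header-lines` option is another route to the same result.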

mlin commented 4 years ago

@berry08 can we help you further here?

Ginushika commented 3 years ago

Hello, I came across a similar error. Can you please tell me how to add this information to the header?

nliorni commented 3 years ago

I get a similar error when I give GLnexus two g.vcf files produced by DeepVariant in a Snakemake workflow. With a single input file GLnexus works fine, but with both files I get this error:

[GLnexus] [error] deepgvcfssnake/c.g.vcf Exists: sample is currently being added (20 (deepgvcfssnake/c.g.vcf))
(this is the second sample loaded; the first one loads with no problems)
[GLnexus] [error] Failed to bulk load into DB: Failure: One or more gVCF inputs failed validation or database loading; check log for details.

even though, as I said, each file works individually. If I use the example gVCFs downloaded from https://github.com/dnanexus-rnd/GLnexus/wiki/Getting-Started I have no problems.

mlin commented 3 years ago

@urano95 That's probably because the sample names in the two gVCF files collide with each other. GLnexus needs them to have unique names. See this issue for more info: https://github.com/dnanexus-rnd/GLnexus/issues/241
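A quick way to confirm a collision before running glnexus_cli is to compare the sample columns of the `#CHROM` header lines across the input files. The following is a minimal sketch under that assumption; `sample_names` and `find_collisions` are hypothetical helpers, not part of GLnexus.

```python
import gzip
from collections import defaultdict


def sample_names(path):
    """Return the sample names from the #CHROM line of a VCF/gVCF file."""
    # bgzip output is valid multi-member gzip, so gzip.open can read it.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed; sample names start at column 10.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached records without seeing a #CHROM line
    return []


def find_collisions(paths):
    """Map each sample name found in more than one place to the files holding it."""
    seen = defaultdict(list)
    for path in paths:
        for name in sample_names(path):
            seen[name].append(path)
    return {name: files for name, files in seen.items() if len(files) > 1}
```

For example, `find_collisions(["a.g.vcf", "c.g.vcf"])` returns an empty dict when every sample name is unique, and otherwise lists the files sharing each duplicated name.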

nliorni commented 3 years ago

Oh, okay. Thanks for the reply.

mlin commented 3 years ago

@urano95 PS I pushed a change to clarify the lousy error message for this. If you can share where the sample names should be fed into the upstream pipeline, we can probably add a warning to the DeepVariant docs (since at least two users reported hitting this pitfall so far). cc @tedyun @cmclean @AndrewCarroll

tedyun commented 3 years ago

Hi @mlin @urano95 DeepVariant uses by default the sample name in the SM tag in the BAM file, and you can also use the --sample_name flag when running DeepVariant to manually specify the sample name. I hope this helps, and we'd be happy to update our docs if you have any suggestions. https://github.com/google/deepvariant/blob/r1.1/scripts/run_deepvariant.py#L93
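To see which sample name a BAM file carries, the SM values can be pulled from the `@RG` lines of its header. The sketch below parses header text such as the output of `samtools view -H`; the function `read_group_samples` is a hypothetical helper, and this only inspects the header rather than reproducing DeepVariant's own logic.

```python
def read_group_samples(header_text):
    """Return the set of SM values found on @RG lines of a SAM/BAM header."""
    samples = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                samples.add(field[3:])
    return samples


# Hypothetical header text with two read groups sharing one sample name.
header_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:rg1\tSM:a\tPL:ILLUMINA\n"
    "@RG\tID:rg2\tSM:a\tPL:ILLUMINA\n"
)
print(read_group_samples(header_text))  # {'a'}
```

If two BAM files report the same SM value, the resulting gVCFs will collide unless the name is overridden.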

nliorni commented 3 years ago

Good morning,

@mlin the Snakemake pipeline I was talking about was just a test of the DeepVariant/GLnexus combination. I had two paired-end samples, which I renamed a_1/a_2.fastq and c_1/c_2.fastq (just to make the wildcards easy), producing a.g.vcf and c.g.vcf, meant to be merged into an all.g.vcf file. The names of the two samples were originally different. But as @tedyun said, the name DeepVariant uses is the one in the SM tag in the BAM file. I tried Picard RenameSampleInVcf just to be sure, and GLnexus then worked with both samples at the same time. Maybe the error message could specify where the names need to differ. Thank you all for the replies.
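For anyone without Picard at hand, the rename amounts to changing the sample column of the `#CHROM` line in a single-sample gVCF. This is a minimal sketch of that idea only, with a hypothetical helper `rename_sample`; it is not what Picard RenameSampleInVcf does internally.

```python
def rename_sample(lines, new_name):
    """Return the lines of a single-sample VCF with the sample renamed."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            cols = line.split("\t")
            if len(cols) != 10:
                raise ValueError("expected exactly one sample column")
            cols[9] = new_name  # column 10 holds the sample name
            line = "\t".join(cols)
        out.append(line)
    return out


# Hypothetical minimal input.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tdefault",
]
print(rename_sample(vcf_lines, "sample_c")[-1].split("\t")[-1])  # sample_c
```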

mlin commented 3 years ago

Thanks for the info @urano95

@tedyun That's great that DV has a command-line flag to override the SM tag. Based on the users hitting this, I think it'd be worthwhile to add a small-font notice to this section https://github.com/google/deepvariant/blob/r1.1/docs/trio-merge-case-study.md#run-deepvariant-on-trio-to-get-3-single-sample-vcfs, perhaps something like this (suggestion):

The BAM files should provide unique names for each sample in their SM header tag, which is usually derived from a command-line flag to the read aligner. If your BAM files don't have unique SM tags, and it's not feasible to adjust the alignment pipeline, add the --sample_name=XYZ flag to run_deepvariant to override the sample name written into the gVCF file header.

tedyun commented 3 years ago

@mlin Thank you for the suggestion! I've added the text you suggested in our internal DeepVariant code base - it'll be available on GitHub in our next release :)