brentp / cyvcf2

cython + htslib == fast VCF and BCF processing
MIT License
376 stars 72 forks source link

Bug: cyvcf2 does not raise an exception for an improperly formatted VCF #161

Open nh13 opened 4 years ago

nh13 commented 4 years ago

If the PS format field has type Integer but the values are strings, then cyvcf2 will skip the variant but not throw an exception. I would have expected, by default, that the underlying error to propagate up as an exception. I'd be ok with skipping the variant if I had set some option to be "lenient". If I change the type to String all works fine, but the silent error did cause some strange downstream results.

>>> from cyvcf2 import VCF as VcfReader
>>> reader = VcfReader("fail.vcf")
>>> list(reader)
[E::vcf_parse_format] Invalid character 'P' in 'PS' FORMAT field at 1:4001310
[]

You can see the VCF for the test above here:

`fail.vcf` ``` ##fileformat=VCFv4.2 ##CL=vcffilter -i filtered-phase-transfer.vcf.gz -o - --javascript "ensureFormatHeader(\"##FORMAT=\"); function record() {if(INTEGRATION.GT==\"1/1\") { INTEGRATION.IPS=\".\"; INTEGRATION.PS=\"HOMVAR\"; INTEGRATION.GT=\"1|1\";} else {if((INTEGRATION.GT==\"0/1\" || INTEGRATION.GT==\"1/2\" || INTEGRATION.GT==\"2/1\" || INTEGRATION.GT==\"1/0\") ) {if(INTEGRATION.IPS.length>1) {INTEGRATION.PS=INTEGRATION.IPS; INTEGRATION.GT=INTEGRATION.IGT;} else {INTEGRATION.PS=\".\";};} else { if((INTEGRATION.IPS.length<2)) { INTEGRATION.IPS=\".\";} INTEGRATION.PS=\"PATMAT\";};};}" ##FILTER= ##FILTER=0.8"> ##FILTER= ##FILTER= ##FILTER= ##FILTER= ##FILTER= ##FILTER= ##FILTER= ##FORMAT= ##FORMAT= ##FORMAT= ##FORMAT= ##FORMAT= ##FORMAT= ##FORMAT= ##FORMAT= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##RUN-ID=16dacf15-fdc9-4199-84bd-723ea8bcddef ##contig= #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 1 4001310 . G T 50 PASS callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;callsets=5;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;datasets=4;datasetsmissingcall=IonExome,SolidPE50x50bp;platformnames=Illumina,CG,10X,Solid;platforms=4 GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:571:133,124:160,157:1011:0|1:4001310_G_T:PATMAT ```

You can see the original GRCh38 NA12878 VCF below. You'll see that the type for the PS format field is set as String, which is not valid according to the VCF spec (sigh). I'll have to report that some where. But even so, I'd expected the parsing error to be propagated.

Getting the original VCF ```bash bcftools view https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz chr1:4001310-4001310 | grep PS ##CL=vcffilter -i filtered-phase-transfer.vcf.gz -o - --javascript "ensureFormatHeader(\"##FORMAT=\"); function record() {if(INTEGRATION.GT==\"1/1\") { INTEGRATION.IPS=\".\"; INTEGRATION.PS=\"HOMVAR\"; INTEGRATION.GT=\"1|1\";} else {if((INTEGRATION.GT==\"0/1\" || INTEGRATION.GT==\"1/2\" || INTEGRATION.GT==\"2/1\" || INTEGRATION.GT==\"1/0\") ) {if(INTEGRATION.IPS.length>1) {INTEGRATION.PS=INTEGRATION.IPS; INTEGRATION.GT=INTEGRATION.IGT;} else {INTEGRATION.PS=\".\";};} else { if((INTEGRATION.IPS.length<2)) { INTEGRATION.IPS=\".\";} INTEGRATION.PS=\"PATMAT\";};};}" ##INFO= ##FORMAT= ##FORMAT= chr1 4001310 . G T 50 PASS platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:571:133,124:160,157:1011:0|1:4001310_G_T:PATMAT ```
brentp commented 4 years ago

hmm. yeah. this needs to be addressed here: https://github.com/brentp/cyvcf2/blob/2dd88453ff4ec619311f860a2df4651f8b2ccf04/cyvcf2/cyvcf2.pyx#L572-L577

htslib returns -1 from bcf_read in this case, and prints a message. cyvcf2 currently just stops iteration.

nh13 commented 4 years ago

@brentp how about a param to cyvcf2 to either 'stop', 'keep going', or 'throw an exception'? Similar to picard's strict, lenient, or silent.

brentp commented 4 years ago

So it looks like the way to handle this in cyvcf2 is to check the errcode which gets set when there is a problem.

It is set to one of these possible values AFAICT: https://github.com/samtools/htslib/blob/4162046b28a7d9d8a104ce28086d9467cc05c212/htslib/vcf.h#L193-L199

I think we can ignore invalid contig and raise exception on others.

brentp commented 4 years ago

i think it's simpler for the user if it fails on a < 0 error except those related to undefined contigs.

brentp commented 4 years ago

closed in #162

dbolser commented 1 year ago

Should this be closed here?

dbolser commented 1 year ago

@nh13 can you add a very simple fail.vcf to the tests and check that it fails correctly?

nh13 commented 1 year ago

@dbolser would you like to reopen this issue, and assign it to me?

dbolser commented 1 year ago

@dbolser would you like to reopen this issue, and assign it to me?

Hi @nh13, thanks for getting back to me on this.

I'm not actually a project admin, I just like opensource software community development, so I don't feel shy bossing people about ;-)

@brentp, can you reopen this issue and assign it to @nh13 please?

@nh13, please let us know if you have any problem figuring out how to add the test into the existing framework and how to submit a PR.

Cheers both!

nh13 commented 1 year ago

Thanks, I’m unlikely to get to this anytime soon, so if others want to add the test VCF before I do, I won’t be offended.