cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

Bug: Getting negative, non-one values in vcf['calldata/DP'] and vcf[calldata/AD'] #325

Closed cassiawag closed 4 years ago

cassiawag commented 4 years ago

Hello! I'm using scikit-allel version 1.2.1 on python 3.6.10. I'm reading in vcf files using vcf = allel.read_vcf(file, 'variants/POS', 'variants/REF', 'variants/ALT', 'calldata/DP', 'calldata/RD', 'calldata/AD', 'calldata/RDF', 'calldata/RDR', 'calldata/ADF', 'calldata/ADR').

I have one sample in the vcf and only one variant per position, so I call total read depth and variant supporting reads using vcf['calldata/DP'][:,0] and vcf['calldata/AD'][:,0,0]. When comparing back to the original vcf, this calls the correct values for most variants. But a few variants will have a negative number (not negative one indicating a missing alternate) but something like -27466 or -19240. These numbers are NOT the negative version of the number of reads listed in the vcf, and I can't figure out where they're coming from or why most of the variants display the correct information.

Note: I haven't identified this behavior with any other calldata fields including 'calldata/RD', 'calldata/RDF', 'calldata/RDR', 'calldata/ADF', and 'calldata/ADR'. I've been looking at lots of vcf files, and this behavior doesn't occur for every vcf file. Although I cannot find anything distinguishing about the vcf files that do result in negative numbers when read in with scikit-allel. The VCF files I'm looking at are all generated through the same pipeline, so differences would have to be with the original reads used to generate them.

Thanks for your help!

And sorry for all the opening and closing of this issue. I accidentally opened the issue before I'd finished typing it all out :/

alimanfoo commented 4 years ago

Hi @cassiawag

But a few variants will have a negative number (not negative one indicating a missing alternate) but something like -27466 or -19240. These numbers are NOT the negative version of the number of reads listed in the vcf, and I can't figure out where they're coming from or why most of the variants display the correct information.

Could this be from large numbers overflowing the data type? By default calldata/AD and calldata/DP are given an int16 data type, which means any value larger than 32767 will overflow.

cassiawag commented 4 years ago

Ahh, yes, of course! It's fixed when I specify a larger data type. Obvious now. Thanks for the troubleshooting help.

alimanfoo commented 4 years ago

No problem 😄