cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

Negative GQ values after vcf_to_hdf5 #341

Open grlewis333 opened 3 years ago

grlewis333 commented 3 years ago

Hi, I was comparing some different VCF extraction packages and have been really impressed by the extraction speed in the h5 output of vcf_to_hdf5. However, I was doing some basic tests just to check outputs were the same from each method and have found some strange behaviour in the scikit-allel output.

Specifically I found that my GQ values >128 appear to have a negative offset...

Here is an example plotting GQ against sample DP and you can see the (orange) scikit-allel result has this strange offset: image

For example, row 265 of the VCF is: chr17 1005289 . A G 431133 PASS AC=6;AF=0.823;AN=6;DP=109787;FS=0;MQ=249.96;MQRankSum=5.615;QD=3.94;ReadPosRankSum=3.074;SOR=0.7 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP 1/1:0,53:1:53:156:PASS:.:.:244,159,0:206,156,0 1/1:0,40:1:40:117:PASS:.:.:205,120,0:167.3,117.3,0 1/1:0,50:1:50:147:PASS:.:.:235,150,0:197.4,147.4,0

You can see that the correct GQ for sample 1 is 156, but the value I get through scikit-allel is -100:

allel.vcf_to_hdf5('fpath.vcf.gz', 'fpath.h5', fields='*', overwrite=True)
callset = h5py.File(fpath.h5,mode='r')
GQ = callset['calldata/GQ']
GQ_0 = GQs[:,0]
print(GQ_0[265])

>>> -100

I found 271 cases where the GQ is offset and in each cases it is offset by either 28 (256) or 211 (2048) which can't be a coincidence.

Any idea what might be causing this issue?

hardingnj commented 3 years ago

Looks like integer overflow. GQ is limited to 99, so the default dtype will reflect this. (though at time of writing can't find a source for this).

You can pass a dtype dict to the function if you have a good reason for GQ >= 128.

grlewis333 commented 3 years ago

Ah thanks very much! I just added types={'calldata/GQ': 'i2'} as a parameter in the vcf_to_hdf5 call and now it works as expected.

Cheers for the fast response!

hardingnj commented 3 years ago

Cool- nice work on the comparisons by the way- useful to have this type of thing out there. Please drop a note to the google group if you write this up into a blog post/similar.

I might reopen this- just to decide whether it's worth having an explicit check on the max value of GQ, otherwise could catch people out. Not an easy thing to trace.

What is the source of your VCF? If it's something mainstream producing GQs > 99 we should look at supporting this.

NB: https://github.com/cggh/scikit-allel/blob/5f2f73ff926e105e5a65cacfc28f2928aba70f69/allel/io/vcf_read.py#L1392 Is the relevant line that defaults to i1.