Open grlewis333 opened 3 years ago
Looks like integer overflow.
GQ is limited to 99
, so the default dtype will reflect this.
(though at time of writing can't find a source for this).
You can pass a dtype
dict to the function if you have a good reason for GQ >= 128.
Ah thanks very much! I just added types={'calldata/GQ': 'i2'}
as a parameter in the vcf_to_hdf5 call and now it works as expected.
Cheers for the fast response!
Cool- nice work on the comparisons by the way- useful to have this type of thing out there. Please drop a note to the google group if you write this up into a blog post/similar.
I might reopen this- just to decide whether it's worth having an explicit check on the max value of GQ, otherwise could catch people out. Not an easy thing to trace.
What is the source of your VCF? If it's something mainstream producing GQs > 99 we should look at supporting this.
NB:
https://github.com/cggh/scikit-allel/blob/5f2f73ff926e105e5a65cacfc28f2928aba70f69/allel/io/vcf_read.py#L1392
Is the relevant line that defaults to i1
.
Hi, I was comparing some different VCF extraction packages and have been really impressed by the extraction speed in the h5 output of vcf_to_hdf5. However, I was doing some basic tests just to check outputs were the same from each method and have found some strange behaviour in the scikit-allel output.
Specifically I found that my GQ values >128 appear to have a negative offset...
Here is an example plotting GQ against sample DP and you can see the (orange) scikit-allel result has this strange offset:
For example, row 265 of the VCF is: chr17 1005289 . A G 431133 PASS AC=6;AF=0.823;AN=6;DP=109787;FS=0;MQ=249.96;MQRankSum=5.615;QD=3.94;ReadPosRankSum=3.074;SOR=0.7 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP 1/1:0,53:1:53:156:PASS:.:.:244,159,0:206,156,0 1/1:0,40:1:40:117:PASS:.:.:205,120,0:167.3,117.3,0 1/1:0,50:1:50:147:PASS:.:.:235,150,0:197.4,147.4,0
You can see that the correct GQ for sample 1 is 156, but the value I get through scikit-allel is -100:
I found 271 cases where the GQ is offset and in each cases it is offset by either 28 (256) or 211 (2048) which can't be a coincidence.
Any idea what might be causing this issue?