MRCIEU / gwas2vcf

Convert GWAS summary statistics to VCF
MIT License

VCF cannot store floats < 1e-6 #8

Closed mcgml closed 5 years ago

mcgml commented 5 years ago

@explodecomputer @elswob We already encountered this issue for storing p-values but there are some cases where the effect size is less than 1e-6 in UKBB. They would end up as 0 in the VCF file. Should we log transform or leave as is?
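
For reference, the log-transform option mentioned above could look like the following minimal sketch for p-values. The function names are hypothetical and not part of gwas2vcf, and the example p-value is arbitrary. Note that a signed effect size cannot be log-transformed directly, so this encoding only applies naturally to values in (0, 1].

```python
import math


def to_neg_log10(p):
    """Encode a p-value as -log10(p) so tiny values become ordinary-sized floats."""
    if not 0 < p <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p)


def from_neg_log10(neg_log10_p):
    """Decode a -log10(p) value back to the p-value."""
    return 10 ** (-neg_log10_p)


# Arbitrary illustrative p-value, far below 1e-6.
p = 1e-10
encoded = to_neg_log10(p)          # roughly 10
decoded = from_neg_log10(encoded)  # roughly 1e-10
```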

mcgml commented 5 years ago

changed effect to string

mcgml commented 5 years ago

@explodecomputer we need to discuss this

explodecomputer commented 5 years ago

thanks matt, could you please send me a path to an example file that has this issue? i'll put something in the calendar to discuss. does anybody else need to be involved?

mcgml commented 5 years ago

@explodecomputer it's these BCF files on bc4, in /mnt/storage/home/ml18692/ukbiobank/vcf_03_19. For example, have a look at slurm-1928986.out. The git commit used to create the files was dac862e626a2e560a8ee715ab1753dd2469c291d

explodecomputer commented 5 years ago

@mcgml i don't seem to have read access, can you drop them here /mnt/storage/private/mrcieu/research/mr-eve/scratch please?

mcgml commented 5 years ago

I don't have write permission. Here is the relevant data:

slurm-1928986.out:2019-03-19 17:12:08,772 WARNING Effect field smaller than VCF specification. Expect loss of precision for: 5.89127e-07

data.batch_100001.txt.gz:
SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO CHISQ_LINREG P_LINREG BETA SE CHISQ_BOLT_LMM_INF P_BOLT_LMM_INF
rs926250 1 9374375 0.191709 G A 0.284393 0.989487 0.000244052 9.9E-01 -5.89127e-07 0.00603322 9.53497e-09 1.0E+00

In this case the effect size (BETA) is -5.89127e-07, whose magnitude is below the smallest float the VCF can store (1e-6)
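
A check of this kind can be sketched as follows, using the header and row quoted above. This is an illustration only, not the gwas2vcf implementation: the threshold constant and the helper name are assumptions, with 1e-6 taken from the issue title.

```python
# Assumption: 1e-6 is the smallest magnitude that survives, per the issue title.
VCF_FLOAT_MIN = 1e-6

# Header and row as quoted from data.batch_100001.txt.gz in this thread.
HEADER = ("SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO CHISQ_LINREG "
          "P_LINREG BETA SE CHISQ_BOLT_LMM_INF P_BOLT_LMM_INF")
ROW = ("rs926250 1 9374375 0.191709 G A 0.284393 0.989487 0.000244052 "
       "9.9E-01 -5.89127e-07 0.00603322 9.53497e-09 1.0E+00")


def below_threshold(value, threshold=VCF_FLOAT_MIN):
    """Return True for a non-zero value whose magnitude is under the threshold.

    Hypothetical helper; the comparison uses abs() because effects are signed.
    """
    return value != 0 and abs(value) < threshold


# Map column names to the whitespace-separated fields of the row.
record = dict(zip(HEADER.split(), ROW.split()))
beta = float(record["BETA"])
se = float(record["SE"])

if below_threshold(beta):
    print(f"{record['SNP']}: effect {beta} would lose precision")
```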

explodecomputer commented 5 years ago

if all the effects are like this then there is a problem, but otherwise more than about 4 decimal places is not really necessary. i can look into it more, can you try again with this directory? /mnt/storage/private/mrcieu/research/mr-eve/scratch

mcgml commented 5 years ago

Copied across! The vast majority of files have at least one row like this:

[ml18692@bc4login2 vcf_03_19]$ grep "field smaller than VCF specification" slurm-* | cut -d: -f1 | sort -u | wc -l
80
[ml18692@bc4login2 vcf_03_19]$ ls slurm-* | wc -l
84

But only a few rows genome-wide

mcgml commented 5 years ago

Does colocalisation analysis work with imprecise floats?

explodecomputer commented 5 years ago

Thanks I had a look, this looks totally fine. If all the files giving problems are of this ilk then there is no issue with it being rounded to 0. Colocalisation won't be affected by that sort of loss of precision

mcgml commented 5 years ago

Great, thanks. I guess it's possible we might encounter the same for SE in the future. Is that also OK? I will switch the effect back to float and let it round to 0.

explodecomputer commented 5 years ago

if the se is really close to 0 then it would have to be a really massive effect, it should be quite unlikely. but if it is too small then i would opt to round it to the smallest floating point value (1e-6?)
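
The policy discussed in this thread (let tiny effects round to 0, but raise a too-small SE to the smallest storable value) might be sketched like this. The function names are hypothetical and the 1e-6 threshold is the tentative figure from the comment above; this is not the code from the commit referenced in this issue.

```python
# Assumption: 1e-6 is the smallest storable value, as tentatively suggested above.
SMALLEST_FLOAT = 1e-6


def round_effect(beta, threshold=SMALLEST_FLOAT):
    """Round an effect size to 0 when its magnitude is below the threshold."""
    return 0.0 if abs(beta) < threshold else beta


def floor_se(se, threshold=SMALLEST_FLOAT):
    """Raise a standard error to the threshold so it is never stored as 0."""
    if se < 0:
        raise ValueError("standard error cannot be negative")
    return max(se, threshold)


# Values from the example row in this thread.
effect = round_effect(-5.89127e-07)  # rounded to 0.0
stderr = floor_se(0.00603322)        # unchanged, already above the threshold
```

Keeping the SE strictly positive avoids a division by zero wherever a downstream tool computes effect / SE.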

mcgml commented 5 years ago

OK thanks will do

mcgml commented 5 years ago

@explodecomputer are you happy with this: b596286b6cc206d9b7296b2b919f2020cf3158cd

explodecomputer commented 5 years ago

@mcgml magnificent