MRCIEU / gwas2vcf

Convert GWAS summary statistics to VCF
MIT License
47 stars 18 forks source link

allow for storing multiple traits per VCF (use the sample field) #15

Closed mcgml closed 5 years ago

mcgml commented 5 years ago

Also helpful for storage with GenomicsDB?

mcgml commented 5 years ago

@explodecomputer been thinking about the idea of storing multiple trait associations in a single VCF/BCF which as discussed has the following advantages:

This is what the code looks like: https://github.com/MRCIEU/gwas_harmonisation/commit/7ff3138dc8c1694ff9e6e4773fabe70c447c2dcf

And the output for LDL-c (MRB #300):

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All   filters passed">
##FORMAT=<ID=EFFECT,Number=A,Type=Float,Description="Effect   size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=A,Type=Float,Description="Standard   error of effect size estimate">
##FORMAT=<ID=L10PVAL,Number=A,Type=Float,Description="-log10   p-value for effect estimate">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Alternate   allele frequency in the association study">
##FORMAT=<ID=N,Number=A,Type=Float,Description="Sample   size used to estimate genetic effect">
##FORMAT=<ID=ZSCORE,Number=A,Type=Float,Description="Z-score   provided if it was used to derive the EFFECT and SE fields">
##FORMAT=<ID=SIMPINFO,Number=A,Type=Float,Description="Accuracy   score of summary data imputation">
##FORMAT=<ID=PROPCASES,Number=A,Type=Float,Description="Proportion   of sample size that were cases in the GWAS">
##contig=<ID=1,length=249250621,assembly=GRCh37>
##contig=<ID=2,length=243199373,assembly=GRCh37>
##contig=<ID=3,length=198022430,assembly=GRCh37>
##contig=<ID=4,length=191154276,assembly=GRCh37>
##contig=<ID=5,length=180915260,assembly=GRCh37>
##contig=<ID=6,length=171115067,assembly=GRCh37>
##contig=<ID=7,length=159138663,assembly=GRCh37>
##contig=<ID=8,length=146364022,assembly=GRCh37>
##contig=<ID=9,length=141213431,assembly=GRCh37>
##contig=<ID=10,length=135534747,assembly=GRCh37>
##contig=<ID=11,length=135006516,assembly=GRCh37>
##contig=<ID=12,length=133851895,assembly=GRCh37>
##contig=<ID=13,length=115169878,assembly=GRCh37>
##contig=<ID=14,length=107349540,assembly=GRCh37>
##contig=<ID=15,length=102531392,assembly=GRCh37>
##contig=<ID=16,length=90354753,assembly=GRCh37>
##contig=<ID=17,length=81195210,assembly=GRCh37>
##contig=<ID=18,length=78077248,assembly=GRCh37>
##contig=<ID=19,length=59128983,assembly=GRCh37>
##contig=<ID=20,length=63025520,assembly=GRCh37>
##contig=<ID=21,length=48129895,assembly=GRCh37>
##contig=<ID=22,length=51304566,assembly=GRCh37>
##contig=<ID=X,length=155270560,assembly=GRCh37>
##contig=<ID=Y,length=59373566,assembly=GRCh37>
##contig=<ID=MT,length=16569,assembly=GRCh37>
##contig=<ID=GL000207.1,length=4262,assembly=GRCh37>
##contig=<ID=GL000226.1,length=15008,assembly=GRCh37>
##contig=<ID=GL000229.1,length=19913,assembly=GRCh37>
##contig=<ID=GL000231.1,length=27386,assembly=GRCh37>
##contig=<ID=GL000210.1,length=27682,assembly=GRCh37>
##contig=<ID=GL000239.1,length=33824,assembly=GRCh37>
##contig=<ID=GL000235.1,length=34474,assembly=GRCh37>
##contig=<ID=GL000201.1,length=36148,assembly=GRCh37>
##contig=<ID=GL000247.1,length=36422,assembly=GRCh37>
##contig=<ID=GL000245.1,length=36651,assembly=GRCh37>
##contig=<ID=GL000197.1,length=37175,assembly=GRCh37>
##contig=<ID=GL000203.1,length=37498,assembly=GRCh37>
##contig=<ID=GL000246.1,length=38154,assembly=GRCh37>
##contig=<ID=GL000249.1,length=38502,assembly=GRCh37>
##contig=<ID=GL000196.1,length=38914,assembly=GRCh37>
##contig=<ID=GL000248.1,length=39786,assembly=GRCh37>
##contig=<ID=GL000244.1,length=39929,assembly=GRCh37>
##contig=<ID=GL000238.1,length=39939,assembly=GRCh37>
##contig=<ID=GL000202.1,length=40103,assembly=GRCh37>
##contig=<ID=GL000234.1,length=40531,assembly=GRCh37>
##contig=<ID=GL000232.1,length=40652,assembly=GRCh37>
##contig=<ID=GL000206.1,length=41001,assembly=GRCh37>
##contig=<ID=GL000240.1,length=41933,assembly=GRCh37>
##contig=<ID=GL000236.1,length=41934,assembly=GRCh37>
##contig=<ID=GL000241.1,length=42152,assembly=GRCh37>
##contig=<ID=GL000243.1,length=43341,assembly=GRCh37>
##contig=<ID=GL000242.1,length=43523,assembly=GRCh37>
##contig=<ID=GL000230.1,length=43691,assembly=GRCh37>
##contig=<ID=GL000237.1,length=45867,assembly=GRCh37>
##contig=<ID=GL000233.1,length=45941,assembly=GRCh37>
##contig=<ID=GL000204.1,length=81310,assembly=GRCh37>
##contig=<ID=GL000198.1,length=90085,assembly=GRCh37>
##contig=<ID=GL000208.1,length=92689,assembly=GRCh37>
##contig=<ID=GL000191.1,length=106433,assembly=GRCh37>
##contig=<ID=GL000227.1,length=128374,assembly=GRCh37>
##contig=<ID=GL000228.1,length=129120,assembly=GRCh37>
##contig=<ID=GL000214.1,length=137718,assembly=GRCh37>
##contig=<ID=GL000221.1,length=155397,assembly=GRCh37>
##contig=<ID=GL000209.1,length=159169,assembly=GRCh37>
##contig=<ID=GL000218.1,length=161147,assembly=GRCh37>
##contig=<ID=GL000220.1,length=161802,assembly=GRCh37>
##contig=<ID=GL000213.1,length=164239,assembly=GRCh37>
##contig=<ID=GL000211.1,length=166566,assembly=GRCh37>
##contig=<ID=GL000199.1,length=169874,assembly=GRCh37>
##contig=<ID=GL000217.1,length=172149,assembly=GRCh37>
##contig=<ID=GL000216.1,length=172294,assembly=GRCh37>
##contig=<ID=GL000215.1,length=172545,assembly=GRCh37>
##contig=<ID=GL000205.1,length=174588,assembly=GRCh37>
##contig=<ID=GL000219.1,length=179198,assembly=GRCh37>
##contig=<ID=GL000224.1,length=179693,assembly=GRCh37>
##contig=<ID=GL000223.1,length=180455,assembly=GRCh37>
##contig=<ID=GL000195.1,length=182896,assembly=GRCh37>
##contig=<ID=GL000212.1,length=186858,assembly=GRCh37>
##contig=<ID=GL000222.1,length=186861,assembly=GRCh37>
##contig=<ID=GL000200.1,length=187035,assembly=GRCh37>
##contig=<ID=GL000193.1,length=189789,assembly=GRCh37>
##contig=<ID=GL000194.1,length=191469,assembly=GRCh37>
##contig=<ID=GL000225.1,length=211173,assembly=GRCh37>
##contig=<ID=GL000192.1,length=547496,assembly=GRCh37>
##gwas_harmonisation_command=--out   test/data/example.vcf --data test/data/example.h.txt --ref   /data/db/human/gatk/2.8/b37/human_g1k_v37.fasta --id 300 --json   test/example.json; 1.0.1
##file_date=2019-08-12T14:06:55.142804
##counts.total_variants=9
##counts.variants_not_read=0
##counts.harmonised_variants=9
##counts.variants_not_harmonised=0
##counts.switched_alleles=9
##cohort_sample_size=None
##cohort_frac_cases=None
##gwas.id=300
##gwas.type=continuous

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | 300

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- 10 | 9960129 | rs4747841 | A | G | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0037:0.0052:0.145208:0.5092:89138:.:.:. 10 | 9960259 | rs4749917 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0033:0.0052:0.11081:0.5092:89138:.:.:. 10 | 9960453 | rs12219605 | G | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0035:0.0052:0.131179:0.5092:89138:.:.:. 10 | 100012739 | rs737656 | A | G | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0099:0.0054:1.39794:0.6794:89888:.:.:. 10 | 100012890 | rs737657 | A | G | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0084:0.0054:1.07428:0.6794:89888:.:.:. 10 | 100013563 | rs7086391 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0075:0.0067:0.570409:0.219:89888:.:.:. 10 | 100013815 | rs878177 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0073:0.0055:0.861382:0.3483:89888:.:.:. 10 | 100015603 | rs3763689 | G | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0066:0.0063:0.489187:0.2203:89888:.:.:. 10 | 100016339 | rs1983865 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0037:0.0052:0.469288:0.4591:89888:.:.:.

What do you think?

explodecomputer commented 5 years ago

Wow interesting! Can you give me an example of how you would extract specific queries?

Questions that arise:

explodecomputer commented 5 years ago

Ok so a query would look like this?

bcftools view -O --samples 300 --include 'L10PVAL>8'

that would extract everything with log 10 pval > 8 for dataset 300?

mcgml commented 5 years ago

Yeh that's correct, I'll get back to you with some examples


From: gibran hemani notifications@github.com Sent: 12 August 2019 15:23:54 To: MRCIEU/gwas_harmonisation gwas_harmonisation@noreply.github.com Cc: Matt Lyon matt.lyon@bristol.ac.uk; Assign assign@noreply.github.com Subject: Re: [MRCIEU/gwas_harmonisation] allow for storing multiple traits per VCF (use the sample field) (#15)

Ok so a query would look like this?

bcftools view -O --samples 300 --include 'L10PVAL>8'

that would extract everything with log 10 pval > 8 for dataset 300?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/MRCIEU/gwas_harmonisation/issues/15?email_source=notifications&email_token=ADC5GSVXSUCAUYJJ4CD2PHDQEFW7VA5CNFSM4ILA4Z2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4CWDAY#issuecomment-520446339, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ADC5GSTDL4WYKMJUCZKMFN3QEFW7VANCNFSM4ILA4Z2A.

mcgml commented 5 years ago
  1. How is file size affected

44 UK Biobank BCFs = 11.24 GB individually or 4.0GB merged

[ml18692@bc4login3 vcf_08_19]$ ls -lh *bcf
-rw-rw----+ 1 ml18692 sscm 249M Aug 13 10:37 data.batch_189.bcf
-rw-rw----+ 1 ml18692 sscm 246M Aug 13 10:38 data.batch_20001_1017.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:38 data.batch_20001_1052.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:44 data.batch_20003_1140862382.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:45 data.batch_20003_1140862742.bcf
-rw-rw----+ 1 ml18692 sscm 249M Aug 13 10:45 data.batch_20003_1141146234.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:38 data.batch_20003_1141164616.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:37 data.batch_20003_1141167308.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:36 data.batch_41200_A798.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:45 data.batch_41200_J241.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:36 data.batch_41200_M838.bcf
-rw-rw----+ 1 ml18692 sscm 243M Aug 13 10:36 data.batch_41200_T411.bcf
-rw-rw----+ 1 ml18692 sscm 243M Aug 13 10:38 data.batch_41200_W048.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:38 data.batch_41200_W443.bcf
-rw-rw----+ 1 ml18692 sscm 239M Aug 13 10:45 data.batch_41201_Y090.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:38 data.batch_41201_Y445.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:37 data.batch_41202_C183.bcf
-rw-rw----+ 1 ml18692 sscm 246M Aug 13 10:37 data.batch_41202_C539.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:45 data.batch_41202_C800.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:45 data.batch_41202_G518.bcf
-rw-rw----+ 1 ml18692 sscm 240M Aug 13 10:37 data.batch_41202_H508.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:36 data.batch_41202_M5499.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:47 data.batch_41202_M722.bcf
-rw-rw----+ 1 ml18692 sscm 246M Aug 13 10:44 data.batch_41202_S022.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:38 data.batch_41202_S819.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:44 data.batch_41203_3000.bcf
-rw-rw----+ 1 ml18692 sscm 240M Aug 13 10:40 data.batch_41204_F331.bcf
-rw-rw----+ 1 ml18692 sscm 244M Aug 13 10:36 data.batch_41204_H571.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:36 data.batch_41204_M0693.bcf
-rw-rw----+ 1 ml18692 sscm 244M Aug 13 10:37 data.batch_41204_M1300.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:44 data.batch_41204_M7983.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:36 data.batch_41204_M953.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:37 data.batch_41204_T802.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:44 data.batch_41204_Y425.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:36 data.batch_41204_Z548.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:37 data.batch_41204_Z967.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:37 data.batch_41210_L913.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:37 data.batch_41210_O192.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:44 data.batch_41210_S472.bcf
-rw-rw----+ 1 ml18692 sscm 243M Aug 13 10:36 data.batch_41210_T867.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:45 data.batch_41210_U221.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:44 data.batch_41210_Z846.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:36 data.batch_41210_Z944.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:44 data.batch_6156_15.bcf
-rw-rw----+ 1 ml18692 sscm 4.0G Aug 13 11:15 merged.bcf
  1. How is query performance affected

There is a significant performance hit when BCFs are merged:

[ml18692@bc4login1 vcf_08_19]$ time bcftools query -i 'INFO/L10PVAL > 7.3' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%EFFECT\t%SE\t%L10PVAL\n' ../vcf_04_19/data.batch_41210_Z944.bcf  > query.indiv.info.txt

real    0m7.292s
user    0m7.234s
sys 0m0.058s
[ml18692@bc4login1 vcf_08_19]$ time bcftools query -i 'FORMAT/LP > 7.3' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%ES\t%SE\t%LP]\n' data.batch_41210_Z944.bcf > query.indiv.txt

real    0m5.761s
user    0m5.693s
sys 0m0.068s
[ml18692@bc4login1 vcf_08_19]$ time bcftools query -s "UKB-b:1002" -i 'FORMAT/LP > 7.3' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%ES\t%SE\t%LP]\n' merged.bcf > query.merge.txt

real    0m37.360s
user    0m36.454s
sys 0m0.906s
[ml18692@bc4login1 vcf_08_19]$ md5sum query.*txt
293f1146b74dbf2b01e9ef5e2c24573b  query.indiv.info.txt
293f1146b74dbf2b01e9ef5e2c24573b  query.indiv.txt
293f1146b74dbf2b01e9ef5e2c24573b  query.merge.txt
  1. Would we have to store multi allele sites as separate lines (presumably not, but if we did is that a big problem?

No. VCF supports duplicate positions and multi allelic rows so both are valid. However, whatever software we write should support both

Thoughts:

Shall we use this VCF spec but not merge? What do you think?

explodecomputer commented 5 years ago

Thanks for this. I think this basically reconciles two problems - performance for single datasets and storage for omic datasets.

Is the difference in time between info and format consistent?

mcgml commented 5 years ago

Yeh appears to be consistent, I ran it a few times. I shorted the keys i.e INFO/L10PVAL -> FORMAT/LP which might explain the difference.

How do you want to proceed?

explodecomputer commented 5 years ago

The use of the sample data seems convincing.

I'm not sure what the downside is?

Any chance you can point me to the files you generated? Are they in a shared location? No hurry on this

mcgml commented 5 years ago

Thanks Gib, the BCF files are here on BC4: /mnt/storage/private/mrcieu/research/scratch/IGD/data/public/ukbiobank/vcf_08_19

I'm really keen to start using BCFs for my research

mcgml commented 5 years ago

No downsides that I'm aware of

explodecomputer commented 5 years ago

Let's go with it. I'll contact Andrew and Eduardo to discuss it. It looks like genomicsdb is no longer under maintenance and tiledb seems to have support planned for ingestion from vcf but not implemented yet. But perhaps it will be viable by the time we come round to thinking about that.

mcgml commented 5 years ago

Great thanks, I'll merge in.

I think I gave the wrong link for genomicsdb, it's still alive here: https://github.com/GenomicsDB/GenomicsDB

mcgml commented 5 years ago

They provide a Dockerfile so we could test it out pretty quickly but agree it's very low priority

explodecomputer commented 5 years ago

btw do you have info about why you don't like snpeff?

mcgml commented 5 years ago
  1. The VCF alternative allele string may change during merge/subset operations. For details see here:

https://www.cureffi.org/2014/04/24/converting-genetic-variants-to-their-minimal-representation/

https://genome.sph.umich.edu/wiki/Variant_Normalization

SnpEff stores the annotation in a string in the INFO field (ANN format). The only way to assign the correct annotation to a multi allelic site is by matching the allele string from the ANN field with the alternative allele. Since the latter can change, the annotation is lost.

  1. The ANN format stores annotations in a piped string -- no confidence in the correct number of subfields, numerics may be invalid, cannot query etc.

Happy to discuss

mcgml commented 5 years ago

Example

Merged GWAS VCF annotated with SnpEFF:

CHROM | POS | ID | REF | ALT | QUA | FILTER | INFO | FORMAT | GWAS1 | GWAS2 | GWAS3

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- chr1 | 1001 | . | CTCC | CCC,C,CCCC | . | . | ANN=CCC|…,C|…,CCCC|… | EFFECT | .:0.01:0.01 | 0.05:.:. | 0.02:.:.

Selecting a single study gives:

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | GWAS1

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- chr1 | 1001 | . | CTCC | C | . | . | ANN=CCC|…,C|…,CCCC|… | EFFECT | 0.01 chr1 | 1002 | . | T | C | . | . | ANN=CCC|…,C|…,CCCC|… | EFFECT | 0.01

The last row matches the WRONG annotation

mcgml commented 5 years ago

Managed to import our GWAS BCFs into GenomicsDB (Docker) which accepted the files with a couple of tweaks:

  1. converting to bgzip VCF & tbi index
  2. normalising rows to prevent duplicate positions / convert bi-allelic to multi-allelic

In principle it works, will be interesting to test performance on a dedicated cluster

explodecomputer commented 5 years ago

Thanks for this @mcgml I've been looking at tools for manipulating the bcf files e.g. in python and R. I think this could be really useful https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html

However it seems to have way more functionality with tabix indexed vcf.gz files rather than csi indexed bcf files.

I'm just wondering if we should use vcf.gz. It would sacrifice some speed performance for

Do you have thoughts on this?

mcgml commented 5 years ago

Thanks @explodecomputer looks great, much more powerful than vcfR.

In my experience BCF isn't used much and I'm having to convert to VCF for almost every operation outside of bcftools. Would definitely be easier to use bgzip VCF. Might help with adoption too.

Perhaps we need to consider some use cases and evaluate if the performance hit is likely to be an issue? For example I don't plan on querying by P value much if tophits are already flagged. Fast dbSNP lookups would be good but can use SNPbase instead.

The harmonisation tool can write to both but we would need to update the pipeline elsewhere.