Closed mcgml closed 5 years ago
@explodecomputer been thinking about the idea of storing multiple trait associations in a single VCF/BCF which as discussed has the following advantages:
This is what the code looks like: https://github.com/MRCIEU/gwas_harmonisation/commit/7ff3138dc8c1694ff9e6e4773fabe70c447c2dcf
And the output for LDL-c (MRB #300):
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=EFFECT,Number=A,Type=Float,Description="Effect size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=A,Type=Float,Description="Standard error of effect size estimate">
##FORMAT=<ID=L10PVAL,Number=A,Type=Float,Description="-log10 p-value for effect estimate">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Alternate allele frequency in the association study">
##FORMAT=<ID=N,Number=A,Type=Float,Description="Sample size used to estimate genetic effect">
##FORMAT=<ID=ZSCORE,Number=A,Type=Float,Description="Z-score provided if it was used to derive the EFFECT and SE fields">
##FORMAT=<ID=SIMPINFO,Number=A,Type=Float,Description="Accuracy score of summary data imputation">
##FORMAT=<ID=PROPCASES,Number=A,Type=Float,Description="Proportion of sample size that were cases in the GWAS">
##contig=<ID=1,length=249250621,assembly=GRCh37>
##contig=<ID=2,length=243199373,assembly=GRCh37>
##contig=<ID=3,length=198022430,assembly=GRCh37>
##contig=<ID=4,length=191154276,assembly=GRCh37>
##contig=<ID=5,length=180915260,assembly=GRCh37>
##contig=<ID=6,length=171115067,assembly=GRCh37>
##contig=<ID=7,length=159138663,assembly=GRCh37>
##contig=<ID=8,length=146364022,assembly=GRCh37>
##contig=<ID=9,length=141213431,assembly=GRCh37>
##contig=<ID=10,length=135534747,assembly=GRCh37>
##contig=<ID=11,length=135006516,assembly=GRCh37>
##contig=<ID=12,length=133851895,assembly=GRCh37>
##contig=<ID=13,length=115169878,assembly=GRCh37>
##contig=<ID=14,length=107349540,assembly=GRCh37>
##contig=<ID=15,length=102531392,assembly=GRCh37>
##contig=<ID=16,length=90354753,assembly=GRCh37>
##contig=<ID=17,length=81195210,assembly=GRCh37>
##contig=<ID=18,length=78077248,assembly=GRCh37>
##contig=<ID=19,length=59128983,assembly=GRCh37>
##contig=<ID=20,length=63025520,assembly=GRCh37>
##contig=<ID=21,length=48129895,assembly=GRCh37>
##contig=<ID=22,length=51304566,assembly=GRCh37>
##contig=<ID=X,length=155270560,assembly=GRCh37>
##contig=<ID=Y,length=59373566,assembly=GRCh37>
##contig=<ID=MT,length=16569,assembly=GRCh37>
##contig=<ID=GL000207.1,length=4262,assembly=GRCh37>
##contig=<ID=GL000226.1,length=15008,assembly=GRCh37>
##contig=<ID=GL000229.1,length=19913,assembly=GRCh37>
##contig=<ID=GL000231.1,length=27386,assembly=GRCh37>
##contig=<ID=GL000210.1,length=27682,assembly=GRCh37>
##contig=<ID=GL000239.1,length=33824,assembly=GRCh37>
##contig=<ID=GL000235.1,length=34474,assembly=GRCh37>
##contig=<ID=GL000201.1,length=36148,assembly=GRCh37>
##contig=<ID=GL000247.1,length=36422,assembly=GRCh37>
##contig=<ID=GL000245.1,length=36651,assembly=GRCh37>
##contig=<ID=GL000197.1,length=37175,assembly=GRCh37>
##contig=<ID=GL000203.1,length=37498,assembly=GRCh37>
##contig=<ID=GL000246.1,length=38154,assembly=GRCh37>
##contig=<ID=GL000249.1,length=38502,assembly=GRCh37>
##contig=<ID=GL000196.1,length=38914,assembly=GRCh37>
##contig=<ID=GL000248.1,length=39786,assembly=GRCh37>
##contig=<ID=GL000244.1,length=39929,assembly=GRCh37>
##contig=<ID=GL000238.1,length=39939,assembly=GRCh37>
##contig=<ID=GL000202.1,length=40103,assembly=GRCh37>
##contig=<ID=GL000234.1,length=40531,assembly=GRCh37>
##contig=<ID=GL000232.1,length=40652,assembly=GRCh37>
##contig=<ID=GL000206.1,length=41001,assembly=GRCh37>
##contig=<ID=GL000240.1,length=41933,assembly=GRCh37>
##contig=<ID=GL000236.1,length=41934,assembly=GRCh37>
##contig=<ID=GL000241.1,length=42152,assembly=GRCh37>
##contig=<ID=GL000243.1,length=43341,assembly=GRCh37>
##contig=<ID=GL000242.1,length=43523,assembly=GRCh37>
##contig=<ID=GL000230.1,length=43691,assembly=GRCh37>
##contig=<ID=GL000237.1,length=45867,assembly=GRCh37>
##contig=<ID=GL000233.1,length=45941,assembly=GRCh37>
##contig=<ID=GL000204.1,length=81310,assembly=GRCh37>
##contig=<ID=GL000198.1,length=90085,assembly=GRCh37>
##contig=<ID=GL000208.1,length=92689,assembly=GRCh37>
##contig=<ID=GL000191.1,length=106433,assembly=GRCh37>
##contig=<ID=GL000227.1,length=128374,assembly=GRCh37>
##contig=<ID=GL000228.1,length=129120,assembly=GRCh37>
##contig=<ID=GL000214.1,length=137718,assembly=GRCh37>
##contig=<ID=GL000221.1,length=155397,assembly=GRCh37>
##contig=<ID=GL000209.1,length=159169,assembly=GRCh37>
##contig=<ID=GL000218.1,length=161147,assembly=GRCh37>
##contig=<ID=GL000220.1,length=161802,assembly=GRCh37>
##contig=<ID=GL000213.1,length=164239,assembly=GRCh37>
##contig=<ID=GL000211.1,length=166566,assembly=GRCh37>
##contig=<ID=GL000199.1,length=169874,assembly=GRCh37>
##contig=<ID=GL000217.1,length=172149,assembly=GRCh37>
##contig=<ID=GL000216.1,length=172294,assembly=GRCh37>
##contig=<ID=GL000215.1,length=172545,assembly=GRCh37>
##contig=<ID=GL000205.1,length=174588,assembly=GRCh37>
##contig=<ID=GL000219.1,length=179198,assembly=GRCh37>
##contig=<ID=GL000224.1,length=179693,assembly=GRCh37>
##contig=<ID=GL000223.1,length=180455,assembly=GRCh37>
##contig=<ID=GL000195.1,length=182896,assembly=GRCh37>
##contig=<ID=GL000212.1,length=186858,assembly=GRCh37>
##contig=<ID=GL000222.1,length=186861,assembly=GRCh37>
##contig=<ID=GL000200.1,length=187035,assembly=GRCh37>
##contig=<ID=GL000193.1,length=189789,assembly=GRCh37>
##contig=<ID=GL000194.1,length=191469,assembly=GRCh37>
##contig=<ID=GL000225.1,length=211173,assembly=GRCh37>
##contig=<ID=GL000192.1,length=547496,assembly=GRCh37>
##gwas_harmonisation_command=--out test/data/example.vcf --data test/data/example.h.txt --ref /data/db/human/gatk/2.8/b37/human_g1k_v37.fasta --id 300 --json test/example.json; 1.0.1
##file_date=2019-08-12T14:06:55.142804
##counts.total_variants=9
##counts.variants_not_read=0
##counts.harmonised_variants=9
##counts.variants_not_harmonised=0
##counts.switched_alleles=9
##cohort_sample_size=None
##cohort_frac_cases=None
##gwas.id=300
##gwas.type=continuous
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- 10 | 9960129 | rs4747841 | A | G | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0037:0.0052:0.145208:0.5092:89138:.:.:. 10 | 9960259 | rs4749917 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0033:0.0052:0.11081:0.5092:89138:.:.:. 10 | 9960453 | rs12219605 | G | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0035:0.0052:0.131179:0.5092:89138:.:.:. 10 | 100012739 | rs737656 | A | G | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0099:0.0054:1.39794:0.6794:89888:.:.:. 10 | 100012890 | rs737657 | A | G | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0084:0.0054:1.07428:0.6794:89888:.:.:. 10 | 100013563 | rs7086391 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0075:0.0067:0.570409:0.219:89888:.:.:. 10 | 100013815 | rs878177 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0073:0.0055:0.861382:0.3483:89888:.:.:. 10 | 100015603 | rs3763689 | G | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0066:0.0063:0.489187:0.2203:89888:.:.:. 10 | 100016339 | rs1983865 | C | T | . | PASS | . | EFFECT:SE:L10PVAL:AF:N:ZSCORE:SIMPINFO:PROPCASES | -0.0037:0.0052:0.469288:0.4591:89888:.:.:.
What do you think?
Wow interesting! Can you give me an example of how you would extract specific queries?
Questions that arise:
Ok so a query would look like this?
bcftools view -O --samples 300 --include 'L10PVAL>8'
that would extract everything with log 10 pval > 8 for dataset 300?
Yeh that's correct, I'll get back to you with some examples
From: gibran hemani notifications@github.com Sent: 12 August 2019 15:23:54 To: MRCIEU/gwas_harmonisation gwas_harmonisation@noreply.github.com Cc: Matt Lyon matt.lyon@bristol.ac.uk; Assign assign@noreply.github.com Subject: Re: [MRCIEU/gwas_harmonisation] allow for storing multiple traits per VCF (use the sample field) (#15)
Ok so a query would look like this?
bcftools view -O --samples 300 --include 'L10PVAL>8'
that would extract everything with log 10 pval > 8 for dataset 300?
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/MRCIEU/gwas_harmonisation/issues/15?email_source=notifications&email_token=ADC5GSVXSUCAUYJJ4CD2PHDQEFW7VA5CNFSM4ILA4Z2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4CWDAY#issuecomment-520446339, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ADC5GSTDL4WYKMJUCZKMFN3QEFW7VANCNFSM4ILA4Z2A.
44 UK Biobank BCFs = 11.24 GB individually or 4.0GB merged
[ml18692@bc4login3 vcf_08_19]$ ls -lh *bcf
-rw-rw----+ 1 ml18692 sscm 249M Aug 13 10:37 data.batch_189.bcf
-rw-rw----+ 1 ml18692 sscm 246M Aug 13 10:38 data.batch_20001_1017.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:38 data.batch_20001_1052.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:44 data.batch_20003_1140862382.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:45 data.batch_20003_1140862742.bcf
-rw-rw----+ 1 ml18692 sscm 249M Aug 13 10:45 data.batch_20003_1141146234.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:38 data.batch_20003_1141164616.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:37 data.batch_20003_1141167308.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:36 data.batch_41200_A798.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:45 data.batch_41200_J241.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:36 data.batch_41200_M838.bcf
-rw-rw----+ 1 ml18692 sscm 243M Aug 13 10:36 data.batch_41200_T411.bcf
-rw-rw----+ 1 ml18692 sscm 243M Aug 13 10:38 data.batch_41200_W048.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:38 data.batch_41200_W443.bcf
-rw-rw----+ 1 ml18692 sscm 239M Aug 13 10:45 data.batch_41201_Y090.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:38 data.batch_41201_Y445.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:37 data.batch_41202_C183.bcf
-rw-rw----+ 1 ml18692 sscm 246M Aug 13 10:37 data.batch_41202_C539.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:45 data.batch_41202_C800.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:45 data.batch_41202_G518.bcf
-rw-rw----+ 1 ml18692 sscm 240M Aug 13 10:37 data.batch_41202_H508.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:36 data.batch_41202_M5499.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:47 data.batch_41202_M722.bcf
-rw-rw----+ 1 ml18692 sscm 246M Aug 13 10:44 data.batch_41202_S022.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:38 data.batch_41202_S819.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:44 data.batch_41203_3000.bcf
-rw-rw----+ 1 ml18692 sscm 240M Aug 13 10:40 data.batch_41204_F331.bcf
-rw-rw----+ 1 ml18692 sscm 244M Aug 13 10:36 data.batch_41204_H571.bcf
-rw-rw----+ 1 ml18692 sscm 242M Aug 13 10:36 data.batch_41204_M0693.bcf
-rw-rw----+ 1 ml18692 sscm 244M Aug 13 10:37 data.batch_41204_M1300.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:44 data.batch_41204_M7983.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:36 data.batch_41204_M953.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:37 data.batch_41204_T802.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:44 data.batch_41204_Y425.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:36 data.batch_41204_Z548.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:37 data.batch_41204_Z967.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:37 data.batch_41210_L913.bcf
-rw-rw----+ 1 ml18692 sscm 241M Aug 13 10:37 data.batch_41210_O192.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:44 data.batch_41210_S472.bcf
-rw-rw----+ 1 ml18692 sscm 243M Aug 13 10:36 data.batch_41210_T867.bcf
-rw-rw----+ 1 ml18692 sscm 245M Aug 13 10:45 data.batch_41210_U221.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:44 data.batch_41210_Z846.bcf
-rw-rw----+ 1 ml18692 sscm 247M Aug 13 10:36 data.batch_41210_Z944.bcf
-rw-rw----+ 1 ml18692 sscm 248M Aug 13 10:44 data.batch_6156_15.bcf
-rw-rw----+ 1 ml18692 sscm 4.0G Aug 13 11:15 merged.bcf
There is a significant performance hit when BCFs are merged:
[ml18692@bc4login1 vcf_08_19]$ time bcftools query -i 'INFO/L10PVAL > 7.3' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%EFFECT\t%SE\t%L10PVAL\n' ../vcf_04_19/data.batch_41210_Z944.bcf > query.indiv.info.txt
real 0m7.292s
user 0m7.234s
sys 0m0.058s
[ml18692@bc4login1 vcf_08_19]$ time bcftools query -i 'FORMAT/LP > 7.3' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%ES\t%SE\t%LP]\n' data.batch_41210_Z944.bcf > query.indiv.txt
real 0m5.761s
user 0m5.693s
sys 0m0.068s
[ml18692@bc4login1 vcf_08_19]$ time bcftools query -s "UKB-b:1002" -i 'FORMAT/LP > 7.3' -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%ES\t%SE\t%LP]\n' merged.bcf > query.merge.txt
real 0m37.360s
user 0m36.454s
sys 0m0.906s
[ml18692@bc4login1 vcf_08_19]$ md5sum query.*txt
293f1146b74dbf2b01e9ef5e2c24573b query.indiv.info.txt
293f1146b74dbf2b01e9ef5e2c24573b query.indiv.txt
293f1146b74dbf2b01e9ef5e2c24573b query.merge.txt
No. VCF supports duplicate positions and multi allelic rows so both are valid. However, whatever software we write should support both
Thoughts:
Shall we use this VCF spec but not merge? What do you think?
Thanks for this. I think this basically reconciles two problems - performance for single datasets and storage for omic datasets.
Is the difference in time between info and format consistent?
Yeh appears to be consistent, I ran it a few times. I shorted the keys i.e INFO/L10PVAL -> FORMAT/LP which might explain the difference.
How do you want to proceed?
The use of the sample data seems convincing.
I'm not sure what the downside is?
Any chance you can point me to the files you generated? Are they in a shared location? No hurry on this
Thanks Gib, the BCF files are here on BC4: /mnt/storage/private/mrcieu/research/scratch/IGD/data/public/ukbiobank/vcf_08_19
I'm really keen to start using BCFs for my research
No downsides that I'm aware of
Let's go with it. I'll contact Andrew and Eduardo to discuss it. It looks like genomicsdb is no longer under maintenance and tiledb seems to have support planned for ingestion from vcf but not implemented yet. But perhaps it will be viable by the time we come round to thinking about that.
Great thanks, I'll merge in.
I think I gave the wrong link for genomicsdb, it's still alive here: https://github.com/GenomicsDB/GenomicsDB
They provide a Dockerfile so we could test it out pretty quickly but agree it's very low priority
btw do you have info about why you don't like snpeff?
https://www.cureffi.org/2014/04/24/converting-genetic-variants-to-their-minimal-representation/
https://genome.sph.umich.edu/wiki/Variant_Normalization
SnpEff stores the annotation in a string in the INFO field (ANN format). The only way to assign the correct annotation to a multi allelic site is by matching the allele string from the ANN field with the alternative allele. Since the latter can change, the annotation is lost.
Happy to discuss
Example
Merged GWAS VCF annotated with SnpEFF:
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- chr1 | 1001 | . | CTCC | CCC,C,CCCC | . | . | ANN=CCC|…,C|…,CCCC|… | EFFECT | .:0.01:0.01 | 0.05:.:. | 0.02:.:.
Selecting a single study gives:
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- chr1 | 1001 | . | CTCC | C | . | . | ANN=CCC|…,C|…,CCCC|… | EFFECT | 0.01 chr1 | 1002 | . | T | C | . | . | ANN=CCC|…,C|…,CCCC|… | EFFECT | 0.01
The last row matches the WRONG annotation
Managed to import our GWAS BCFs into GenomicsDB (Docker) which accepted the files with a couple of tweaks:
In principle it works, will be interesting to test performance on a dedicated cluster
Thanks for this @mcgml I've been looking at tools for manipulating the bcf files e.g. in python and R. I think this could be really useful https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html
However it seems to have way more functionality with tabix indexed vcf.gz files rather than csi indexed bcf files.
I'm just wondering if we should use vcf.gz. It would sacrifice some speed performance for
Do you have thoughts on this?
Thanks @explodecomputer looks great, much more powerful than vcfR.
In my experience BCF isn't used much and I'm having to convert to VCF for almost every operation outside of bcftools. Would definitely be easier to use bgzip VCF. Might help with adoption too.
Perhaps we need to consider some use cases and evaluate if the performance hit is likely to be an issue? For example I don't plan on querying by P value much if tophits are already flagged. Fast dbSNP lookups would be good but can use SNPbase instead.
The harmonisation tool can write to both but we would need to update the pipeline elsewhere.
Also helpful for storage with GenomicsDB?