EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Investigate efficiency of harmonisation pipeline for WGS studies #1253

Open ljwh2 opened 4 months ago

ljwh2 commented 4 months ago

We would like to check whether the harmonisation (hm) pipeline is significantly more or less efficient for sequencing-based GWAS than for array-based GWAS.

  1. Calculate the average percentage of dropped and unharmonisable variants across a representative sample of array-based summary statistics.
  2. Calculate the same average across a representative sample of sequencing-based summary statistics.

If possible, it would be useful to analyse GWAS-SSF and pre-GWAS-SSF formats separately.

jiyue1214 commented 3 months ago
Genome-wide sequencing:

| # | GWAS_id | Techniques | Harmonised | Raw_rows | Harmonised_rows | hm_14 | hm_15 | hm_16 | Drop_ratio | hm_15 (%) |
|---|---------|------------|------------|----------|-----------------|-------|-------|-------|------------|-----------|
| 1 | GCST90010173 | Genome-wide sequencing | yes | 24181159 | 18290576 | 0 | 0 | 0 | 24.36% | 0.00% |
| 2 | GCST90093113 | Genome-wide sequencing | yes | 7173861 | 7164907 | 0 | 47998 | 0 | 0.12% | 0.67% |
| 3 | GCST90001390 | Genome-wide sequencing | yes | 7843596 | 7654311 | 0 | 45969 | 2 | 2.41% | 0.59% |
| 4 | GCST90014052 | Genome-wide sequencing | yes | 5056041 | 5056029 | 0 | 11168 | 2 | 0.00% | 0.22% |
| 5 | GCST90161593 | Genome-wide sequencing | yes | 10004360 | 9450643 | 0 | 0 | 0 | 5.53% | 0.00% |

vs. Genome-wide genotyping array:

| PMID | GCST_id | Techniques | Harmonised | Raw_rows | Harmonised_rows | hm_14 | hm_15 | hm_16 | Drop_ratio | hm_15 (%) |
|------|---------|------------|------------|----------|-----------------|-------|-------|-------|------------|-----------|
| 33589840 | GCST90012878 | Genome-wide genotyping array | yes | 25643629 | 25367157 | 0 | 292056 | 67 | 1.08% | 1.14% |
| 28887542 | GCST005069 | Genome-wide genotyping array | yes | 25290284 | 25186082 | 19 | 179244 | 2 | 0.41% | 0.71% |
| 33782385 | GCST012278 | Genome-wide genotyping array | yes | 7216416 | 7180648 | 1 | 35317 | 0 | 0.50% | 0.49% |
| 33143745 | GCST90093334 | Genome-wide genotyping array | yes | 8034880 | 7982170 | 0 | 131524 | 18 | 0.66% | 1.64% |
| 30053915 | GCST006353 | Genome-wide genotyping array | yes | 5694112 | 5692296 | 7 | 24756 | 0 | 0.03% | 0.43% |

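For reference, the two percentage columns can be reproduced from the row counts. The sketch below is not the pipeline's own code. The function names are made up for illustration, and using Raw_rows as the denominator for the hm_15 percentage is an assumption inferred from the values in the tables.

```python
def drop_ratio(raw_rows: int, harmonised_rows: int) -> float:
    """Percentage of input rows that are missing from the harmonised file."""
    return 100.0 * (raw_rows - harmonised_rows) / raw_rows


def code_share(code_count: int, raw_rows: int) -> float:
    """Percentage of input rows carrying a given hm code (assumed denominator: Raw_rows)."""
    return 100.0 * code_count / raw_rows


# (Raw_rows, Harmonised_rows, hm_15) copied from one row of each table above.
studies = {
    "GCST90010173": (24181159, 18290576, 0),       # genome-wide sequencing
    "GCST90012878": (25643629, 25367157, 292056),  # genome-wide genotyping array
}

for gcst_id, (raw, harmonised, hm_15) in studies.items():
    print(
        f"{gcst_id}: drop ratio {drop_ratio(raw, harmonised):.2f}%, "
        f"hm_15 {code_share(hm_15, raw):.2f}%"
    )
```

The harmonisation rate for a study is then 100% minus its drop ratio.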
jiyue1214 commented 3 months ago

Next steps:

  1. GCST90010173 and GCST90161593: find out why so many variants (24.36% and 5.53% respectively) are dropped from the harmonised file.
  2. Run harmonisation against the newer Ensembl releases as well.
  3. Prioritise harmonising data for PMIDs 33937362, 35381062, 36124557, 36206743, 36327219, 36349687.
jiyue1214 commented 3 months ago
  1. Reasons why variants are dropped
    • GCST90010173: about 10.5% of variants have a reference allele identical to the alternative allele; about 14% of variants have no matching VCF record
    • GCST90161593: about 5% of variants have no matching VCF record
  2. Harmonisation rerun against newer Ensembl releases (percentage of variants harmonised per reference version):
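
| GCST_id | V_95 (2018) | V_105 (2021) | V_111 (2023) | Total variants | 2% variants |
|---------|-------------|--------------|--------------|----------------|-------------|
| GCST90010173 | 75.64% | 75.76% | 78.39% | 24181159 | 483623.18 |
| GCST90179391 | 79.04% | 74.38% | 77.75% | 30566328 | 611326.56 |
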
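To make the direction of change easier to read, here is a small sketch that computes the difference in harmonisation rate between two reference versions for the two studies above. The dictionary layout and the `rate_change` helper are illustrative assumptions, not part of the pipeline.

```python
# Harmonisation rates (%) per reference version, copied from the table above.
rates = {
    "GCST90010173": {"V_95": 75.64, "V_105": 75.76, "V_111": 78.39},
    "GCST90179391": {"V_95": 79.04, "V_105": 74.38, "V_111": 77.75},
}


def rate_change(study_rates: dict, old: str = "V_95", new: str = "V_111") -> float:
    """Change in harmonisation rate, in percentage points, from `old` to `new`."""
    return round(study_rates[new] - study_rates[old], 2)


for gcst_id, study_rates in rates.items():
    print(f"{gcst_id}: {rate_change(study_rates):+.2f} percentage points (V_95 to V_111)")
```

For these two studies the change goes in opposite directions: one rate rises and the other falls.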
jiyue1214 commented 3 months ago

Variants that can be harmonised with V_95 but not with V_111 fall into two cases:

  1. Some indels: the representation of the variant is shifted by one base between V_95 and V_111. The matching VCF record cannot be retrieved from V_111, because the indel representation there differs from the one in the input file.
  2. Some SNPs: multiple records are retrieved from V_111, and our pipeline cannot tell which one is correct.
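As an illustration of the indel case, the toy sketch below shows how the same deletion can be written at two positions one base apart. Both records produce the same sequence, but an exact match on position and alleles treats them as different variants. The sequence, the records and `apply_variant` are hypothetical, and it is an assumption that the shift seen between V_95 and V_111 is of this kind.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Apply one variant (1-based position, VCF-style REF/ALT) to a reference sequence."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return reference[:start] + alt + reference[start + len(ref):]


reference = "GCACAT"  # toy sequence containing a short CA repeat

# Two hypothetical records for the same 2-base deletion, shifted by one base.
record_a = (1, "GCA", "G")
record_b = (2, "CAC", "C")

same_haplotype = apply_variant(reference, *record_a) == apply_variant(reference, *record_b)
same_key = record_a == record_b

print(same_haplotype)  # True: both records yield the same sequence
print(same_key)        # False: an exact position/allele lookup would miss the match
```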
jiyue1214 commented 2 months ago

@ljwh2 Can we close this ticket? After our investigation:

  1. Only 5 whole-genome sequencing (WGS) studies were available.
  2. The harmonisation rate among these five studies varies widely, ranging from 75% to 100%.
  3. By comparison, the harmonisation rate for genome-wide genotyping array data ranged from 98.92% to 99.97%.
  4. There is no strong evidence that the harmonisation rate is significantly lower for WGS studies.

We also investigated whether the updated reference VCF file improves the harmonisation rate for WGS data. Of the 8 studies we tested, the rate increased for 3 and decreased for the other 5. Therefore, a newer reference VCF does not necessarily improve the harmonisation rate.

Our collaborator mentioned that they cannot use the variants that cannot be harmonised.