Investigate efficiency of harmonisation pipeline for WGS studies

ljwh2 commented 8 months ago

We would like to verify whether the efficiency of the harmonisation (hm) pipeline differs significantly for sequencing-based GWAS compared with array-based GWAS.

Calculate the average % of dropped and unharmonisable variants among a representative sample of array-based summary statistics.
Calculate the average % of dropped and unharmonisable variants among the sequencing-based (WGS) summary statistics.

If possible, it would be useful to analyse the GWAS-SSF and pre-GWAS-SSF formats separately.

jiyue1214 commented 8 months ago

Genome-wide sequencing:

| | GWAS_id | Technique | Harmonised | Raw_rows | Harmonised_rows | hm_14 | hm_15 | hm_16 | Drop_ratio | hm_15_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GCST90010173 | Genome-wide sequencing | yes | 24181159 | 18290576 | | 0 | 0 | 24.36% | 0.00% |
| 2 | GCST90093113 | Genome-wide sequencing | yes | 7173861 | 7164907 | | 47998 | 0 | 0.12% | 0.67% |
| 3 | GCST90001390 | Genome-wide sequencing | yes | 7843596 | 7654311 | | 45969 | 2 | 2.41% | 0.59% |
| 4 | GCST90014052 | Genome-wide sequencing | yes | 5056041 | 5056029 | | 11168 | 2 | 0.00% | 0.22% |
| 5 | GCST90161593 | Genome-wide sequencing | yes | 10004360 | 9450643 | | 0 | 0 | 5.53% | 0.00% |

vs. Genome-wide genotyping array:

| PMID | GCST_id | Technique | Harmonised | Raw_rows | Harmonised_rows | hm_14 | hm_15 | hm_16 | Drop_ratio | hm_15_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 33589840 | GCST90012878 | Genome-wide genotyping array | yes | 25643629 | 25367157 | 0 | 292056 | 67 | 1.08% | 1.14% |
| 28887542 | GCST005069 | Genome-wide genotyping array | yes | 25290284 | 25186082 | 19 | 179244 | 2 | 0.41% | 0.71% |
| 33782385 | GCST012278 | Genome-wide genotyping array | yes | 7216416 | 7180648 | 1 | 35317 | 0 | 0.50% | 0.49% |
| 33143745 | GCST90093334 | Genome-wide genotyping array | yes | 8034880 | 7982170 | 0 | 131524 | 18 | 0.66% | 1.64% |
| 30053915 | GCST006353 | Genome-wide genotyping array | yes | 5694112 | 5692296 | 7 | 24756 | 0 | 0.03% | 0.43% |
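The drop ratio in both tables appears to be (Raw_rows - Harmonised_rows) / Raw_rows. As a minimal sketch, the per-technique averages asked for at the top of this ticket could be recomputed from the row counts above like this. The names `drop_pct` and `mean_drop_pct` are made up for illustration and are not part of the harmonisation pipeline.

```python
from statistics import mean

# (study accession, raw rows, harmonised rows), copied from the tables above
WGS = [
    ("GCST90010173", 24181159, 18290576),
    ("GCST90093113", 7173861, 7164907),
    ("GCST90001390", 7843596, 7654311),
    ("GCST90014052", 5056041, 5056029),
    ("GCST90161593", 10004360, 9450643),
]
ARRAY = [
    ("GCST90012878", 25643629, 25367157),
    ("GCST005069", 25290284, 25186082),
    ("GCST012278", 7216416, 7180648),
    ("GCST90093334", 8034880, 7982170),
    ("GCST006353", 5694112, 5692296),
]


def drop_pct(raw_rows, harmonised_rows):
    """Percentage of input rows that are missing from the harmonised file."""
    return 100.0 * (raw_rows - harmonised_rows) / raw_rows


def mean_drop_pct(studies):
    """Unweighted mean of the per-study drop percentages."""
    return mean(drop_pct(raw, hm) for _, raw, hm in studies)


if __name__ == "__main__":
    for label, studies in (("WGS", WGS), ("array", ARRAY)):
        for accession, raw, hm in studies:
            print(f"{label}\t{accession}\t{drop_pct(raw, hm):.2f}%")
        print(f"{label}\tmean\t{mean_drop_pct(studies):.2f}%")
```

The mean is unweighted, so each study counts equally regardless of size; with only five studies per group, a single outlier such as GCST90010173 dominates the WGS average.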

jiyue1214 commented 8 months ago

Next steps:

GCST90010173 and GCST90161593: explore why a large share of variants (24.36% and 5.53%, respectively) is dropped in the harmonised file.
Run harmonisation against the newer Ensembl versions as well.
Prioritise harmonising the data for 33937362, 35381062, 36124557, 36206743, 36327219, 36349687.

jiyue1214 commented 8 months ago

Reasons why variants are dropped:
- GCST90010173: for ~10.5% of variants the reference allele equals the alternative allele; for ~14% of variants no VCF record can be found
- GCST90161593: for ~5% of variants no VCF record can be found
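A quick way to check the first cause on an input file is to count rows whose two alleles are identical. The sketch below assumes a tab-separated file with GWAS-SSF style column names (`effect_allele`, `other_allele`); both the column names and the tiny sample are assumptions for illustration, not taken from the studies above.

```python
import csv
import io

# Made-up example rows, not real summary statistics.
SAMPLE = (
    "chromosome\tbase_pair_location\teffect_allele\tother_allele\n"
    "1\t1000\tA\tG\n"
    "1\t2000\tC\tC\n"
    "2\t3000\tT\tTA\n"
    "2\t4000\tg\tG\n"
)


def count_identical_alleles(handle, a1="effect_allele", a2="other_allele"):
    """Return (rows with identical alleles, total rows), ignoring letter case."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = identical = 0
    for row in reader:
        total += 1
        if row[a1].upper() == row[a2].upper():
            identical += 1
    return identical, total


if __name__ == "__main__":
    identical, total = count_identical_alleles(io.StringIO(SAMPLE))
    print(f"{identical}/{total} rows have identical alleles")
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.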
Results of running harmonisation against newer Ensembl versions (harmonisation rate per version):

| | V_95 (2018) | V_105 (2021) | V_111 (2023) | Total variants | 2% of variants |
|---|---|---|---|---|---|
| GCST90010173 | 75.64% | 75.76% | 78.39% | 24181159 | 483623.18 |
| GCST90179391 | 79.04% | 74.38% | 77.75% | 30566328 | 611326.56 |

jiyue1214 commented 8 months ago

Variants that can be harmonised with V_95 but not with V_111 fall into two cases:

Some indels: the variant representations in V_95 and V_111 differ by a 1-base shift. The matching VCF records cannot be retrieved correctly from V_111, and their indel representations differ from those in the input file.
Some SNPs: multiple records are retrieved from V_111, and our pipeline cannot tell which one is correct.
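If the 1-base shift comes from the same indel being written at different positions inside a repeat (an assumption; this thread does not establish the cause), one common remedy is to left-align both representations before comparing them. The sketch below is a generic left-alignment routine on a toy sequence, not what the harmonisation pipeline currently does; `left_align` and `GENOME` are hypothetical names.

```python
# Toy reference sequence, positions are 1-based: G1 G2 C3 A4 C5 A6 C7 A8 T9 T10
GENOME = "GGCACACATT"


def left_align(pos, ref, alt, genome):
    """Left-align and trim an indel given as (1-based pos, ref, alt)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            base = genome[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    # Two descriptions of the same deletion of one "CA" unit, shifted by one base.
    print(left_align(6, "ACA", "A", GENOME))
    print(left_align(5, "CAC", "C", GENOME))
```

Both calls describe the same deletion and resolve to the same left-aligned record, so matching on the normalised triple rather than on the raw position would treat them as one variant.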

jiyue1214 commented 7 months ago

@ljwh2 Can we close this ticket? After our investigation:

Only 5 whole-genome sequencing datasets were available.
The harmonisation rate among these five studies varies widely, from 75% to 100%.
For comparison, the harmonisation rate of the genome-wide genotyping array data varied from 98.02% to 99.97%.
There is no strong evidence that the harmonisation rate is significantly lower among the WGS studies.
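With five studies per group, one way to put a number on "no strong conclusion" is an exact permutation test on the difference in mean drop ratio. This is only a sketch: the choice of test and the function names are assumptions added for illustration, and the inputs are the rounded drop ratios from the tables earlier in this thread.

```python
from itertools import combinations

# Drop ratios in percent, copied from the tables earlier in this thread
WGS_DROP = [24.36, 0.12, 2.41, 0.00, 5.53]
ARRAY_DROP = [1.08, 0.41, 0.50, 0.66, 0.03]


def mean_diff(group_a, group_b):
    """Difference in means, group_a minus group_b."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


def permutation_p_value(a, b):
    """One-sided exact permutation p-value for mean(a) > mean(b)."""
    pooled = a + b
    observed = mean_diff(a, b)
    n_total = 0
    n_extreme = 0
    # Enumerate every way of relabelling len(a) of the pooled values as group a.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        n_total += 1
        # Small tolerance guards against floating-point noise in the comparison.
        if mean_diff(group_a, group_b) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total


if __name__ == "__main__":
    print(f"one-sided p-value: {permutation_p_value(WGS_DROP, ARRAY_DROP):.4f}")
```

A test like this has very little power at this sample size, so a non-significant result should be read as "not enough data", not as evidence that the two techniques behave the same.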

We also investigated whether the updated reference VCF file improves the harmonisation rate for the WGS data. Of the 8 studies tested, the rate increased for 3 and decreased for the other 5. A newer reference VCF therefore does not necessarily improve the harmonisation rate.

Our collaborator mentioned that they cannot use the variants that cannot be harmonised.

EBISPOT / goci

Investigate efficiency of harmonisation pipeline for WGS studies #1253