Sumstats with incorrect headers - Githubissues

EBISPOT / xgwas-curator-tasks

An internal repo for GWAS curators to track issues

Sumstats with incorrect headers #17

Closed ljwh2 closed 1 year ago

ljwh2 commented 1 year ago

Some sumstats that were added as part of the splitting task have incorrect headers or formats: Kettunen (wrong format), Suhre (no headers) and Draisma (file and headers in the wrong format). This needs investigating and correcting.

earlEBI commented 1 year ago

For Kettunen (PMID 27005778), the GCST files are the raw unformatted files, e.g. (header, then first row):

chromosome position ID EA NEA eaf beta se p-value n_studies n_samples
1 51479 rs116400033 A T 0.224105 -0.016775 0.022137 0.453350 8 12475

The harmonised files on our FTP were harmonised a long time ago and have the following headers / first row:

hm_variant_id hm_rsid hm_chrom hm_pos hm_other_allele hm_effect_allele hm_beta hm_odds_ratio hm_ci_lower hm_ci_upper hm_effect_allele_frequency hm_code chromosome base_pair_location variant_id p_value beta standard_error effect_allele_frequency effect_allele other_allele n_studies n_samples odds_ratio ci_upper ci_lower
1_51479_T_A rs116400033 1 51479 T A -0.016775 NA NA NA 0.22410500000000003 5 1 51479 rs116400033 0.45335 -0.016775 0.022137 0.22410500000000003 A T12475 NA NA NA
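To make the difference between the two Kettunen headers concrete, here is a minimal sketch that lines the raw column names up against the non-`hm_` names seen in the harmonised file. The `RENAME` mapping and both function names are hypothetical: the mapping is only inferred from the two header lists quoted in this comment and is not taken from any formatting specification.

```python
# Raw header from the Kettunen GCST file, as quoted in this comment.
RAW_HEADER = (
    "chromosome position ID EA NEA eaf beta se p-value n_studies n_samples".split()
)

# Hypothetical mapping, inferred by comparing the raw header with the
# non-hm_ columns of the harmonised file. It is an assumption, not a spec.
RENAME = {
    "position": "base_pair_location",
    "ID": "variant_id",
    "EA": "effect_allele",
    "NEA": "other_allele",
    "eaf": "effect_allele_frequency",
    "se": "standard_error",
    "p-value": "p_value",
}


def rename_header(header, mapping=RENAME):
    """Return the header with mapped names replaced; other names pass through."""
    return [mapping.get(name, name) for name in header]


def differing_columns(header, mapping=RENAME):
    """List the raw column names that the mapping would change."""
    return [name for name in header if name in mapping]


renamed = rename_header(RAW_HEADER)
print(renamed)
print(differing_columns(RAW_HEADER))
```

Under this assumed mapping, only `chromosome`, `beta`, `n_studies` and `n_samples` already carry the names used in the harmonised file; the other seven raw names differ.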

I'm not sure what the problem is with the format that you mention?

earlEBI commented 1 year ago

For Suhre (e.g. GCST90101376_buildGRCh37.tsv.gz), I'm not sure whether the files were harmonised recently or not. Info is here: https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/496 (It sounds to me like they were also harmonised a long time ago.)

Can confirm the raw files have no header, e.g. the first line is:

4250-23_3_one NSFL1 cofactor p47 / one 9 rs724017 79333775 A ADD 995 3.296e-06 5.621e-05
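A missing header like this can be flagged automatically. The sketch below is a rough heuristic, not an existing curation tool: it assumes the file is tab-separated, treats the last two values of the Suhre line as separate fields, and relies on the idea that a real header line should contain no purely numeric fields. The function name `looks_headerless` is made up for illustration.

```python
import re

# Matches plain integers, decimals and scientific notation such as 3.296e-06.
_NUMERIC = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def looks_headerless(first_line, sep="\t"):
    """Heuristic: a first line containing any numeric field is probably data."""
    fields = [field.strip() for field in first_line.rstrip("\n").split(sep)]
    return any(_NUMERIC.match(field) is not None for field in fields)


# First line of the Suhre raw file, with tabs assumed between fields.
suhre_first_line = "\t".join([
    "4250-23_3_one", "NSFL1 cofactor p47 / one", "9", "rs724017",
    "79333775", "A", "ADD", "995", "3.296e-06", "5.621e-05",
])

# Header line of the Kettunen raw file, for comparison.
kettunen_first_line = "\t".join(
    "chromosome position ID EA NEA eaf beta se p-value n_studies n_samples".split()
)

print(looks_headerless(suhre_first_line))
print(looks_headerless(kettunen_first_line))
```

The check would misfire on a file whose header happens to contain a numeric column name, so it is better suited to flagging files for manual review than to rejecting them outright.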

The harmonised files look OK, but they don't quite match the raw info. I'm not sure where the p-value has come from. Headers / first row:

hm_varid hm_rsid hm_chrom hm_pos hm_other_allele hm_effect_allele hm_beta hm_OR hm_OR_lowerCI hm_OR_upperCI hm_eaf hm_code SEQID PROT chromosome variant_id base_pair_location effect_allele other_allele model neff beta z p-value
9_76718859_G_A rs724017 9 76718859 G A 3.296e-06 14250-23_3_one NSFL1 cofactor p47 / one 9 rs724017 76718859 A G ADD 995 3.296e-06 5.621e-05 1

earlEBI commented 1 year ago

For Draisma, the raw file is unformatted (header, then first row):

MarkerName Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P-value Direction NTot HetISq HetChiSq HetDf HetPVal
rs2326918 a g 8.564e-01 2.140e-02 8.438e-01 9.617e-01 -3.100e-03 7.200e-03 6.692e-01 --++--- 7402 1.800e+01 7.321e+00 6 2.922e-01

And the harmonised files were harmonised before (header, then first row):

hm_variant_id hm_rsid hm_chrom hm_pos hm_other_allele hm_effect_allele hm_beta hm_odds_ratio hm_ci_lower hm_ci_upper hm_effect_allele_frequency hm_code variant_id p_value beta standard_error effect_allele_frequency effect_allele other_allele FreqSE MinFreq MaxFreq Direction NTot HetISq HetChiSq HetDf HetPVal chromosome base_pair_location odds_ratio ci_upper ci_lower
6_130518946_A_G rs2326918 6 130518946 A G 0.0031 NA NA NA 0.14359999999999995 11 rs2326918 0.6692 -0.0031 0.0072 0.8564 a g 0.0214 0.8438 0.9617 --++--- 7402 18.0 7.321000000000001 6 0.2922 6 130518946 NA NA NA
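One way to sanity-check a harmonised row against its source values is sketched below. It tests whether the `hm_` values are what you would expect if the effect and other alleles had been swapped: alleles exchanged, beta negated, and effect allele frequency replaced by one minus itself. That interpretation is an inference from the numbers in this comment, not something stated in the thread, and the function name and dictionary keys are hypothetical.

```python
import math


def consistent_with_allele_swap(raw, hm, rel_tol=1e-9):
    """True if hm looks like raw with the effect and other alleles swapped."""
    alleles_swapped = (
        raw["effect_allele"].upper() == hm["other_allele"].upper()
        and raw["other_allele"].upper() == hm["effect_allele"].upper()
    )
    beta_negated = math.isclose(hm["beta"], -raw["beta"], rel_tol=rel_tol)
    eaf_complemented = math.isclose(hm["eaf"], 1.0 - raw["eaf"], rel_tol=rel_tol)
    return alleles_swapped and beta_negated and eaf_complemented


# Values for rs2326918 copied from the original (non-hm_) columns.
raw_row = {"effect_allele": "a", "other_allele": "g", "beta": -0.0031, "eaf": 0.8564}

# Values for the same variant copied from the hm_ columns.
hm_row = {
    "effect_allele": "G",
    "other_allele": "A",
    "beta": 0.0031,
    "eaf": 0.14359999999999995,
}

print(consistent_with_allele_swap(raw_row, hm_row))
```

The comparison uses `math.isclose` rather than exact equality because values such as 0.14359999999999995 carry floating-point rounding from the subtraction.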

earlEBI commented 1 year ago

@ljwh2 Could you tell me a bit more about what the problem is?

earlEBI commented 1 year ago

I think this ticket is covered by https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/1056 and can be closed