EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Investigate & fix duplication of SNPs #1275

Open ljwh2 opened 3 months ago

ljwh2 commented 3 months ago

Occasionally SNPs are duplicated during the curation process. It looks like this happens on import to Oracle.

If the studies are not yet published, the data release (DR) breaks.

If the studies are published, this causes issues in prod because the SNP is listed twice in the UI and the download, for example in GCST90085780.

Screenshot 2024-03-27 at 13.44.07.png Screenshot 2024-03-27 at 13.36.33.png

- Some recent examples are rs71543110 and rs199679345; see also goci#719.

I did a quick analysis of the associations download: 83 SNPs appear to be duplicated in prod. A quick check suggests all of these look like merged SNPs in our UI, with the variant ID appearing as rs1 (rs2) in the search snippet, but I can't verify this in Ensembl.

Screenshot 2024-03-27 at 14.03.16.png

All but 3 of them have the "merged" flag set to 0.

All are included in studies published in the Catalog after March 2022, although some recently added merged SNPs are correctly represented in the UI, e.g. https://www.ebi.ac.uk/gwas/search?query=rs138055607. Note that March 2022 is around the time we switched to using depo-curation for the routine curation workflow.

The full list of associations with duplicate SNPs (411 associations, 83 SNPs) is attached: assocs with SNP duplication.xlsx
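For anyone wanting to repeat the check, here is a rough sketch of the kind of analysis described above. It is not the exact script used; it assumes the associations download is a tab-separated file with the columns `STUDY ACCESSION`, `SNPS`, `STRONGEST SNP-RISK ALLELE`, `P-VALUE` and `MERGED`, and that a duplicated SNP shows up as two or more rows sharing the same values in the first four of those columns. The function names and the duplicate criterion are illustrative assumptions and may need adjusting.

```python
import csv
from collections import defaultdict

# Assumed column names; adjust to the header of the actual download.
KEY_COLUMNS = ("STUDY ACCESSION", "SNPS", "STRONGEST SNP-RISK ALLELE", "P-VALUE")


def load_rows(path):
    """Read a tab-separated associations file into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def find_duplicated_snps(rows, key_columns=KEY_COLUMNS):
    """Return {rsID: [rows]} for rows that repeat the same key columns.

    Assumption: a duplicated SNP appears as 2+ rows with identical
    study accession, rsID, risk allele and p-value.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row.get(col, "") for col in key_columns)
        groups[key].append(row)

    duplicated = defaultdict(list)
    for members in groups.values():
        if len(members) > 1:
            duplicated[members[0].get("SNPS", "")].extend(members)
    return dict(duplicated)


def summarise(duplicated):
    """Count affected SNPs and associations, and list SNPs whose rows all have MERGED == "0"."""
    merged_flag_0 = sorted(
        snp
        for snp, members in duplicated.items()
        if all(m.get("MERGED") == "0" for m in members)
    )
    return {
        "snps": len(duplicated),
        "associations": sum(len(members) for members in duplicated.values()),
        "merged_flag_0": merged_flag_0,
    }
```

Calling `summarise(find_duplicated_snps(load_rows(path)))` on a local copy of the download would give the SNP and association counts to compare against the attached spreadsheet.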

This needs investigating and fixing such that:

- curators can extract SNPs as described in papers, which may include old or new rsIDs
- unpublished SNPs do not break the DR
- published SNPs appear only once in the UI & download

Santhi1901 commented 3 months ago

rs71543110, found in pmid:37500982 (GCST90321118, status: level 2 curation done), was causing issues during the DR. For the current DR, I deleted all associations for this GCST. Also attached is the list of SNPs from this publication that were not found in Ensembl: SNP_not found in ensembl.xlsx

Santhi1901 commented 2 months ago

The PMID containing rs71543110 has now been published and didn't cause any issues during the DR. The SNP was, however, duplicated in the UI.

ljwh2 commented 2 months ago

Another example that caused the DR to fail: Association 131528254, Accession Id 'GCST90428059', Study Id '131528155', RsId 'rs575623373'