Open ljwh2 opened 3 months ago
rs71543110, found in PMID:37500982 (GCST90321118, status "Level 2 curation done"), was causing issues during the data release (DR). For the ongoing DR, I deleted all associations for this GCST. Also attached is the list of SNPs from this publication that were not found in Ensembl: SNP_not found in ensembl.xlsx
This PMID containing rs71543110 was published and didn't cause any issues during the DR. The SNP was duplicated in the UI.
Another example that caused the DR to fail: Association 131528254, Accession Id 'GCST90428059', Study Id '131528155', RsId 'rs575623373'
Occasionally SNPs are duplicated during the curation process. It looks like this happens on import to Oracle.
If the studies are not yet published, the data release breaks.
If the studies are published, this causes issues in prod, as the SNP is listed twice in the UI and the download, e.g. in GCST90085780.
- Some recent examples are rs71543110, rs199679345, see also goci#719
I did a quick analysis of the associations download; it looks like 83 SNPs are duplicated in prod. A quick check suggests all of these look like merged SNPs in our UI, with the variant ID appearing as rs1 (rs2) in the search snippet, but I can't verify this in Ensembl.
All but 3 of them have the "merged" flag set to 0.
All are included in studies published in the Catalog after March 2022, although there are also examples of recently added merged SNPs that are correctly represented in the UI, e.g. https://www.ebi.ac.uk/gwas/search?query=rs138055607. Note that this is around the time we switched to using depo-curation for the routine curation workflow.
The full list of associations with duplicate SNPs (411 associations, 83 SNPs) is attached: assocs with SNP duplication.xlsx
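For anyone wanting to repeat this kind of check, here is a minimal sketch of how duplicate SNP links could be flagged. It is not the script used for the analysis above. It assumes the data can be exported as one row per association-to-SNP link, in the form (association id, study accession, rsID), and it treats "duplicate" as the same rsID being linked to the same association more than once. Both the row layout and that definition are assumptions, and the function name and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def find_duplicate_snp_links(rows):
    """Flag rsIDs that are linked to the same association more than once.

    rows: iterable of (association_id, study_accession, rs_id) tuples,
          one tuple per association-to-SNP link (assumed export layout).
    Returns a dict mapping rs_id -> sorted list of
    (study_accession, association_id) pairs where the link is repeated.
    """
    # Count how often each exact link occurs in the export.
    link_counts = Counter(rows)

    duplicates = defaultdict(list)
    for (association_id, study_accession, rs_id), n in link_counts.items():
        if n > 1:
            duplicates[rs_id].append((study_accession, association_id))

    return {rs_id: sorted(hits) for rs_id, hits in duplicates.items()}


# Placeholder data only; these are not real Catalog identifiers.
example_rows = [
    (1, "GCST_A", "rs1"),
    (1, "GCST_A", "rs1"),  # repeated link, should be flagged
    (2, "GCST_A", "rs2"),
    (3, "GCST_B", "rs1"),
    (3, "GCST_B", "rs1"),  # repeated link, should be flagged
]

duplicated = find_duplicate_snp_links(example_rows)
print(len(duplicated), "duplicated SNPs")
```

Something along these lines could be run before a DR so that offending associations are listed up front instead of being discovered when the release fails.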
This needs investigating and fixing such that
- curators can extract SNPs as described in papers, which may include old or new rsIDs
- unpublished SNPs do not break the DR
- published SNPs appear only once in the UI & download