EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Ensembl mapping pipeline - incorrect Y-RNA mapping #727

Open earlEBI opened 2 years ago

earlEBI commented 2 years ago

From gwas-info:

"I downloaded the catalog data from your website (gwas-association-downloaded_2022-05-19-EFO_0000729.tsv) and unfortunately recognised, that all occurrences of the gene name Y_RNA are mapped to ENSG00000199357, which is at least in most cases most unlikely (there are several Y_RNA genes across all chromosomes)."

Searching Ensembl I can see ENSG00000199357 is on chr18: https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000199357;r=18:23456371-23456467;t=ENST00000362487

but is in the upstream / downstream mapping column for many associations on different chromosomes:

[Screenshot 2022-05-19 at 10.08.11.png]

earlEBI commented 2 years ago

follow-up email from user: "it is also true for the gene SNORD116 which is currently mapped to ENSG00000252985 on chr 9 instead of the correct gene ENSG00000212553 on chr 13."

ljwh2 commented 1 year ago

@sajo-ebi To determine whether this is a real issue, I need to understand how the UPSTREAM_GENE_ID, DOWNSTREAM_GENE_ID and SNP_GENE_IDS columns are generated. I suspect the mapping is done as SNP -> MAPPED_GENE -> GENE_ID, when it should be SNP -> GENE_ID directly.

ljwh2 commented 11 months ago

@sprintell I need the process flow documentation to define this issue properly

sajo-ebi commented 9 months ago
ljwh2 commented 9 months ago

Thanks @sajo-ebi. To clarify: in step 2, "get overlapping genes", is this the gene name? And is the final step then to retrieve the Ensembl ID for the retrieved genes?

sajo-ebi commented 9 months ago

@ljwh2 The overlapping genes are the ones whose coordinates match the SNP's chromosome position. The Ensembl ID and gene information are retrieved in step 2 itself; the gene information for the upstream and downstream genes is determined after that.
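For illustration, a position-based lookup of the kind described above can be sketched as follows. This is a minimal sketch, not the goci pipeline code: the `Gene` class, the `map_snp_by_position` function, the gene IDs and all coordinates are hypothetical, and "upstream" is assumed here to mean the nearest gene at a lower coordinate on the same chromosome.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Gene:
    gene_id: str  # hypothetical identifier, not a real Ensembl ID
    name: str
    chrom: str
    start: int
    end: int


def map_snp_by_position(chrom: str, pos: int, genes: Iterable[Gene]) -> dict:
    """Find overlapping, upstream and downstream genes purely by coordinates."""
    same_chrom = [g for g in genes if g.chrom == chrom]
    overlapping = [g for g in same_chrom if g.start <= pos <= g.end]
    # Assumption: upstream = nearest gene ending before the SNP position.
    upstream = max((g for g in same_chrom if g.end < pos),
                   key=lambda g: g.end, default=None)
    # Assumption: downstream = nearest gene starting after the SNP position.
    downstream = min((g for g in same_chrom if g.start > pos),
                     key=lambda g: g.start, default=None)
    return {
        "overlapping_ids": [g.gene_id for g in overlapping],
        "upstream_id": upstream.gene_id if upstream else None,
        "upstream_distance": pos - upstream.end if upstream else None,
        "downstream_id": downstream.gene_id if downstream else None,
        "downstream_distance": downstream.start - pos if downstream else None,
    }


# Made-up coordinates: two genes share the name Y_RNA but sit on different chromosomes.
GENES = [
    Gene("ID_YRNA_CHR22", "Y_RNA", "22", 1000, 1100),
    Gene("ID_YRNA_CHR4", "Y_RNA", "4", 1200, 1300),
    Gene("ID_ACTBP15", "ACTBP15", "22", 9634, 9800),
]

result = map_snp_by_position("22", 1315, GENES)
print(result["upstream_id"], result["upstream_distance"])
```

Because the gene ID is taken from the gene record found by position, a same-named gene on another chromosome can never be returned by this lookup.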

ljwh2 commented 9 months ago

@ljwh2 to investigate further and provide Sajo with rsIDs to investigate

ljwh2 commented 8 months ago

Some examples:

rs5758209: SNP maps to genomic location chr22:41065861, upstream gene Y_RNA. The upstream gene ID is ENSG00000201314, but this ID maps to a genomic location on chr4.

rs1705773: SNP maps to genomic location chr12:34016940, upstream gene Y_RNA. The upstream gene ID is ENSG00000201314, which has a genomic location on chr4.

The problem is that there are several genes all called Y_RNA on different genomic locations and with different gene IDs. But we give the same gene ID for all.
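A check like the one done by hand above could be automated. The sketch below is only an illustration: `find_chromosome_mismatches` is a hypothetical helper, the rows are reduced to the columns needed, and the gene-to-chromosome lookup is assumed to be supplied by the caller (for example, built from Ensembl). The values are the ones quoted in this comment.

```python
def find_chromosome_mismatches(rows, gene_chrom):
    """Return (snp, column, gene_id, gene_chromosome) for every gene ID
    that lies on a different chromosome than the SNP it is attached to."""
    mismatches = []
    for row in rows:
        for column in ("UPSTREAM_GENE_ID", "DOWNSTREAM_GENE_ID"):
            gene_id = row.get(column)
            if not gene_id:
                continue  # column empty or absent for this row
            chrom = gene_chrom.get(gene_id)
            if chrom is not None and chrom != row["CHR_ID"]:
                mismatches.append((row["SNPS"], column, gene_id, chrom))
    return mismatches


# Rows reduced to the relevant columns, using the two examples from this comment.
rows = [
    {"SNPS": "rs5758209", "CHR_ID": "22", "UPSTREAM_GENE_ID": "ENSG00000201314"},
    {"SNPS": "rs1705773", "CHR_ID": "12", "UPSTREAM_GENE_ID": "ENSG00000201314"},
]
# Gene ID -> chromosome, as observed above (ENSG00000201314 is on chr4).
gene_chrom = {"ENSG00000201314": "4"}

mismatches = find_chromosome_mismatches(rows, gene_chrom)
for snp, column, gene_id, chrom in mismatches:
    print(f"{snp}: {column} {gene_id} is on chr{chrom}")
```

Gene IDs missing from the lookup are skipped rather than flagged, so an incomplete lookup under-reports instead of raising false alarms.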

sajo-ebi commented 7 months ago

CHR_ID: 22
CHR_POS: 41065861
REPORTED GENE(S): ST13
MAPPED_GENE: Y_RNA - ACTBP15
UPSTREAM_GENE_ID: ENSG00000201749
DOWNSTREAM_GENE_ID: ENSG00000213857
SNP_GENE_IDS: (empty)
UPSTREAM_GENE_DISTANCE: 215
DOWNSTREAM_GENE_DISTANCE: 8319
STRONGEST SNP-RISK ALLELE: rs5758209-T
SNPS: rs5758209

@ljwh2 The above is the catalog download file entry for SNP rs5758209. It shows an upstream gene ID of ENSG00000201749. Where did you get the value ENSG00000201314 for the upstream gene ID?

ljwh2 commented 7 months ago

The problem is that there are several genes all called Y_RNA on different genomic locations, each having a different gene ID. But we give the same gene ID for all.

Based on the information so far, I would guess that each time the mapping is run, it picks one gene ID (presumably the first one it encounters) and applies it to every instance of gene name = Y_RNA. The original bug report from the user said that all were mapped to ENSG00000199357; in December I found they were all mapped to ENSG00000201314; Sajo found they are mapped to ENSG00000201749; and in the latest release they are mapped to ENSG00000201343. On every occasion they are all mapped to the same gene ID, when there should be several different ones.
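To make the guess concrete: if gene IDs are looked up through an index keyed by gene name, every gene called Y_RNA collapses to whichever ID is seen first, and that ID changes whenever the input order changes. The sketch below only illustrates this hypothesis and is not taken from the goci code; `index_by_name` is a hypothetical function, and the four IDs are the ones observed at different times in this thread.

```python
def index_by_name(genes):
    """Build a name -> gene ID index that keeps only the first ID per name."""
    index = {}
    for gene_id, name in genes:
        index.setdefault(name, gene_id)  # later IDs with the same name are dropped
    return index


# The four IDs that all Y_RNA rows were mapped to at different points in time.
Y_RNA_IDS = [
    "ENSG00000199357",
    "ENSG00000201314",
    "ENSG00000201749",
    "ENSG00000201343",
]

run_a = [(gene_id, "Y_RNA") for gene_id in Y_RNA_IDS]
run_b = list(reversed(run_a))  # same genes, different encounter order

id_a = index_by_name(run_a)["Y_RNA"]
id_b = index_by_name(run_b)["Y_RNA"]
print(id_a, id_b)
```

Both runs return exactly one ID for the name Y_RNA, but a different one each time, which would match the pattern of a single shared ID that changes between releases.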

rs5758209 - SNP maps to genomic location chr22:41065861, upstream gene Y_RNA. I think the gene ID should be ENSG00000199515.

rs1705773 - SNP maps to genomic location chr12:34016940, upstream gene Y_RNA. I think the gene ID should be ENSG00000201624.

rs6671332 - SNP maps to genomic location chr1:161702661, upstream gene Y_RNA. I think the gene ID should be ENSG00000199595.