EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

SNPs still not showing mapping in download file #1191

Closed ljwh2 closed 1 week ago

ljwh2 commented 11 months ago

I did some analysis of the latest association downloads file (2023-10-29) according to the following steps

  1. Filtered for associations with no data in the CHR_ID column
  2. Filtered on SNPS column for SNP IDs beginning with rs
  3. Excluded SNPs containing " x " or ";" (multi-SNP associations)
  4. Removed duplicate SNP IDs
  5. This gave a list of 1624 potentially mappable SNP IDs
  6. I ran these through the SNP validation tool implemented in DepoCuration
  7. 121 SNPs returned "Not found in Ensembl"
  8. 1503 SNPs can be found in Ensembl but do not have mapping in the download file
  9. Manual check (vs Ensembl UI) of the first few suggests most of these have a patch location as well as canonical chromosome listed in Ensembl, but were not caught by the latest fix.
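
Steps 1 to 5 above could be reproduced with a short script along these lines. This is only a sketch: it assumes the download file is tab-separated with `CHR_ID` and `SNPS` columns as named in the steps, and the `load_rows` helper and its file path argument are hypothetical.

```python
import csv


def load_rows(path):
    """Read the associations download file (assumed tab-separated) as dict rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def potentially_mappable_rsids(rows):
    """Return unique single-SNP rsIDs from rows that have no CHR_ID value."""
    seen = set()
    result = []
    for row in rows:
        chr_id = (row.get("CHR_ID") or "").strip()
        snp = (row.get("SNPS") or "").strip()
        if chr_id:
            continue  # step 1: keep only associations with no CHR_ID
        if not snp.startswith("rs"):
            continue  # step 2: keep only IDs beginning with rs
        if " x " in snp or ";" in snp:
            continue  # step 3: drop multi-SNP associations
        if snp not in seen:  # step 4: remove duplicate SNP IDs
            seen.add(snp)
            result.append(snp)
    return result
```

The resulting list (step 5) is what would then be passed to the SNP validation tool in step 6.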

@sajo-ebi please investigate but we may need to wait for fix on Ensembl side (which should be implemented very soon)

Results of the analysis are attached: SNP mapping analysis 7_11_23.xlsx

sajo-ebi commented 10 months ago

I have run the mapping pipeline against the missing rsIDs. At first glance, most seem to have been mapped successfully, except the ones missing from Ensembl. This will be reflected after the next data release (DR).

ljwh2 commented 10 months ago

Unknown why these did not get mapped. Repeat analysis after next remapping.

ljwh2 commented 9 months ago

I have reanalysed the data after the 20 Dec release. At step 8, 141 SNPs were found in Ensembl but do not have mapping in the download file. See the attached file:

  - 125 were newly added to the database in this release ("new SNP"). All those I checked had patch locations.
  - 9 pre-existing SNPs had no mapping either in this release or the previous one analysed ("no mapping in previous"). Of these, 7 had patch locations and two did not (rs34420345 and rs781710751).
  - 6 pre-existing SNPs had mapping in the previous release ("mapping was present before"). In all cases, new data had been added (new or updated associations for the SNP) since the last release, and now no mapping is shown for any of the associations.

SNP mapping analysis 8 Jan 2024.txt
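
The three labels used in this comment ("new SNP", "no mapping in previous", "mapping was present before") could be assigned by comparing the current unmapped list against the previous release. A minimal sketch, assuming the previous release has been summarised as a dict from rsID to whether it had mapping; the function and argument names are hypothetical:

```python
def classify_unmapped(unmapped_now, previous_mapped):
    """Group currently unmapped rsIDs by their status in the previous release.

    unmapped_now: iterable of rsIDs valid in Ensembl but unmapped in this release.
    previous_mapped: dict of rsID -> True/False for every SNP in the previous
    release (True means the SNP had mapping there).
    """
    groups = {
        "new SNP": [],
        "no mapping in previous": [],
        "mapping was present before": [],
    }
    for rsid in sorted(unmapped_now):
        if rsid not in previous_mapped:
            groups["new SNP"].append(rsid)
        elif previous_mapped[rsid]:
            groups["mapping was present before"].append(rsid)
        else:
            groups["no mapping in previous"].append(rsid)
    return groups
```

The "mapping was present before" group is the one that points to a regression rather than to missing data in Ensembl.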

ljwh2 commented 9 months ago

Since we know the issue with patch locations should be fixed in the next Ensembl release, I suggest we don't do anything about those, but it would be worth investigating the other issues, i.e.:

  - rs34420345 and rs781710751 have no mapping either in this release or the earlier one
  - These 6 SNPs had mapping previously, but now do not:

image.png

sprintell commented 9 months ago

This needs to be confirmed after the data release of today is executed. @sajo-ebi

ljwh2 commented 9 months ago

There are still many valid SNPs without mapping in the latest release (538 rsIDs). 469 of these had mapping in the previous release that I analysed (in Dec). I’m struggling to see any pattern: a spot check suggests several are on patch locations, but not all. I noticed there are batches from the same publication, for example 55 from PMID 34648354 (I haven’t been able to check if they are all in the same study). The list of unmapped valid SNPs is attached: Unmapped valid SNPs Jan 2024.txt

sajo-ebi commented 8 months ago

@ljwh2 I compared the new list against the missing SNPs from the 20 Dec release, which is the list we last ran the mapping pipeline against. Of those 140 rsIDs, only one, 'rs41302593', is unmapped in the new list based on the current data release. Apart from that run against the missing rsIDs, the only other execution is the scheduled mapping run that happens when a new submission is made, so the remaining rsIDs were already unmapped in the DB. I am not clear what changes with every data release to bring in new unmapped rsIDs, since only scheduled mapping is running; these rsIDs were already unmapped and were picked up by the latest data release. I believe this will be fixed when we do a full remapping again, which executes against the entire dataset rather than selectively as is happening now.

ljwh2 commented 7 months ago

Results from DR 2024-03-11: There are 3236 potentially mappable rsIDs. Of these, only 124 are valid in Ensembl but have no mapping in the Catalog. I manually checked around 30, and nearly half have some issue in the Ensembl data.

For example, these are found in Ensembl but have no mapping information (see https://www.ensembl.org/Homo_sapiens/Variation/Explore?v=rs691461;vdb=variation):
rs527382443, rs691461, rs7031748, rs2437258

Some have no mapping to any canonical chromosome:
rs41302593, rs72617242, rs6004031, rs9406351, rs10423754, rs361433

Here are some examples that look fine in Ensembl:
rs1854685, rs66767559, rs4027773, rs73598500, rs1795325153

The full list of 124 rsIDs is attached. Action for @sajo-ebi: rerun mapping against these and check if any more can be mapped, and also check if the types of example above can be distinguished based on the response from Ensembl. Unmapped SNPs March 14 2024.txt

sprintell commented 6 months ago

@sajo-ebi has triggered the mapping pipeline for the rsIDs above, so @ljwh2 please check if those SNPs are now mapped.

sajo-ebi commented 6 months ago

@ljwh2 Attached is the association mapping report for the 124 rsIDs: https://docs.google.com/spreadsheets/d/1NfYgu6hTGBT9DP40p_yoSv8Ol08_wEPHnAyKt2XTWUk/edit#gid=0

sajo-ebi commented 5 months ago

@ljwh2 attached is the SNP to rsID mapping. I can see many chr ones. The only thing I can think of is that the scheduled run was also running while the file-based run was happening, hence some associations outside the list were also present in the report. There are no logs now to tell conclusively. Asscn_Rsid_mapping.tsv.zip

ljwh2 commented 5 months ago

Of the 124 SNPs:

  - 58 got successfully mapped in the second round
  - 53 exist in Ensembl with no mapping information
  - 12 have only a patch location
  - One remains that could be mapped: rs1854685

ljwh2 commented 4 months ago

@sajo-ebi to add better error logging. We need to be able to distinguish between the Ensembl API returning a blank response and a SNP not being found.
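
One way the logging could separate these cases is to classify each Ensembl lookup before recording it. The sketch below is not the pipeline's actual code. It assumes the lookup yields an HTTP status code plus a parsed JSON body, that an unknown rsID comes back as a 400 with an `error` field, and that a known variant carries a `mappings` list whose entries have a `seq_region_name`. All of these are assumptions to be checked against the real Ensembl responses, and the category names are hypothetical.

```python
# Canonical human chromosomes; any other seq_region_name is treated as a
# patch or alternate location in this sketch.
CANONICAL = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def classify_lookup(status_code, payload):
    """Return a category string suitable for an error log entry."""
    if status_code == 400 and isinstance(payload, dict) and "error" in payload:
        return "not_found"  # assumed shape of "SNP not found in Ensembl"
    if status_code != 200 or not isinstance(payload, dict):
        return "request_failed"  # e.g. timeout or server error, worth retrying
    mappings = payload.get("mappings") or []
    regions = [m.get("seq_region_name") for m in mappings]
    if not regions:
        return "no_mapping_information"  # found, but blank mapping data
    if any(region in CANONICAL for region in regions):
        return "mapped"
    return "patch_only"  # found, but only on non-canonical regions
```

Logging the category alongside the rsID would let the groups from the manual analysis (no mapping information, patch only, not found) be counted directly instead of being checked by hand in the Ensembl UI.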

ljwh2 commented 3 weeks ago

@sajo-ebi to add details of where this information can be found in the database and then close the ticket

sajo-ebi commented 1 week ago

We can use the following query to check mapping errors in the DB:

select count(*) from association_report where snp_error is not null
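
For illustration, the same query can be run from Python, together with a grouped variant that shows how many associations carry each error message. This is a sketch against an in-memory SQLite table: only the `association_report` table name and the `snp_error` column come from the query above, while the rest of the schema, the sample rows, and the helper names are hypothetical, and the production database may use a different engine.

```python
import sqlite3


def count_mapping_errors(conn):
    """Run the query from the comment above and return the count."""
    row = conn.execute(
        "select count(*) from association_report where snp_error is not null"
    ).fetchone()
    return row[0]


def errors_by_message(conn):
    """Count reports per distinct snp_error value, most frequent first."""
    return conn.execute(
        "select snp_error, count(*) from association_report "
        "where snp_error is not null "
        "group by snp_error order by count(*) desc, snp_error"
    ).fetchall()


def demo_connection():
    """Build a throwaway table with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table association_report (id integer primary key, snp_error text)"
    )
    conn.executemany(
        "insert into association_report (snp_error) values (?)",
        [(None,), ("hypothetical error A",), ("hypothetical error A",),
         ("hypothetical error B",)],
    )
    return conn
```

The grouped query is useful only if the error text recorded in `snp_error` distinguishes the different failure types.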

sprintell commented 1 week ago

Hi @ljwh2

ljwh2 commented 1 week ago

can be closed