EBISPOT / gwas-user-requests

Repository to collect user requests and bug reports for the GWAS Catalog
3 stars 0 forks source link

Questions about how variants are being handled in the GWAS Catalog #61

Open DSuveges opened 2 years ago

DSuveges commented 2 years ago

Hi GWAS Team,

We are in the process of reviewing the logic of certain steps of our pipelines. For this, I need some help from you to better understand the data I could download from the GWAS Catalog website. These questions mostly revolve around how variants are being handled, mapped etc. Nothing major or nothing actionable on your side. The only reason I'm opening this ticket is to make the decisions I'll make trackable for the future.

Thank you so much for your answers! (I might came back with more :D)

ljwh2 commented 2 years ago

We are diving into the code to get a solid basis to answer your questions from - will come back to you in the next couple of weeks.

DSuveges commented 2 years ago

Thank you for looking into this @ljwh2!

ljwh2 commented 11 months ago

@DSuveges please confirm the latest data release looks good from your side? We are waiting on a response from Ensembl regarding errors for SNPs with multiple locations, detailed in this ticket, but everything else should be fixed

DSuveges commented 11 months ago

I haven't had enough time to look into the details, but it seems compared to the 22.11 release, proportionately the number of unmapped variants(associations) increased:

QC_flag flag_count version total_count flag_percent
Incomplete genomic mapping 39,939 23.09.11 552,116 7.2
Incomplete genomic mapping 19,872 22.11.29 434,351 4.6

It's very strange, however I'm not sure when I'll have enough time to take a deeper look.

ljwh2 commented 11 months ago

Hi @DSuveges Is that taking into account whether the associations have a resolvable rsID?

DSuveges commented 11 months ago

No, my bad. Looking back into my notes, this is a bigger number containing all associations with no genomic mapping. Roughly 2k association has no mapping, while the snp identifier is a single rsID.

The numbers got improved: this number went down to 2,090 from 2,412 (same releases that were compared above).

DSuveges commented 11 months ago

These are the numbers of variants mapped in the new vs old release:

+-----------------------+------+
|mappedIn               |count |
+-----------------------+------+
|Mapped in both releases|213633|
|Mapped only in new     |45605 |
|Not mapped at all      |1621  |
|Mapped only in old     |15    |
+-----------------------+------+

The wast majority of the variants are mapped in both, however there's a better mapping efficiency it the new release (however I did not intersect the studies, so it is expected that there are more new associations in the new release. HOwever there are still variants that were not mapped in the new release, however they were successfully mapped in the old:

+-----------+-----------+----------+------------------+
|rsId       |snp_id_old |snp_id_new|mappedIn          |
+-----------+-----------+----------+------------------+
|rs9340789  |6:151822753|null      |Mapped only in old|
|rs1353747  |5:59041654 |null      |Mapped only in old|
|rs59212267 |3:24423464 |null      |Mapped only in old|
|rs6060278  |20:35165459|null      |Mapped only in old|
|rs11673591 |19:41480023|null      |Mapped only in old|
|rs11874    |17:46939827|null      |Mapped only in old|
|rs8024538  |15:84825188|null      |Mapped only in old|
|rs1805007  |16:89919709|null      |Mapped only in old|
|rs5835988  |2:164645418|null      |Mapped only in old|
|rs769447487|13:94594884|null      |Mapped only in old|
|rs6088735  |20:35157873|null      |Mapped only in old|
|rs139130389|11:72139111|null      |Mapped only in old|
|rs3130467  |6:31219298 |null      |Mapped only in old|
|rs9262492  |6:31018238 |null      |Mapped only in old|
|rs1805008  |16:89919736|null      |Mapped only in old|
+-----------+-----------+----------+------------------+

For rs11673591, I could confirm the current GWAS Catalog release does not have mapping, however as the Google search show, this variant used to have mapping: image

ljwh2 commented 11 months ago

@sajo-ebi please could you look into whats going on with those variants in Daniel's table? They do not appear on the list of variants with Ensembl errors.

ljwh2 commented 11 months ago

@DSuveges we can see mapping for all those variants in the latest release (except for rs11673591, this is one that gives an error in Ensembl API). How are you verifying the mapping?

DSuveges commented 11 months ago

My "new" dataset was from 11th September. Could it explain the discrepancy?

DSuveges commented 11 months ago

By updating for the most recent release (r2023-09-2) things didn't get much better:

+-----------------------+------+
|mappedIn               |count |
+-----------------------+------+
|Mapped in both releases|213390|
|Mapped only in new     |1559  |
|Mapped only in old     |258   |
|Not mapped at all      |249   |
+-----------------------+------+

Some of the 250:

+-----------+-----------+----------+------------------+
|rsId       |snp_id_old |snp_id_new|mappedIn          |
+-----------+-----------+----------+------------------+
|rs10681181 |9:89591874 |null      |Mapped only in old|
|rs11369670 |3:177187167|null      |Mapped only in old|
|rs11446793 |2:55961630 |null      |Mapped only in old|
|rs11651755 |17:37739849|null      |Mapped only in old|
|rs13132569 |4:60737438 |null      |Mapped only in old|
|rs139684765|1:111675471|null      |Mapped only in old|
|rs142971575|3:192461558|null      |Mapped only in old|
|rs1794265  |6:32706960 |null      |Mapped only in old|
|rs191686630|5:59181572 |null      |Mapped only in old|
|rs28383233 |6:32616376 |null      |Mapped only in old|
|rs34716983 |12:92567399|null      |Mapped only in old|
|rs35904419 |8:9519301  |null      |Mapped only in old|
|rs529619980|17:28720226|null      |Mapped only in old|
|rs543290203|14:63767112|null      |Mapped only in old|
|rs635608   |19:54192638|null      |Mapped only in old|
|rs74651796 |3:170411821|null      |Mapped only in old|
|rs749913156|10:99397622|null      |Mapped only in old|
|rs749924   |2:242084344|null      |Mapped only in old|
|rs103294   |19:54293995|null      |Mapped only in old|
|rs113271760|2:173991765|null      |Mapped only in old|
+-----------+-----------+----------+------------------+

I did checked some of them, they all had no mapping.

DSuveges commented 11 months ago

You can see this really puzzling stochasticity in the process? There are unmapped variants, but they are different and varying in number.

sajo-ebi commented 11 months ago

@DSuveges we had identified a bug with the code in Mapping pipeline & fixed the same before mapping to latest release & most of previous rsids which were not mapped previously have been mapped now, most of the ones we are having mapping failure are the ones coming from Ensembl with the latest release , Some the ids which you have shared above are in the list of rsIds which are giving error from Ensembl , we can evaluate for all & check if still there is any code in mapping pipeline

can you share the the list of rsids which have not been mapped with the new release we can try to rerun the pipeline for them

DSuveges commented 11 months ago

I have attached the list. : failed_mappings.txt Apparently all, except 3 has mapping on Ensembl (tested via sending requrests via REST API). Some of them has multiple mappings, but the overwhelming majority has only one mapping.

+-------------------------------+-----+
|Number of mappings from Ensembl|count|
+-------------------------------+-----+
|                           null|    3|
|                            1.0|  123|
|                           10.0|   21|
|                           18.0|    1|
|                            2.0|   65|
|                            3.0|   11|
|                            4.0|    3|
|                            5.0|    8|
|                            6.0|    8|
|                            7.0|    7|
|                            8.0|    8|
+-------------------------------+-----+

The three examples, where no mapping was available are composite variants:

+------------------------+-----------+----------+------------------+--------+-------------+
|rsId                    |snp_id_old |snp_id_new|mappedIn          |mappings|mapping_count|
+------------------------+-----------+----------+------------------+--------+-------------+
|rs34922185, rs34922185  |6:29850058 |null      |Mapped only in old|null    |null         |
|rs756441495, rs756441495|16:15012271|null      |Mapped only in old|null    |null         |
|rs782470771, rs782470771|7:73608604 |null      |Mapped only in old|null    |null         |
+------------------------+-----------+----------+------------------+--------+-------------+

I was only filtering for composite associations based on the presence of ;, x and -... The rs34922185, rs34922185 is especially strange given the same variant id is listed twice. Maybe it can be fixed in the database (other associations from the same study doesn't seem to be composite)

sajo-ebi commented 11 months ago

@DSuveges Thanks for sharing the failed mapping list , we have reran mapping pipeline again for these , out of the 258 variants , 125 are now mapped in Curation DB (next Data release will update data in Catalog). 133 we are getting error from Ensembl API , the error we are getting while trying to retrieving Cytogenetic Band for multiple locations of the variant , we have shared the details with Ensembl team for further analysis. Attaching files with details of mapped rsids, Failed rsids & sheet containing Ensembl error for the rsIds which are not mapped Mapping_Error_Daniel_List.xlsx Rsid-mapping-results.xlsx

DSuveges commented 11 months ago

Thank you @sajo-ebi! To me it seems all process fails, where at least any of the mapping points to a patch region, because it seems the location for the patch regions are not defined well. See the below example: rs10908278.

When looking up the variant on Ensembl's REST API these are the mappings:

  {
    "start": 37739961,
    "ancestral_allele": "A",
    "assembly_name": "GRCh38",
    "end": 37739961,
    "seq_region_name": "17",
    "coord_system": "chromosome",
    "strand": 1,
    "location": "17:37739961-37739961",
    "allele_string": "A/C/T"
  },
  {
    "end": 37744485,
    "coord_system": "scaffold",
    "seq_region_name": "HSCHR17_7_CTG4",
    "strand": 1,
    "start": 37744485,
    "ancestral_allele": null,
    "assembly_name": "GRCh38",
    "location": "HSCHR17_7_CTG4:37744485-37744485",
    "allele_string": "T/A/C"
  }

Interestingly, basepair location is the same on both chromosome and patch, but this cannot be correct, becaure the patch chromosome is just not that long. The same error you are getting when you try to retrive the sequence based on the above REST API response:

{
  "error": "Cannot request a slice whose start (37739961) is greater than 2877074 for HSCHR17_7_CTG4."
}

To me it seems an Ensembl issue, which might as well being a feature on their interpretation assuming you know the relative position how the patch region on the reference chromosome. That would explain why the basepair position is the same. To be honest I would not be worried too much about this problem, from the GWAS Catalog point of view, only mappings for the canonical chromosomes are important, so I would drop all mapping pointing to patches (applying a filter: mapping["coord_system"] == "chromosome"). What do you think @ljwh2?

@sajo-ebi , when you are saying "out of the 258 variants , 125 are now mapped". To me it seems there's no satisfying explanation why they failed in the first palce, so there's a risk that these seemingly "normal" cases occur randomly, and any time you are re-running the mapping on the full set, a similar set of random "normal" variants might fail mapping again. What do you think? Also, while I did my most recent exploration and tried to get mapping I got this:

rsId canonical_mapping
rs2519093 ['No mapping at all.']
rs62073542 ['No mapping at all.']

However, when manually checking the responses, both variants do have mapping. To me it indicates that the REST API response is simply unreliable. When you are building the production service on this API request, this needs to be accounted for.

DSuveges commented 11 months ago

I knew! I keep running the same API requests on the same set of variant and getting:

{
  "error": "Can't call method \"close\" on unblessed reference at /nfs/public/ro/ensweb/live/rest/www_110/ensembl-io/modules/Bio/EnsEMBL/IO/TabixParser.pm line 121.\n"
}

It's an Ensembl bug.

sajo-ebi commented 11 months ago

@DSuveges thanks for your feedback, unfortunately we don't have the full remapping logs to revisit the mapping issue with the rsids you mentioned , but we identified a bug in the code w.r.t multi-snp mapping for association which was the root cause for the variants mapping failure which you highlighted when you opened the initial ticket , this was not any random mapping failure.

I can only think of cluster failure of some jobs as we refactored the mapping pipeline to run in cluster in batches , we will continue to look out for any failures in our scheduled remapping , we will discuss to implement the solution w.r.t the Ensembl errors you mentioned.

DSuveges commented 11 months ago

OK. Thanks for loking into it. I leave this to you guys. Please let me know when only those rsIDs don't have mapping in the GWAS Catalog data, which are actually don't have a valid chromosome:basepair position.

ljwh2 commented 11 months ago

@DSuveges thanks for this suggestion re patches, we are waiting to hear back from Ensembl, but this could be a good solution.

sajo-ebi commented 10 months ago

Have run the mapping pipeline with the fix suggested by Daniel , the failed rsIds should be mapped successfully after the Data release, @DSuveges had a query the filter mapping["coord_system"] == "chromosome") if fine for the next full remapping or this was just a workaround for the Ensembl errors?

DSuveges commented 10 months ago

This is a known issue on Ensembl that mappings on scaffolds is problematic ticket1, ticet2. This problem persist for at least 5 Ensembl releases and we don't know when the fix is rolled out. So applying this filter: mapping["coord_system"] == "chromosome" is a workaround to avoid triggering the error, however I would consider it a permanent solution as having mappings for scaffolds bears very minor relevance.

sprintell commented 10 months ago

@sajo-ebi pls check with Daniel SUveges if all is well with this.

sprintell commented 10 months ago

This should be further taken care of by issue 1191

sprintell commented 5 months ago

still waiting for issue 1191 to eb confirmed