Open DSuveges opened 2 years ago
We are diving into the code to get a solid basis to answer your questions from - will come back to you in the next couple of weeks.
Thank you for looking into this @ljwh2!
@DSuveges please confirm the latest data release looks good from your side? We are waiting on a response from Ensembl regarding errors for SNPs with multiple locations, detailed in this ticket, but everything else should be fixed
I haven't had enough time to look into the details, but it seems compared to the 22.11 release, proportionately the number of unmapped variants(associations) increased:
QC_flag | flag_count | version | total_count | flag_percent |
---|---|---|---|---|
Incomplete genomic mapping | 39,939 | 23.09.11 | 552,116 | 7.2 |
Incomplete genomic mapping | 19,872 | 22.11.29 | 434,351 | 4.6 |
It's very strange, however I'm not sure when I'll have enough time to take a deeper look.
Hi @DSuveges Is that taking into account whether the associations have a resolvable rsID?
No, my bad. Looking back into my notes, this is a bigger number containing all associations with no genomic mapping. Roughly 2k association has no mapping, while the snp identifier is a single rsID.
The numbers got improved: this number went down to 2,090 from 2,412 (same releases that were compared above).
These are the numbers of variants mapped in the new vs old release:
+-----------------------+------+
|mappedIn |count |
+-----------------------+------+
|Mapped in both releases|213633|
|Mapped only in new |45605 |
|Not mapped at all |1621 |
|Mapped only in old |15 |
+-----------------------+------+
The wast majority of the variants are mapped in both, however there's a better mapping efficiency it the new release (however I did not intersect the studies, so it is expected that there are more new associations in the new release. HOwever there are still variants that were not mapped in the new release, however they were successfully mapped in the old:
+-----------+-----------+----------+------------------+
|rsId |snp_id_old |snp_id_new|mappedIn |
+-----------+-----------+----------+------------------+
|rs9340789 |6:151822753|null |Mapped only in old|
|rs1353747 |5:59041654 |null |Mapped only in old|
|rs59212267 |3:24423464 |null |Mapped only in old|
|rs6060278 |20:35165459|null |Mapped only in old|
|rs11673591 |19:41480023|null |Mapped only in old|
|rs11874 |17:46939827|null |Mapped only in old|
|rs8024538 |15:84825188|null |Mapped only in old|
|rs1805007 |16:89919709|null |Mapped only in old|
|rs5835988 |2:164645418|null |Mapped only in old|
|rs769447487|13:94594884|null |Mapped only in old|
|rs6088735 |20:35157873|null |Mapped only in old|
|rs139130389|11:72139111|null |Mapped only in old|
|rs3130467 |6:31219298 |null |Mapped only in old|
|rs9262492 |6:31018238 |null |Mapped only in old|
|rs1805008 |16:89919736|null |Mapped only in old|
+-----------+-----------+----------+------------------+
For rs11673591, I could confirm the current GWAS Catalog release does not have mapping, however as the Google search show, this variant used to have mapping:
@sajo-ebi please could you look into whats going on with those variants in Daniel's table? They do not appear on the list of variants with Ensembl errors.
@DSuveges we can see mapping for all those variants in the latest release (except for rs11673591, this is one that gives an error in Ensembl API). How are you verifying the mapping?
My "new" dataset was from 11th September. Could it explain the discrepancy?
By updating for the most recent release (r2023-09-2
) things didn't get much better:
+-----------------------+------+
|mappedIn |count |
+-----------------------+------+
|Mapped in both releases|213390|
|Mapped only in new |1559 |
|Mapped only in old |258 |
|Not mapped at all |249 |
+-----------------------+------+
Some of the 250:
+-----------+-----------+----------+------------------+
|rsId |snp_id_old |snp_id_new|mappedIn |
+-----------+-----------+----------+------------------+
|rs10681181 |9:89591874 |null |Mapped only in old|
|rs11369670 |3:177187167|null |Mapped only in old|
|rs11446793 |2:55961630 |null |Mapped only in old|
|rs11651755 |17:37739849|null |Mapped only in old|
|rs13132569 |4:60737438 |null |Mapped only in old|
|rs139684765|1:111675471|null |Mapped only in old|
|rs142971575|3:192461558|null |Mapped only in old|
|rs1794265 |6:32706960 |null |Mapped only in old|
|rs191686630|5:59181572 |null |Mapped only in old|
|rs28383233 |6:32616376 |null |Mapped only in old|
|rs34716983 |12:92567399|null |Mapped only in old|
|rs35904419 |8:9519301 |null |Mapped only in old|
|rs529619980|17:28720226|null |Mapped only in old|
|rs543290203|14:63767112|null |Mapped only in old|
|rs635608 |19:54192638|null |Mapped only in old|
|rs74651796 |3:170411821|null |Mapped only in old|
|rs749913156|10:99397622|null |Mapped only in old|
|rs749924 |2:242084344|null |Mapped only in old|
|rs103294 |19:54293995|null |Mapped only in old|
|rs113271760|2:173991765|null |Mapped only in old|
+-----------+-----------+----------+------------------+
I did checked some of them, they all had no mapping.
You can see this really puzzling stochasticity in the process? There are unmapped variants, but they are different and varying in number.
@DSuveges we had identified a bug with the code in Mapping pipeline & fixed the same before mapping to latest release & most of previous rsids which were not mapped previously have been mapped now, most of the ones we are having mapping failure are the ones coming from Ensembl with the latest release , Some the ids which you have shared above are in the list of rsIds which are giving error from Ensembl , we can evaluate for all & check if still there is any code in mapping pipeline
can you share the the list of rsids which have not been mapped with the new release we can try to rerun the pipeline for them
I have attached the list. : failed_mappings.txt Apparently all, except 3 has mapping on Ensembl (tested via sending requrests via REST API). Some of them has multiple mappings, but the overwhelming majority has only one mapping.
+-------------------------------+-----+
|Number of mappings from Ensembl|count|
+-------------------------------+-----+
| null| 3|
| 1.0| 123|
| 10.0| 21|
| 18.0| 1|
| 2.0| 65|
| 3.0| 11|
| 4.0| 3|
| 5.0| 8|
| 6.0| 8|
| 7.0| 7|
| 8.0| 8|
+-------------------------------+-----+
The three examples, where no mapping was available are composite variants:
+------------------------+-----------+----------+------------------+--------+-------------+
|rsId |snp_id_old |snp_id_new|mappedIn |mappings|mapping_count|
+------------------------+-----------+----------+------------------+--------+-------------+
|rs34922185, rs34922185 |6:29850058 |null |Mapped only in old|null |null |
|rs756441495, rs756441495|16:15012271|null |Mapped only in old|null |null |
|rs782470771, rs782470771|7:73608604 |null |Mapped only in old|null |null |
+------------------------+-----------+----------+------------------+--------+-------------+
I was only filtering for composite associations based on the presence of ;
, x
and -
... The rs34922185, rs34922185
is especially strange given the same variant id is listed twice. Maybe it can be fixed in the database (other associations from the same study doesn't seem to be composite)
@DSuveges Thanks for sharing the failed mapping list , we have reran mapping pipeline again for these , out of the 258 variants , 125 are now mapped in Curation DB (next Data release will update data in Catalog). 133 we are getting error from Ensembl API , the error we are getting while trying to retrieving Cytogenetic Band for multiple locations of the variant , we have shared the details with Ensembl team for further analysis. Attaching files with details of mapped rsids, Failed rsids & sheet containing Ensembl error for the rsIds which are not mapped Mapping_Error_Daniel_List.xlsx Rsid-mapping-results.xlsx
Thank you @sajo-ebi! To me it seems all process fails, where at least any of the mapping points to a patch region, because it seems the location for the patch regions are not defined well. See the below example: rs10908278.
When looking up the variant on Ensembl's REST API these are the mappings:
{
"start": 37739961,
"ancestral_allele": "A",
"assembly_name": "GRCh38",
"end": 37739961,
"seq_region_name": "17",
"coord_system": "chromosome",
"strand": 1,
"location": "17:37739961-37739961",
"allele_string": "A/C/T"
},
{
"end": 37744485,
"coord_system": "scaffold",
"seq_region_name": "HSCHR17_7_CTG4",
"strand": 1,
"start": 37744485,
"ancestral_allele": null,
"assembly_name": "GRCh38",
"location": "HSCHR17_7_CTG4:37744485-37744485",
"allele_string": "T/A/C"
}
Interestingly, basepair location is the same on both chromosome and patch, but this cannot be correct, becaure the patch chromosome is just not that long. The same error you are getting when you try to retrive the sequence based on the above REST API response:
{
"error": "Cannot request a slice whose start (37739961) is greater than 2877074 for HSCHR17_7_CTG4."
}
To me it seems an Ensembl issue, which might as well being a feature on their interpretation assuming you know the relative position how the patch region on the reference chromosome. That would explain why the basepair position is the same. To be honest I would not be worried too much about this problem, from the GWAS Catalog point of view, only mappings for the canonical chromosomes are important, so I would drop all mapping pointing to patches (applying a filter: mapping["coord_system"] == "chromosome"
). What do you think @ljwh2?
@sajo-ebi , when you are saying "out of the 258 variants , 125 are now mapped". To me it seems there's no satisfying explanation why they failed in the first palce, so there's a risk that these seemingly "normal" cases occur randomly, and any time you are re-running the mapping on the full set, a similar set of random "normal" variants might fail mapping again. What do you think? Also, while I did my most recent exploration and tried to get mapping I got this:
rsId | canonical_mapping |
---|---|
rs2519093 | ['No mapping at all.'] |
rs62073542 | ['No mapping at all.'] |
However, when manually checking the responses, both variants do have mapping. To me it indicates that the REST API response is simply unreliable. When you are building the production service on this API request, this needs to be accounted for.
I knew! I keep running the same API requests on the same set of variant and getting:
{
"error": "Can't call method \"close\" on unblessed reference at /nfs/public/ro/ensweb/live/rest/www_110/ensembl-io/modules/Bio/EnsEMBL/IO/TabixParser.pm line 121.\n"
}
It's an Ensembl bug.
@DSuveges thanks for your feedback, unfortunately we don't have the full remapping logs to revisit the mapping issue with the rsids you mentioned , but we identified a bug in the code w.r.t multi-snp mapping for association which was the root cause for the variants mapping failure which you highlighted when you opened the initial ticket , this was not any random mapping failure.
I can only think of cluster failure of some jobs as we refactored the mapping pipeline to run in cluster in batches , we will continue to look out for any failures in our scheduled remapping , we will discuss to implement the solution w.r.t the Ensembl errors you mentioned.
OK. Thanks for loking into it. I leave this to you guys. Please let me know when only those rsIDs don't have mapping in the GWAS Catalog data, which are actually don't have a valid chromosome:basepair position.
@DSuveges thanks for this suggestion re patches, we are waiting to hear back from Ensembl, but this could be a good solution.
Have run the mapping pipeline with the fix suggested by Daniel , the failed rsIds should be mapped successfully after the Data release, @DSuveges had a query the filter mapping["coord_system"] == "chromosome") if fine for the next full remapping or this was just a workaround for the Ensembl errors?
This is a known issue on Ensembl that mappings on scaffolds is problematic ticket1, ticet2. This problem persist for at least 5 Ensembl releases and we don't know when the fix is rolled out. So applying this filter: mapping["coord_system"] == "chromosome"
is a workaround to avoid triggering the error, however I would consider it a permanent solution as having mappings for scaffolds bears very minor relevance.
@sajo-ebi pls check with Daniel SUveges if all is well with this.
This should be further taken care of by issue 1191
still waiting for issue 1191 to eb confirmed
Hi GWAS Team,
We are in the process of reviewing the logic of certain steps of our pipelines. For this, I need some help from you to better understand the data I could download from the GWAS Catalog website. These questions mostly revolve around how variants are being handled, mapped etc. Nothing major or nothing actionable on your side. The only reason I'm opening this ticket is to make the decisions I'll make trackable for the future.
CHR_ID
and theCHR_POS
is available? In these cases the variant identifiers in theSNPS
field are used to get mapping. Is that correct?SNPS
field, can we assume, that the same number of mapping is expected to be seen in theCHR_ID
andCHR_POS
columns separated by;
?CHR_ID
,CHR_POS
fields are null. eg. :Thank you so much for your answers! (I might came back with more :D)