bridgedb / create-bridgedb-metabolites

Create BridgeDb identity mapping files from HMDB, ChEBI, and Wikidata

Wrong InchiKeys #20

Closed DeniseSl22 closed 5 years ago

DeniseSl22 commented 5 years ago

One of my interns (@IreneHemel) is working on identifier mappings between metabolites. For the HMDB IDs listed in IEMBase (as biomarkers for diseases), she checked whether they have any mappings to ChEBI and, if so, which ones. She needs the ChEBI IDs to map correctly, since these are the ones used in the pathways she is working on. While checking these mappings, she found that some compounds have two InChIKeys, for example thymine (HMDB0000262):

[image]

RWQNBRDOKXIBIV-UHFFFAOYSA-N: InChIKey (correct) [image]

YQHWOOLBIREPRR-VZUYHUTRSA-N: InChIKey (wrong) [image]

However, one of the InChIKeys is completely wrong (I looked both of them up through ChemSpider, see above).

I've tried to track down where this mapping originates (HMDB, ChEBI, or Wikidata), but the wrong InChIKey is not present in any of them, so I'm wondering why it can even be queried through the BridgeDb webservice. According to the properties query, several metabolite mapping files are loaded (even one originating from 2013!): [image]

So, I think we need to make sure that:

  1. Only 1 ID mapping set is loaded per DataNode/Interaction type. @nunogit
  2. A check is built for conflicting InChIKeys (the part before the first '-' encodes the molecular skeleton and should at least be the same; otherwise the structure is different). @egonw
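
A minimal sketch of what the check in point 2 could look like, assuming the mappings are available as (identifier, InChIKey) pairs. The function names `first_block` and `find_conflicts` are hypothetical and not part of BridgeDb; the HMDB0000262 keys are the two reported above, and `EXAMPLE0001` is a made-up identifier for illustration only.

```python
from collections import defaultdict


def first_block(inchikey):
    """Return the part of an InChIKey before the first '-'."""
    return inchikey.strip().upper().split("-", 1)[0]


def find_conflicts(mappings):
    """Find identifiers whose InChIKeys disagree on the first block.

    mappings: iterable of (identifier, inchikey) pairs.
    Returns a dict mapping each conflicting identifier to its sorted InChIKeys.
    """
    keys_by_id = defaultdict(set)
    for identifier, inchikey in mappings:
        keys_by_id[identifier].add(inchikey.strip().upper())

    conflicts = {}
    for identifier, keys in keys_by_id.items():
        # More than one distinct first block means the keys describe
        # different skeletons, so at least one mapping must be wrong.
        if len({first_block(key) for key in keys}) > 1:
            conflicts[identifier] = sorted(keys)
    return conflicts


example = [
    ("HMDB0000262", "RWQNBRDOKXIBIV-UHFFFAOYSA-N"),
    ("HMDB0000262", "YQHWOOLBIREPRR-VZUYHUTRSA-N"),
    # Hypothetical identifier with a single key: should not be flagged.
    ("EXAMPLE0001", "RWQNBRDOKXIBIV-UHFFFAOYSA-N"),
]

for identifier, keys in find_conflicts(example).items():
    print(identifier, keys)
```

Two keys that share the first block but differ in the second (for example stereoisomers) are not flagged by this sketch; whether those should count as conflicts is a separate decision.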

@IreneHemel has several other examples, if needed. Some more are below:

HMDB0000300: 2 InChIKeys; PRFVPHBJWNBZBM-GGCSAXROSA-N is not retrievable through Scholia (= Wikidata), ChemSpider, or ChEBI.

HMDB0000273: 2 InChIKeys; KHUMAHPJNTVTEQ-DXQCBLCSSA-N is not retrievable through Scholia (= Wikidata), ChemSpider, or ChEBI.

egonw commented 5 years ago

This is a good use case to have more provenance of the history of that mapping.

@DeniseSl22, can you ask Irene to verify this locally with only the latest ID mapping file? Having more than one loaded is bound to cause issues, and may be the only cause here. At this moment we do not know whether this is a problem in the code, in the data, or in the webservice.

DeniseSl22 commented 5 years ago

Yes, I'll check together with her; we could use the R script tutorial you added to the ELIXIR TeSS portal.

DeniseSl22 commented 5 years ago

@IreneHemel : https://tess.elixir-europe.org/materials/bridgedbr-tutorial#home

IreneHemel commented 5 years ago

I checked all HMDB IDs that reported a second InChIKey in the webservice, using the metabolites_20190509 file. For each of them only one InChIKey is reported back, the one that is also stated on the HMDB and ChEBI websites.

DeniseSl22 commented 5 years ago

Okay, thank you for checking, @IreneHemel! Then the webservice is probably finding old InChIKeys from old mapping files that are still loaded. @nunogit, what's the status of removing old versions of the metabolomics mapping files? :)

DeniseSl22 commented 5 years ago

Okay, @nunogit removed the old mapping files, and here is the result: [image]

I checked the three examples above, and they now all give 1 InChIKey. Also, the "metadata" query indicates that only one metabolite.bridge file is loaded :D.