biothings / mychem.info

MyChem.info: A BioThings API for chemical/drug annotations
http://mychem.info
Apache License 2.0

Examples of non-ideal merging of records #162

Open andrewsu opened 1 year ago

andrewsu commented 1 year ago

The merging of multiple records in source databases into a single record in mychem.info is a challenging process, and one where I doubt we'll ever get it perfectly "right". Having said that, I noticed an example where the current merging is not ideal, and so I'm creating this issue to document this example and others like it.

This is the API call that illustrates this example: https://mychem.info/v1/chem/GVJHHUAWPYXKBD-IEOSBIPESA-N?fields=chembl.molecule_chembl_id,chembl.max_phase,chembl.pref_name,drugcentral.xrefs.chembl_id

{
  "_id": "GVJHHUAWPYXKBD-IEOSBIPESA-N",
  "_version": 1,
  "chembl": {
    "_license": "http://bit.ly/2KAUCAm",
    "max_phase": 0,
    "molecule_chembl_id": "CHEMBL47",
    "pref_name": "VITAMIN E"
  },
  "drugcentral": [
    {
      "_license": "http://bit.ly/2SeEhUy",
      "xrefs": {
        "chembl_id": [
          "CHEMBL3989727",
          "CHEMBL2108106"
        ]
      }
    },
    {
      "_license": "http://bit.ly/2SeEhUy",
      "xrefs": {
        "chembl_id": [
          "CHEMBL3989727",
          "CHEMBL47"
        ]
      }
    }
  ]
}

mychem maps this record to a single ChEMBL ID, CHEMBL47, but DrugCentral also maps it to two additional IDs: CHEMBL3989727 and CHEMBL2108106. All of these IDs are some variant of Vitamin E. One reason this is confusing is that CHEMBL47 reports "max_phase": 0, whereas the other two are "max_phase": 4 (what one would expect for Vitamin E).
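A mismatch like this can be checked mechanically. Below is a minimal sketch, assuming the record shape shown in the response above. The helpers `as_list` and `unmerged_chembl_ids` are hypothetical names for illustration, not part of any mychem.info client, and the record is a trimmed copy pasted inline rather than fetched from the API.

```python
def as_list(value):
    """Normalize a field that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def unmerged_chembl_ids(record):
    """Return ChEMBL IDs cited in drugcentral.xrefs but absent from the merged chembl field."""
    merged = {
        entry["molecule_chembl_id"]
        for entry in as_list(record.get("chembl"))
        if "molecule_chembl_id" in entry
    }
    cited = set()
    for entry in as_list(record.get("drugcentral")):
        cited.update(as_list(entry.get("xrefs", {}).get("chembl_id")))
    return sorted(cited - merged)


# Trimmed copy of the response above (license fields omitted).
record = {
    "_id": "GVJHHUAWPYXKBD-IEOSBIPESA-N",
    "chembl": {"max_phase": 0, "molecule_chembl_id": "CHEMBL47", "pref_name": "VITAMIN E"},
    "drugcentral": [
        {"xrefs": {"chembl_id": ["CHEMBL3989727", "CHEMBL2108106"]}},
        {"xrefs": {"chembl_id": ["CHEMBL3989727", "CHEMBL47"]}},
    ],
}

print(unmerged_chembl_ids(record))  # ['CHEMBL2108106', 'CHEMBL3989727']
```

Running a check like this over many records would be one way to collect further examples for this issue.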

newgene commented 1 year ago

I had a quick look at this particular case. Most of these drugcentral documents (4097 out of 5399) do include an inchikey field, as in this query:

https://mychem.info/v1/chem/GVJHHUAWPYXKBD-IEOSBIPESA-N?fields=drugcentral.xrefs,drugcentral.structures.inchikey

{
    "_id": "GVJHHUAWPYXKBD-IEOSBIPESA-N",
    "_version": 1,
    "drugcentral": [
        {
            "_license": "http://bit.ly/2SeEhUy",
            "structures": {
                "inchikey": "GVJHHUAWPYXKBD-IEOSBIPESA-N"
            },
            "xrefs": {
                "chebi": "CHEBI:18145",
                "chembl_id": [
                    "CHEMBL3989727",
                    "CHEMBL2108106"
                ],

In this case, the merging step is based on the inchikey and skips the rest of the xrefs IDs. Whether we should change this behavior (set a priority list of ID types, then stop and merge once we find one) probably depends on how much we trust the drugcentral.xrefs.
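To make the priority-list idea concrete, here is a rough sketch. It is not the actual mychem.info merge code: the `ID_PRIORITY` order, the `pick_merge_key` name, and the choice to take the first element of a list-valued xref are all assumptions for illustration.

```python
# Hypothetical priority order; which order to use is the open question here.
ID_PRIORITY = [
    ("structures", "inchikey"),
    ("xrefs", "chembl_id"),
    ("xrefs", "chebi"),
]


def pick_merge_key(doc, priority=ID_PRIORITY):
    """Return (id_type, value) for the first ID type present in doc, or None.

    Stops at the first match, so lower-priority IDs are never consulted.
    """
    for section, field in priority:
        value = doc.get(section, {}).get(field)
        if value:
            # Assumption: when an xref holds several IDs, take the first one.
            first = value[0] if isinstance(value, list) else value
            return field, first
    return None


with_inchikey = {
    "structures": {"inchikey": "GVJHHUAWPYXKBD-IEOSBIPESA-N"},
    "xrefs": {"chebi": "CHEBI:18145", "chembl_id": ["CHEMBL3989727", "CHEMBL2108106"]},
}
without_inchikey = {"xrefs": {"chembl_id": ["CHEMBL3989727", "CHEMBL2108106"]}}

print(pick_merge_key(with_inchikey))     # ('inchikey', 'GVJHHUAWPYXKBD-IEOSBIPESA-N')
print(pick_merge_key(without_inchikey))  # ('chembl_id', 'CHEMBL3989727')
```

With inchikey first in the list, this reproduces the behavior described above for documents that carry an inchikey. The xrefs only come into play for the remaining documents, which is where the trust question matters.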