MassBank / MassBank-data

Official repository of open data MassBank records

Some Pubchem CID numbers in MassBank were deactivated in Pubchem Compound #186

Open dlswee opened 2 years ago

dlswee commented 2 years ago

Some PubChem Compound records were modified and made "Non-live". For example, phenylthiourea (CID 7682) was replaced with a different tautomer of phenylthiourea (CID 676454). These changes happened fairly recently, and there is no cross-referencing index of old to new CIDs that I am aware of. As a result, some of the PubChem CID entries in MassBank are no longer "live" even though the CIDs were entered correctly.

Normally you can look up a CID by typing it as an integer (e.g. "7682") into the PubChem search box. If no CID record is returned, typing "CID 7682" will bring up the Non-live record with a hyperlink to the CID of the current "Preferred Compound".

schymane commented 2 years ago

Thanks - this is something we are aware of but it is not necessarily trivial to overwrite old CIDs in some MassBank records due to licensing issues.

@meier-rene this is potentially something we could take care of at validation and/or run occasionally? It's not clear (yet) how many records are affected (whether 10s or 100s), nor whether this affects records that we can't necessarily edit. Some CIDs, e.g. guanylurea, actually migrate back and forth between CIDs occasionally ...

We have various functions that can help distinguish current live CID from deprecated CIDs. https://github.com/schymane/RChemMass/blob/master/R/ChemicalCuration.R#L1516 (maybe webchem does this better by now).

@meier-rene is it possible to get an overview of how many records are affected? Would cross-linking help (so the CID directs to the current CID) or would updating the CIDs our side be better?

dlswee commented 2 years ago

Hi Emma,

Thank you for checking into this. I did not realize the issue was a known problem.

I had noticed a few non-live CIDs in the past, but it looks like a significant number of these changes were made in Jan 2019. Fortunately the hyperlinks from MassBank to PubChem still work and bring up the non-live CIDs with references to the new ones. (I usually type the numbers in!)

To check whether a large number of CIDs are still active, you can use the PubChem Identifier Exchange Service and choose to convert CIDs into CIDs, with the output in two columns. Non-live CIDs will return a blank in the second column. I asked the PubChem folks last week for a cross-index of non-live CIDs to active CIDs and apparently there is none.

Best,

Dan

schymane commented 2 years ago

Thanks Dan - an alternative way of tackling it would be to map CIDs using the InChIKey in the records (which will return the best current CID)... and see where they differ. The InChIKeys themselves should not have changed. It depends what you are trying to do.
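A rough sketch of the InChIKey comparison described above, assuming the PUG REST `compound/inchikey/<key>/cids/JSON` lookup and Python's standard library only. The function names and the injectable `fetch` argument are illustrative, not part of any existing MassBank tooling, and an InChIKey can in principle map to more than one CID, so the check is "is the record's CID among those returned".

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, List

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def inchikey_url(inchikey: str) -> str:
    """Build the PUG REST URL that lists the CIDs for an InChIKey."""
    key = urllib.parse.quote(inchikey)
    return f"{PUG_REST}/compound/inchikey/{key}/cids/JSON"


def parse_cids(payload: str) -> List[int]:
    """Extract the CID list from a PUG REST IdentifierList response."""
    data = json.loads(payload)
    return [int(c) for c in data.get("IdentifierList", {}).get("CID", [])]


def http_get(url: str) -> str:
    """Plain HTTP GET; raises urllib.error.HTTPError if PubChem finds nothing."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def cid_matches(record_cid: int, inchikey: str,
                fetch: Callable[[str], str] = http_get) -> bool:
    """True if the CID stored in a record is among those PubChem returns
    for the record's InChIKey (hypothetical helper for flagging mismatches)."""
    return record_cid in parse_cids(fetch(inchikey_url(inchikey)))
```

Passing a stub as `fetch` lets the comparison logic be exercised offline, which may matter given how slow repeated PUG REST calls are.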

Yes, in 2019 PubChem switched the software behind the scenes which resulted in this shift of CIDs - so it will potentially affect a fraction of the records contributed before then (which is most of our records); and maybe some after. But due to the construct of MassBank, and since some of them shift periodically still, the fix is not trivial ... but we can discuss amongst us what to do. We're also in contact with PubChem as necessary.

@meier-rene we could also map back current CIDs via our deposition files ... but the API is likely the easier option.

Thanks, Emma

meier-rene commented 2 years ago

Thank you Dan for reporting. Emma already gave some information about this issue. I know that Pubchem CIDs in MassBank are not correct for several reasons.

I could now write a lengthy paragraph explaining all the complications we have with external database identifiers (which I actually already did below), but I would rather explain how I see external database identifiers in MassBank. I consider them a nice-to-have way of linking out to external resources. They are not always stable and not always correctly deposited. I see no chance of getting them always correct. The main identifiers for compounds in MassBank are the InChI and the SMILES.

Lengthy paragraph

What could be a solution? First, we need to prevent the acceptance of new contributions with errors. That's not trivial to integrate into our standard validation procedure due to runtime problems. Communication with PUG REST is too slow to have this as a routine procedure. So I need to set up a mechanism which distinguishes between existing and new records. That's certainly possible but not in place. I think I should give more priority to this one.

Second, we have to fix our existing records. The code to query PUG REST and set the CID in the record files exists. But I'm hesitant to apply this blindly to all records. We have several identifiers for the chemical structure in the record files and there are sometimes mistakes/mismatches. It helps to have the complete, unchanged information to work out what is correct and what is a mistake. Conclusion: flagging errors is easy, but fixing remains a partly manual thing. That has low priority on my task list.

schymane commented 2 years ago

Some additional tips from Paul from PubChem:

We have a file on the FTP site that has a mapping from old to new CID:

https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Preferred.gz

And that includes this example: 7682 676454
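For anyone wanting to use that file locally, here is a minimal sketch of loading it into a lookup table. It assumes each line holds two whitespace-separated integers (old CID, then preferred CID), as in the example above; `load_preferred_map` is an illustrative name, not an existing function.

```python
import gzip
from typing import Dict


def load_preferred_map(path: str) -> Dict[int, int]:
    """Read a local copy of CID-Preferred.gz into {old_cid: preferred_cid}.

    Assumes two whitespace-separated integer columns per line; lines that
    do not fit that shape are skipped rather than guessed at.
    """
    mapping: Dict[int, int] = {}
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2 or not all(p.isdigit() for p in parts):
                continue
            old_cid, new_cid = (int(p) for p in parts)
            mapping[old_cid] = new_cid
    return mapping
```

With the file downloaded, `load_preferred_map("CID-Preferred.gz").get(7682)` should give the preferred CID for the phenylthiourea example, provided the file still lists it.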

You can also get this with PUG REST:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/7682/cids/JSON?cids_type=preferred

{
  "IdentifierList": {
    "CID": [
      676454
    ]
  }
}
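
The same lookup can be scripted. This is only a sketch using the URL pattern and response shape shown above; the helper names and the injectable `fetch` argument are assumptions made for illustration.

```python
import json
import urllib.request
from typing import Callable, Optional

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def preferred_url(cid: int) -> str:
    """Build the PUG REST URL asking for the preferred CID of `cid`."""
    return f"{PUG_REST}/compound/cid/{cid}/cids/JSON?cids_type=preferred"


def parse_preferred(payload: str) -> Optional[int]:
    """Return the first CID in the IdentifierList, or None if it is empty."""
    cids = json.loads(payload).get("IdentifierList", {}).get("CID", [])
    return int(cids[0]) if cids else None


def http_get(url: str) -> str:
    """Plain HTTP GET against PubChem (one request per CID, so slow in bulk)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def preferred_cid(cid: int,
                  fetch: Callable[[str], str] = http_get) -> Optional[int]:
    """Look up the preferred CID for a possibly non-live CID."""
    return parse_preferred(fetch(preferred_url(cid)))
```

A record's CID can then be flagged whenever `preferred_cid(cid)` differs from `cid`. For checking many CIDs at once, the FTP mapping file is likely the better route.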

meier-rene commented 2 years ago

Thanks Emma, that's great. With this information I can easily update old CIDs to new CIDs.

meier-rene commented 2 years ago

I'm working on this atm. Here are some numbers which I extracted, thanks to the great list which Emma pointed to...

There are 67 CIDs referenced in MassBank which are "non-live". There are 672 records affected by this problem.
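
Counts like these could be reproduced along the following lines. This is a hedged sketch, not the script actually used: it assumes the CID appears in each record on a line of the form `CH$LINK: PUBCHEM CID:<number>`, that records are supplied as a dict of accession to record text, and that `preferred` is an old-to-new CID mapping such as one built from CID-Preferred.gz. All names here are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed record line format: "CH$LINK: PUBCHEM CID:7682"
CID_LINE = re.compile(r"^CH\$LINK:\s+PUBCHEM\s+CID:(\d+)", re.MULTILINE)


def find_non_live(records: Dict[str, str],
                  preferred: Dict[int, int]) -> Dict[int, List[str]]:
    """Map each non-live CID to the accessions of the records citing it."""
    hits: Dict[int, List[str]] = defaultdict(list)
    for accession, text in records.items():
        for match in CID_LINE.finditer(text):
            cid = int(match.group(1))
            # A CID counts as non-live if the mapping points it elsewhere.
            if cid in preferred and preferred[cid] != cid:
                hits[cid].append(accession)
    return dict(hits)


def summarize(hits: Dict[int, List[str]]) -> Dict[str, int]:
    """Count distinct non-live CIDs and distinct affected records."""
    affected = {acc for accessions in hits.values() for acc in accessions}
    return {"non_live_cids": len(hits), "affected_records": len(affected)}
```

This only flags records. As discussed earlier in the thread, deciding which identifier in a mismatched record is actually wrong still needs a manual look.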