MassBank / RMassBank

Playground for experiments on the official http://bioconductor.org/packages/devel/bioc/html/RMassBank.html

SMILES / InChI(Key)+identifier inconsistencies in RMassBank-generated records #331

Open schymane opened 1 year ago

schymane commented 1 year ago

Hi @meowcat @meier-rene (CC @anjuraj15 and @PaulThiessen)

We had a bizarre case where existing (3-year-old) ENTACT records failed validation when we updated only unrelated (textual) information. It turns out the SMILES contained stereochemistry information, but the InChI, InChIKey and all related identifiers didn't, so the records failed @meier-rene's updated validation suite.

Here are the SMILES in question:

ClC1=CC=C(CN2CCS\C2=N/C#N)C=N1
CN(C)C1=CC=C(C=C1)\N=N\C1=C(C=CC=C1)C(O)=O
OC(=O)C1=CC(=CC=C1O)\N=N\C1=CC=C(C=C1)S(=O)(=O)NC1=NC=CC=C1
CCCOC\C(=N/C1=C(C=C(Cl)C=C1)C(F)(F)F)N1C=CN=C1
NC1=CC=C(C=C1)\N=N\C1=CC=CC=C1

It turns out that they standardize to the non-stereochemistry form in the PubChem standardizer, and presumably also in Cactvs, which may explain how everything after the InChIKey ended up in the "stereochemistry-neutral" form. The only way we could get these records to pass validation was to adjust to the non-stereo SMILES, rather than having to update all InChI and identifier fields. See the example before and after the change (the "after" file has _ES at the end of its name) and the log.

Not sure if we have to build a check into RMassBank to catch this. @meowcat, have you ever seen any cases like this? @meier-rene, are there any other existing records that have this issue?
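A check of the kind asked about here could be sketched roughly as follows. This is only a string-level heuristic in Python, not RMassBank code and not the MassBank validator: it flags a record when the SMILES carries stereo markers (`/`, `\`, `@`) but the InChI has no stereo layers (`/b`, `/t`, `/m`, `/s`), or the other way round. The function names are hypothetical, and the but-2-ene strings are an illustrative example rather than one of the records above. A real check would compare structures with a cheminformatics toolkit.

```python
def smiles_has_stereo(smiles: str) -> bool:
    # Directional bonds (/ and \) mark cis/trans double bonds;
    # @ marks chirality on a bracket atom.
    return any(ch in smiles for ch in "/\\@")


def inchi_has_stereo(inchi: str) -> bool:
    # Layers are separated by "/". The first two fields are the version
    # prefix and the formula, so only the later layers are inspected.
    layers = inchi.split("/")[2:]
    # b = double-bond stereo; t, m, s = tetrahedral stereo layers.
    return any(layer.startswith(("b", "t", "m", "s")) for layer in layers)


def stereo_mismatch(smiles: str, inchi: str) -> bool:
    # True when exactly one of the two representations carries stereo.
    return smiles_has_stereo(smiles) != inchi_has_stereo(inchi)


# Illustrative example (assumption, not taken from the records above):
# (E)-but-2-ene written with and without double-bond stereo.
stereo_smiles = "C/C=C/C"
flat_inchi = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3"
stereo_inchi = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"

print(stereo_mismatch(stereo_smiles, flat_inchi))    # stereo SMILES, flat InChI
print(stereo_mismatch(stereo_smiles, stereo_inchi))  # both carry stereo
```

Because it only looks for the presence of markers, this sketch cannot tell whether the two representations describe the same isomer; it only catches the "stereo in one field, none in the other" pattern described above.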

log.txt MSBNK-LCSB-LU005205.txt MSBNK-LCSB-LU005205_ES.txt

meier-rene commented 1 year ago

Hi @schymane, we have several thousand of these mismatches in our data at MassBank. It's not trivial to fix and requires manual work in most cases. That's why I silently accept this error in existing records but try to prevent new records with this problem from entering our collection. There is a whitelist that lets existing records pass validation if they have this particular issue.
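The whitelist behaviour described above can be pictured with a minimal Python sketch. It is an assumption about the general shape of such a rule, not the actual MassBank validation code; the function name, the status strings and the accession IDs are all hypothetical.

```python
from typing import Set


def validate_stereo(accession: str, has_mismatch: bool, whitelist: Set[str]) -> str:
    # Records without a SMILES/InChI stereo mismatch always pass.
    if not has_mismatch:
        return "pass"
    # Known legacy records are tolerated so that they keep validating.
    if accession in whitelist:
        return "pass (whitelisted)"
    # Anything else, e.g. a new submission, is rejected.
    return "fail"


# Hypothetical accession IDs, used only for illustration.
legacy_whitelist = {"MSBNK-EXAMPLE-000001"}

print(validate_stereo("MSBNK-EXAMPLE-000001", True, legacy_whitelist))
print(validate_stereo("MSBNK-EXAMPLE-000002", True, legacy_whitelist))
print(validate_stereo("MSBNK-EXAMPLE-000003", False, legacy_whitelist))
```

Keeping the whitelist as an explicit set of accessions means a record drops off the list as soon as it is fixed, and new records can never enter with the mismatch.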

Your new contribution is a good opportunity to solve it for the LCSB data. For the LCSB contributions it's really just the 5 compounds you listed. It seems to be related to cis/trans imines or diimines, which in general are not stable and undergo slow interconversion. From my point of view it's questionable whether these spectra should be annotated with one particular isomer. In general I would support removing the cis/trans information from these records. I will look into this in detail.

schymane commented 1 year ago

OK great, this would explain it.

For the LCSB records we removed the stereochemistry information in that commit cross-referenced above, which would agree with both your general reasoning and the PubChem / CACTVS behaviour. Seems like the right solution overall for now.

Let me know if we should take a look at this for the other mismatches in MassBank. This will also affect which records end up being annotated with the spectra in PubChem ...

meowcat commented 1 year ago

Hi, in fact we have a similar issue for new records. The issue is broader anyway, since I am never quite sure what stereochemistry to include in records. Frequently I defaulted to wiping out stereochemistry at the SMILES level and not claiming that the spectrum is related to any specific stereoisomer. I think this is the right thing to do for molecules with one stereocenter; but diastereomers (especially natural products with many stereocenters) might be distinguishable by LC and perhaps even in some cases by MS2.
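As a rough illustration of what "wiping out stereochemistry at the SMILES level" means, here is a string-level Python sketch. It is an assumption for illustration only, not the RMassBank implementation: it drops the directional bond symbols `/` and `\` and the `@`/`@@` chirality tags, and does not handle extended chirality classes such as `@TH1` or re-canonicalize the result. In practice a cheminformatics toolkit would be the safer route. The function name is hypothetical.

```python
import re


def strip_stereo(smiles: str) -> str:
    # Drop the directional bond symbols that encode cis/trans geometry.
    flat = smiles.replace("/", "").replace("\\", "")
    # Drop @ / @@ chirality tags on bracket atoms, e.g. [C@@H] -> [CH].
    return re.sub(r"@+", "", flat)


# One of the SMILES from the original report (azo double bond with E geometry).
print(strip_stereo(r"NC1=CC=C(C=C1)\N=N\C1=CC=CC=C1"))
# A tetrahedral stereocenter (alanine-like example, for illustration only).
print(strip_stereo("N[C@@H](C)C(=O)O"))
```

Note that the second result keeps a bracket atom (`[CH]`), which is still valid SMILES but not the form a toolkit would normally write out.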