MassBank / MassBank-data

Official repository of open data MassBank records

Incorrect exact mass for RIKEN/PR3* spectra #125

Open bachi55 opened 4 years ago

bachi55 commented 4 years ago

Hei,

I stumbled into an issue with the RIKEN/PR3* spectra. It seems that the exact mass is not calculated correctly. Let's look at the following example:

PR302491.txt

CH$FORMULA: C27H32O15
CH$EXACT_MASS: 596.538
CH$SMILES: C[C@@H]1O[C@@H](OC[C@H]2O[C@@H](OC3=CC(O)=C4C(=O)C[C@H](OC4=C3)C3=CC(O)=C(O)C=C3)[C@H](O)[C@@H](O)[C@@H]2O)[C@H](O)[C@H](O)[C@H]1O

If I use an online tool to calculate the exact mass from the molecular formula, I get: 596.174125 (diff ~0.36).

I also calculated the exact mass using RDKit directly from the SMILES. I get: 596.1741203239999 (diff ~0.36).

When I check the compound in PubChem (searched by InChIKey), I get: 596.17412.

Actually, the molecular weight in PubChem is pretty close to the exact mass reported in the spectra file: 596.5 vs. 596.538.
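
To illustrate the difference between the two quantities, here is a minimal sketch that computes both from the molecular formula alone. It is not part of the attached script: the helper names (`parse_formula`, `mass`) are hypothetical, the mass tables cover only C, H and O, and the average values are the abridged standard atomic weights, which is an assumption about how the reported value was produced.

```python
import re

# Monoisotopic masses (most abundant isotope) and abridged standard atomic
# weights. Only the elements needed for this example are listed.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "O": 15.99491461957}
AVERAGE = {"C": 12.011, "H": 1.008, "O": 15.999}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C27H32O15'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def mass(formula, table):
    """Sum the per-element masses from the given table."""
    return sum(table[el] * n for el, n in parse_formula(formula).items())


exact_mass = mass("C27H32O15", MONOISOTOPIC)
molar_mass = mass("C27H32O15", AVERAGE)
print("exact (monoisotopic) mass: %.5f" % exact_mass)
print("molar (average) mass:      %.3f" % molar_mass)
```

With these tables the monoisotopic sum comes out at about 596.174 and the average sum at about 596.538, which matches the value stored in CH$EXACT_MASS.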

I attached (see below) a Python script that compares the reported exact mass with the one calculated using RDKit. I ran it on the RIKEN spectra files with an absolute tolerance of 0.001. Only the PR3*.txt files seem to be affected.

I believe the files need curation.

Best regards, Eric

import sys
import os
import glob

from math import isclose

from rdkit.Chem import MolFromSmiles
from rdkit.Chem.rdMolDescriptors import CalcExactMolWt, CalcMolFormula

MF_PATTERN = "CH$FORMULA:"
EXACT_MASS_PATTERN = "CH$EXACT_MASS:"
SMILES_PATTERN = "CH$SMILES:"

if __name__ == "__main__":
    # Directory containing the RIKEN spectra files
    idir = sys.argv[1]

    # Iterate over all ms-files in the directory
    for msfn in sorted(glob.glob(os.path.join(idir, "*.txt"))):
        # Reset per file so that values from a previous record are never reused
        mf_file = exact_mass_file = smiles_file = None

        with open(msfn, "r") as msfile:
            # Read information from file: Molecular Formula, Exact Mass and SMILES
            for line in msfile:
                line = line.strip()
                # Extract molecular formula
                if line.startswith(MF_PATTERN):
                    mf_file = line[(len(MF_PATTERN) + 1):]
                # Extract exact mass
                elif line.startswith(EXACT_MASS_PATTERN):
                    exact_mass_file = float(line[(len(EXACT_MASS_PATTERN) + 1):])
                # Extract SMILES
                elif line.startswith(SMILES_PATTERN):
                    smiles_file = line[(len(SMILES_PATTERN) + 1):]

        # Skip files that lack one of the three fields
        if mf_file is None or exact_mass_file is None or smiles_file is None:
            continue

        # We skip molecules that are intrinsically charged, as those might not be correctly handled by rdkit
        if mf_file.endswith("+"):
            continue

        # Calculate Molecular Formula and Exact Mass from the given SMILES and compare
        mol = MolFromSmiles(smiles_file)
        if mol is None:
            # RDKit could not parse the SMILES
            continue
        mf_smi = CalcMolFormula(mol)
        exact_mass_smi = CalcExactMolWt(mol)

        if mf_smi != mf_file:
            print("%s: MF (ms-file vs. rdkit) '%s' - '%s'" % (os.path.basename(msfn), mf_file, mf_smi))

        if not isclose(exact_mass_file, exact_mass_smi, abs_tol=1e-3):
            print("%s: Exact Mass (ms-file vs. rdkit) %f - %f = %f" % (os.path.basename(msfn), exact_mass_file,
                                                                       exact_mass_smi, exact_mass_file - exact_mass_smi))

tsufz commented 4 years ago

Yes, this is true. The given mass is the molar mass, not the exact mass. @meier-rene, this is an issue that we should check using the validator (and fix it?).
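
As a rough idea of what such a check-and-fix step could look like, here is a minimal sketch. It is not the MassBank validator: `fix_exact_mass` and `monoisotopic_mass` are hypothetical helpers, the mass table covers only C, H and O, charged formulas are not handled, and the five-decimal output format is an assumption.

```python
import re

# Monoisotopic masses for the elements in this example only. Assumption: a
# real tool would rely on a complete isotope table (e.g. the one in RDKit).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "O": 15.99491461957}


def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple, uncharged formula such as 'C27H32O15'."""
    return sum(
        MONOISOTOPIC[element] * int(count or 1)
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    )


def fix_exact_mass(record_text, abs_tol=1e-3):
    """Return the record with CH$EXACT_MASS recomputed from CH$FORMULA.

    The line is rewritten only if it deviates by more than abs_tol.
    """
    lines = record_text.splitlines()
    formula = None
    for line in lines:
        if line.startswith("CH$FORMULA:"):
            formula = line[len("CH$FORMULA:"):].strip()
    if formula is None:
        return record_text  # nothing to compare against

    expected = monoisotopic_mass(formula)
    fixed = []
    for line in lines:
        if line.startswith("CH$EXACT_MASS:"):
            reported = float(line[len("CH$EXACT_MASS:"):])
            if abs(reported - expected) > abs_tol:
                line = "CH$EXACT_MASS: %.5f" % expected
        fixed.append(line)
    return "\n".join(fixed)


record = "CH$FORMULA: C27H32O15\nCH$EXACT_MASS: 596.538"
fixed_record = fix_exact_mass(record)
print(fixed_record)
```

Records whose value already agrees within the tolerance are returned unchanged, so the function could be run over all files and only the affected PR3* records would be rewritten.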