import glob
import os
import sys
from math import isclose

from rdkit.Chem import MolFromSmiles
from rdkit.Chem.rdMolDescriptors import CalcExactMolWt, CalcMolFormula

MF_PATTERN = "CH$FORMULA:"
EXACT_MASS_PATTERN = "CH$EXACT_MASS:"
SMILES_PATTERN = "CH$SMILES:"

if __name__ == "__main__":
    # Directory containing the RIKEN spectra files
    idir = sys.argv[1]

    # Iterate over all ms-files in the directory
    for msfn in sorted(glob.glob(os.path.join(idir, "*.txt"))):
        # Reset per file so values from a previous record cannot leak in
        mf_file = exact_mass_file = smiles_file = None

        with open(msfn, "r") as msfile:
            # Read information from file: Molecular Formula, Exact Mass and SMILES
            for line in msfile:
                line = line.strip()
                # Extract molecular formula
                if line.startswith(MF_PATTERN):
                    mf_file = line[len(MF_PATTERN):].strip()
                # Extract exact mass
                elif line.startswith(EXACT_MASS_PATTERN):
                    exact_mass_file = float(line[len(EXACT_MASS_PATTERN):])
                # Extract SMILES
                elif line.startswith(SMILES_PATTERN):
                    smiles_file = line[len(SMILES_PATTERN):].strip()

        # Skip records where one of the three fields is missing
        if mf_file is None or exact_mass_file is None or smiles_file is None:
            continue

        # We skip molecules that are intrinsically charged, as those might not be correctly handled by RDKit
        if mf_file.endswith("+"):
            continue

        # Calculate Molecular Formula and Exact Mass from the given SMILES and compare
        mol = MolFromSmiles(smiles_file)
        if mol is None:
            # RDKit could not parse the SMILES
            continue
        mf_smi = CalcMolFormula(mol)
        exact_mass_smi = CalcExactMolWt(mol)

        if mf_smi != mf_file:
            print("%s: MF (ms-file vs. rdkit) '%s' - '%s'" % (os.path.basename(msfn), mf_file, mf_smi))

        if not isclose(exact_mass_file, exact_mass_smi, abs_tol=1e-3):
            print("%s: Exact Mass (ms-file vs. rdkit) %f - %f = %f" % (os.path.basename(msfn), exact_mass_file,
                                                                       exact_mass_smi, exact_mass_file - exact_mass_smi))
Hi,

I stumbled into an issue with the RIKEN/PR3* spectra. It seems that the exact mass is not correctly calculated. Let's look at the following example: PR302491.txt

If I use an online tool to calculate the exact mass from the molecular formula, I get 596.174125 (diff ~0.35 to the value reported in the file).

I also calculated the exact mass using RDKit directly from the SMILES. I get 596.1741203239999 (diff ~0.35).

When I check the compound in PubChem (searched by InChIKey), I get 596.17412.

Actually, the molecular weight in PubChem is pretty close to the exact mass reported in the spectra file: 596.5 vs. 596.538.

I attached a Python script that compares the reported exact mass with the one calculated using RDKit. I ran it for the RIKEN spectra files with an absolute tolerance of 0.001. Only the PR3*.txt files seem to be affected.

I believe the files need curation.

Best regards,
Eric
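
P.S. A minimal sketch of why an exact (monoisotopic) mass and an average molecular weight can differ by a few tenths of a dalton for a compound of this size. It uses glucose as a stand-in, not the compound from PR302491.txt, and the helper names (`parse_formula`, `mass`, `closer_to`) and the small mass tables are assumptions made for illustration; they are not part of RDKit or of the spectra files.

```python
import re

# Monoisotopic (most abundant isotope) and standard average atomic masses in Da.
# Only a few common elements are listed; this is an illustrative table.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Parse a simple formula such as 'C6H12O6' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def mass(formula, table):
    """Sum the atomic masses from the given table over all atoms in the formula."""
    return sum(table[el] * n for el, n in parse_formula(formula).items())


def closer_to(reported, monoisotopic, average):
    """Return which reference value the reported mass is closer to."""
    if abs(reported - monoisotopic) <= abs(reported - average):
        return "monoisotopic"
    return "average"


if __name__ == "__main__":
    mono = mass("C6H12O6", MONOISOTOPIC)
    avg = mass("C6H12O6", AVERAGE)
    print("glucose: monoisotopic %.4f, average %.3f" % (mono, avg))
    # Values quoted in this post for PR302491.txt (reported, RDKit, PubChem MW)
    print(closer_to(596.538, 596.1741203, 596.5))
```

With the numbers quoted above, the reported value sits much closer to the PubChem molecular weight than to the calculated exact mass, which is consistent with an average mass having been written into the exact-mass field. This is only a plausibility check, not proof of how the files were generated.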