MassBank / MassBank-data

Official repository of open data MassBank records

Extended SMILES and mismatching identifiers in MSBNK-EPA-ENTACT_AGILENT001211 to MSBNK-EPA-ENTACT_AGILENT001216 #251

Closed schymane closed 1 week ago

schymane commented 7 months ago

Hi @alexchao32

Thanks for your new records in https://github.com/MassBank/MassBank-data/pull/244 :-)

We've just crunched the data for PubChem and one SMILES failed the deposition:

CC[N+](CC1=CC(=CC=C1)S(O)(=O)=O)=C1C=CC(C=C1)=C(C1C=CC=CC=1)C1C=CC(=CC=1)N(CC1C=C(C=CC=1)S(O)(=O)=O)CC |c:14,t:21|

Is there any reason for using the extended SMILES format (the |c:14,t:21| at the end)?

I found entries matching, or closely matching, the InChIKey and SMILES in both PubChem and CompTox, but see no evidence of an extended SMILES anywhere (I also asked @ChemConnector if he knew more). The DTXSID (DTXSID3020671) and the PubChem CID in the record (20803) actually point to a salt species.

The SMILES (without the extension at the end) and the InChIKey in the record would point to CID 20804. We fixed the deposition to use the InChIKey and ignore the SMILES.

Ideally, we'd need to clean up the identifiers in this record - do we need the extended SMILES or can the |c:14,t:21| be trimmed? Should we update DTXSID and PubChem CID to match the InChIKey? The molecular formula and mass match the InChIKey. The parent DTXSID5048001, however, has a different charge, and they do not seem to have an entry matching the InChIKey exactly.

(@meier-rene do we need to add new checks to the validation?)
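As an illustration of what such a check could look like, here is a minimal sketch that detects and trims a trailing extension block from a SMILES string. It is not the MassBank validator. The function names are hypothetical, and it assumes the extension is always a single trailing `|...|` block separated from the core SMILES by whitespace, as in the failing record.

```python
import re

# Assumption: the extended part is one trailing "|...|" block separated from
# the core SMILES by whitespace, as in the record that failed deposition.
_CX_SUFFIX = re.compile(r"\s+\|[^|]*\|\s*$")


def has_cx_extension(smiles: str) -> bool:
    """Return True if the string ends with an extension block."""
    return _CX_SUFFIX.search(smiles) is not None


def strip_cx_extension(smiles: str) -> str:
    """Return the core SMILES with any trailing extension block removed."""
    return _CX_SUFFIX.sub("", smiles).strip()


failing = (
    "CC[N+](CC1=CC(=CC=C1)S(O)(=O)=O)=C1C=CC(C=C1)=C(C1C=CC=CC=1)"
    "C1C=CC(=CC=1)N(CC1C=C(C=CC=1)S(O)(=O)=O)CC |c:14,t:21|"
)
print(has_cx_extension(failing))
print(strip_cx_extension(failing))
```

A check like this only looks at the text of the string. It does not confirm that the trimmed SMILES is chemically valid or that it matches the InChIKey.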

These are the corresponding records:

MSBNK-EPA-ENTACT_AGILENT001211
MSBNK-EPA-ENTACT_AGILENT001212
MSBNK-EPA-ENTACT_AGILENT001213
MSBNK-EPA-ENTACT_AGILENT001214
MSBNK-EPA-ENTACT_AGILENT001215
MSBNK-EPA-ENTACT_AGILENT001216

Please let us know what you think and whether we should fix, or if you'd prefer to submit new versions of these records.

Thanks, Emma

schymane commented 7 months ago

Brief update: email discussions are ongoing with @alexchao32 and Andrew McEachran - the error came from the Agilent PCDL and the current plan is to update their side ... stay tuned!

schymane commented 3 weeks ago

Hi all, just pinging @alexchao32 and @ChemConnector again on this one because, lo and behold, exactly this case just failed the PubChem deposition again due to the presence of extended SMILES ... can you please let us know how to update this record? (a) I see no reason for the extended SMILES, which caused the deposition to fail, and (b) the information is mismatched between salt and non-salt species, and it's not quite clear to me how to resolve this.

https://massbank.eu/MassBank/RecordDisplay?id=MSBNK-EPA-ENTACT_AGILENT001214

https://comptox.epa.gov/dashboard/chemical/details/DTXSID3020671

https://pubchem.ncbi.nlm.nih.gov/compound/C.I.-Acid-green-3#section=DSSTox-Substance-ID https://pubchem.ncbi.nlm.nih.gov/compound/20803

https://comptox.epa.gov/dashboard/chemical/details/DTXSID50859883 (does not exist)

Quoting from before:

Is there any reason for using the extended SMILES format (the |c:14,t:21| at the end)?

I found entries matching, or closely matching, the InChIKey and SMILES in both PubChem and CompTox, but see no evidence of an extended SMILES anywhere (I also asked @ChemConnector if he knew more). The DTXSID (DTXSID3020671) and the PubChem CID in the record (20803) actually point to a salt species.

The SMILES (without the extension at the end) and the InChIKey in the record would point to CID 20804. We fixed the deposition to use the InChIKey and ignore the SMILES.

Ideally, we'd need to clean up the identifiers in this record - do we need the extended SMILES or can the |c:14,t:21| be trimmed? Should we update DTXSID and PubChem CID to match the InChIKey? The molecular formula and mass match the InChIKey. The parent DTXSID5048001, however, has a different charge, and they do not seem to have an entry matching the InChIKey exactly.

schymane commented 3 weeks ago

@ChemConnector does not see a reason for the extended SMILES

schymane commented 3 weeks ago

@ChemConnector recommends we replace with this information and @alexchao32 agrees. https://comptox.epa.gov/dashboard/chemical/details/DTXSID5048001

@meier-rene will you look after this or should UniLu ( @anjuraj15 or @schymane ) try to update this for you?

schymane commented 2 weeks ago

Update: @anjuraj15 will submit a patch for @meier-rene to check and merge if OK. The precursor mass is a bit problematic now, if this is something that won't get through validation we might need to discuss further, or deprecate. For now I have set it as a non-standard adduct [M+2H]+. If anyone has a better idea how to deal with this let us know! The intensities look rather low ..

RECORD_TITLE: FD and C Green No. 1; ESI-QTOF; MS2; CE: 40; [M+2H]+
…
CH$NAME: FD and C Green No. 1
CH$NAME: DTXSID5048001
CH$COMPOUND_CLASS: N/A
CH$FORMULA: C37H36N2O6S2
CH$EXACT_MASS: 668.201479
CH$SMILES: CCN(CC1=CC(=CC=C1)S(O)(=O)=O)C1=CC=C(C=C1)C(C1=CC=CC=C1)=C1C=CC(C=C1)=[N+](CC)CC1=CC(=CC=C1)S([O-])(=O)=O
CH$IUPAC: InChI=1S/C37H36N2O6S2/c1-3-38(26-28-10-8-14-35(24-28)46(40,41)42)33-20-16-31(17-21-33)37(30-12-6-5-7-13-30)32-18-22-34(23-19-32)39(4-2)27-29-11-9-15-36(25-29)47(43,44)45/h5-25H,3-4,26-27H2,1-2H3,(H-,40,41,42,43,44,45)
CH$LINK: CAS 6638-02-4
CH$LINK: INCHIKEY SRRJCDUOSQWHGS-UHFFFAOYSA-N
CH$LINK: PUBCHEM CID:73226
…
MS$FOCUSED_ION: PRECURSOR_TYPE [M+2H]+

meier-rene commented 2 weeks ago

Hi @schymane, @anjuraj15 has prepared the fix. We don't validate precursor mass, so nothing will break. Do you think it's necessary to make this a hotfix or can we wait for the next data release?

Besides the question of data release preparation I don't understand how that molecule works. I'm not an MS expert. If you have time, maybe you can explain it to me.

The neutral molecule has a monoisotopic mass of 668.2

In neg mode (MSBNK-EPA-ENTACT_AGILENT001211.txt) it gets deprotonated. So I would expect the precursor to be [M-H]- with a precursor m/z of 667.2. But I must be wrong because the spectrum has a base peak of 668.2 and the record says [M-2H]- and precursor mass 668.2.

In pos mode (MSBNK-EPA-ENTACT_AGILENT001212.txt) it gets protonated. So I would expect the precursor to be [M+H]+ with a precursor m/z of 669.2. But again I must be wrong because the spectrum has a base peak of 670.2 and the record says [M+2H]+ and a precursor mass of 670.2.

I don't understand how that all fits together. The mass difference between the pos mode base peak and the neg mode base peak should be 4 given [M-2H]- and [M+2H]+. But it's just 2.

It would make sense if the monoisotopic mass were 669.2, but that's the positively charged molecule. Is it possible that the charged molecule [C37H37N2O6S2]+ picks up an electron during ESI and turns neutral? Then it would be a radical. Does that happen during ESI? I don't understand this molecule...
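The arithmetic in this comment can be checked with a short sketch. It assumes the neutral monoisotopic mass from the CH$EXACT_MASS line quoted earlier in the thread and the standard masses of a hydrogen atom and an electron. The function name and the adduct table are illustrative only and are not taken from any MassBank tool.

```python
# Standard monoisotopic masses in unified atomic mass units.
H_ATOM = 1.00782503    # hydrogen atom (1H)
ELECTRON = 0.00054858  # electron


def adduct_mz(neutral_mass: float, delta_h: int, charge: int) -> float:
    """m/z of an ion built from the neutral molecule by adding delta_h
    hydrogen atoms and then removing `charge` electrons (a negative
    charge means electrons are added)."""
    if charge == 0:
        raise ValueError("an ion needs a non-zero charge")
    return (neutral_mass + delta_h * H_ATOM - charge * ELECTRON) / abs(charge)


M = 668.201479  # CH$EXACT_MASS from the record excerpt in this thread

# (hydrogen atoms added, charge) for each adduct label discussed here
ADDUCTS = {
    "[M-2H]-": (-2, -1),
    "[M-H]-": (-1, -1),
    "[M]-": (0, -1),
    "[M+H]+": (1, 1),
    "[M+2H]+": (2, 1),
}

for name, (delta_h, charge) in ADDUCTS.items():
    print(f"{name:8s} {adduct_mz(M, delta_h, charge):.4f}")
```

Under these assumptions, [M-H]- and [M+H]+ come out near 667.2 and 669.2 as expected above, [M-2H]- and [M+2H]+ are about 4 apart, and only [M]- and [M+2H]+ fall near the reported base peaks of 668.2 and 670.2.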

schymane commented 2 weeks ago

Hi @meier-rene you have hit the nail on the head. In correcting the chemical information to the correct species, we now have mismatches with the precursors that don't make chemical sense. It is why I thought we may actually just want to deprecate these records (I think the data has been extracted using the wrong masses).

I have commented on the pull request from Anjana; a quick fix for the current state would be to update the negative record to [M]- (apologies, my bad, I gave her an example for pos and forgot to cover the neg case).

Or ... we provide chemical information for the positive species (removing the extended SMILES part) so that the precursor and adduct states match (but it still doesn't make chemical sense). We can't update the precursor mass as they are clearly the masses in the peak lists. But for that case, the positive species [M+H] would be 2+ (z is wrong) or M+ (precursor mass is wrong) and the negative [M-H]- would be neutral not neg ... so basically, no solution is good here.

I don't think we need a hotfix, I saw that Michele just committed a whole lot of Eawag spectra, maybe we can aim for a 2024.07 release adding new spectra and fixing this issue?

For these spectra, do you prefer deprecation, positive species (previous state, but still needs SMILES patched and other identifiers to be checked, we would have to do a different update), or neutral species with strange adducts (fixing the neg one to [M]-)? The more I think about it, the more I lean to deprecation.

meier-rene commented 1 week ago

I changed the PRECURSOR_TYPE for the neg mode spectra. Now the SMILES, PRECURSOR_TYPE, PRECURSOR_M/Z and peaks are "ok". Just the [M]- and [M+2H]+ are a bit weird, because these would be open-shell species. I'm not an expert on what kind of chemistry happens at the tip of such an ESI needle, but given that this is a huge resonance-stabilized molecule I would not totally rule that out.

However, the issues with the PubChem deposition should be solved and I will close that issue.