MassBank / MassBank-data

Official repository of open data MassBank records

CASMI2016 records with compound/spectrum mismatch #9

Open schymane opened 6 years ago

schymane commented 6 years ago

A user reported that SM858902 and SM858951 contain spectral data from acetylsulfamethoxazole but are labeled diphenhydramine (thank you!). On closer inspection, we seem to have a mismatch between the ID and the precursor/peaks for 3 IDs (4 records) in a series, surrounded by records that look OK; the series is "broken" by missing IDs in the middle. We also need to find the cause in https://github.com/MassBank/RMassBank

This should not be passing any form of validation; a screening of the entire CASMI2016 database would be extremely useful for debugging the cause and flagging how and how many records to fix, thank you @meier-rene in advance if you can :-)

From what I can see:

** this one looks OK.
ACCESSION: SM858203
RECORD_TITLE: Cetirizine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C21H25ClN2O3
CH$EXACT_MASS: 388.15537
MS$FOCUSED_ION: PRECURSOR_M/Z 389.1626
389.1626 C21H26ClN2O3+ 1 389.1626 -0.05

** this one looks OK.
ACCESSION: SM858353
RECORD_TITLE: 2-Hydroxycarbamazepine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C15H12N2O2
CH$EXACT_MASS: 252.08988
MS$FOCUSED_ION: PRECURSOR_M/Z 251.0826
251.0827 C15H11N2O2- 1 251.0826 0.4

[no records with IDs between 8583 and 8588]

** here something has gone wrong
ACCESSION: SM858801
RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C23H36N2O2
CH$EXACT_MASS: 372.27768
MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696

** here something has gone wrong
ACCESSION: SM858902
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 296.07

** still wrong ... it's using the same (wrong) exact mass to get an equivalent wrong precursor
ACCESSION: SM858951
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 294.0554

** still wrong:
ACCESSION: SM859002
RECORD_TITLE: Acetyl-sulfamethoxazole; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C12H13N3O4S
CH$EXACT_MASS: 295.06268
MS$FOCUSED_ION: PRECURSOR_M/Z 325.1711
325.171 C20H22FN2O+ 1 325.1711 -0.17 <= we have F annotations!!!!!

[no 8591]

** and now everything seems OK again ...
ACCESSION: SM859203
RECORD_TITLE: Amitriptyline; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C20H23N
CH$EXACT_MASS: 277.18305
MS$FOCUSED_ION: PRECURSOR_M/Z 278.1903
278.1904 C20H24N+ 1 278.1903 0.42

schymane commented 6 years ago

So, I just ran getMBRecordInfo (https://github.com/schymane/ReSOLUTION/) on the CASMI2016 directory from the OpenData SVN, extracting the precursor and exact mass automatically. Checking the difference flags exactly these 4 records, and only these, as having a precursor that is not ~1.007 above or below the exact mass: SM858801, SM858902, SM858951, SM859002
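For anyone who wants to repeat this kind of check without ReSOLUTION, here is a minimal Python sketch of the idea. It is not how getMBRecordInfo works internally; the function names, the 0.01 Da tolerance, and the restriction to [M+H]+ / [M-H]- ions are assumptions made for illustration. The mass values are the ones quoted in the records earlier in this thread.

```python
import re

PROTON_MASS = 1.007276  # Da; expected |precursor - exact mass| for [M+H]+ / [M-H]-

def mass_fields(record_text):
    """Parse (exact_mass, precursor_mz) from the text of a MassBank record."""
    exact = re.search(r"^CH\$EXACT_MASS:\s*([\d.]+)", record_text, re.MULTILINE)
    prec = re.search(r"^MS\$FOCUSED_ION:\s*PRECURSOR_M/Z\s+([\d.]+)",
                     record_text, re.MULTILINE)
    return float(exact.group(1)), float(prec.group(1))

def is_mismatch(exact_mass, precursor_mz, tol=0.01):
    """True if the precursor is not one proton mass away from the exact mass.

    tol (in Da) is an assumed threshold, not a value taken from the thread.
    """
    return abs(abs(precursor_mz - exact_mass) - PROTON_MASS) > tol

# (CH$EXACT_MASS, PRECURSOR_M/Z) as quoted in the records above.
records = {
    "SM858203": (388.15537, 389.1626),
    "SM858353": (252.08988, 251.0826),
    "SM858801": (372.27768, 256.1696),
    "SM858902": (255.16231, 296.07),
    "SM858951": (255.16231, 294.0554),
    "SM859002": (295.06268, 325.1711),
    "SM859203": (277.18305, 278.1903),
}

flagged = sorted(acc for acc, (m, p) in records.items() if is_mismatch(m, p))
print(flagged)  # ['SM858801', 'SM858902', 'SM858951', 'SM859002']
```

Other adducts (for example [M+Na]+ or [M+NH4]+) would need their own expected mass shifts, so a real screen of a whole database should read the ion type from each record rather than assume protonation or deprotonation.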

schymane commented 5 years ago

Thanks to diagnosis from Herbert Oberacher the case is now clear (see issue online for case history):

SM858801 is diphenhydramine
SM858902 and SM858951 are Acetyl-sulfamethoxazole
SM859002 is citalopram
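As a quick plausibility check on this assignment, the precursor m/z of each flagged record can be converted back to a neutral mass and compared with candidate compounds. The sketch below is illustrative only: the helper names and the 0.01 Da tolerance are assumptions, and the citalopram mass is not quoted in this thread but computed here from the formula C20H21FN2O. A matching mass is consistent with the diagnosis but does not prove identity on its own, since isobaric compounds exist.

```python
PROTON_MASS = 1.007276  # Da

# Neutral monoisotopic masses. The first three are CH$EXACT_MASS values quoted
# in this thread; the citalopram value is an assumption computed from C20H21FN2O.
CANDIDATES = {
    "Finasteride": 372.27768,
    "Diphenhydramine": 255.16231,
    "Acetyl-sulfamethoxazole": 295.06268,
    "Citalopram": 324.16379,
}

# (precursor m/z, ion mode) taken from the four flagged records.
FLAGGED = {
    "SM858801": (256.1696, "+"),
    "SM858902": (296.07, "+"),
    "SM858951": (294.0554, "-"),
    "SM859002": (325.1711, "+"),
}

def neutral_mass(precursor_mz, mode):
    """Undo protonation ([M+H]+) or deprotonation ([M-H]-)."""
    if mode == "+":
        return precursor_mz - PROTON_MASS
    return precursor_mz + PROTON_MASS

def best_match(precursor_mz, mode, tol=0.01):
    """Return the first candidate whose neutral mass is within tol Da, else None."""
    mass = neutral_mass(precursor_mz, mode)
    for name, candidate_mass in CANDIDATES.items():
        if abs(candidate_mass - mass) <= tol:
            return name
    return None

assignments = {acc: best_match(mz, mode) for acc, (mz, mode) in FLAGGED.items()}
for acc, name in assignments.items():
    print(acc, "->", name)
```

With these inputs, each flagged accession maps to the compound named in the diagnosis, and none of them maps to finasteride.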

So, how to update? If I update the compound information to match the spectra then we will have a mismatch between the internal IDs, UFZ IDs and the MassBank accession numbers. However if I change to the correct internal IDs we'll be changing accession numbers and I think this is worse. If I hear nothing back I will correct the compound information in these four records and send along updates when I get a chance.

@meier-rene @tsufz @meowcat

meier-rene commented 5 years ago

Is deleting the incorrect records and adding new and correct records an option?

schymane commented 5 years ago

Well, the records need to be fixed, this is for sure. However, if I correct the processing error, we will end up with new accession numbers. I am not sure this is the right way to fix it in this case though. This is the compound list ... it is still inexplicable how this happened as it's kind of impossible the way that RMassBank works, but something certainly went wrong! According to the compound list, 8588 is certainly meant to be Finasteride but ended up as the compound info of finasteride with the spectral data of diphenhydramine ... do you see the problem? If I now reprocess then the SM858801 record will turn into SM858901 and SM858902 will become SM859002 ... I think best would be to update the compound info with the current accession numbers otherwise we are going to run into awful versioning problems?

[image: compound list]

meowcat commented 5 years ago

I understand the problem - is it a reasonable option to upload the records under a tag that is not SM? In that way the new, say SZ records will have the correct internal ID, and the old ones should be marked obsolete... Just an idea. Not yet thought through.

schymane commented 5 years ago

Quite honestly I don’t really want to reprocess them all as it was an incredibly complicated process and it’s only three records (although I have a few others with issues too). It’s made more difficult by the fact that we have several scans (multiple precursors but one CE) and thus the last number in the accession also shifts … it is highly unlikely we’ll use those internal IDs again as it was a once-off dataset. I’d prefer for now to update the compound info and leave a trace in the COMMENT field. We have a couple of others we’ll likely have to deprecate and a couple where I need input from Martin first.

schymane commented 5 years ago

OK here goes with a complicated update to address issues in the CASMI spectra, I suggest @meier-rene implement this at the MassBank-data side, and I'll double check to confirm once done, and comment the commit where necessary (@meier-rene other ideas welcome if you see an alternative). This has been double checked with the data source (Martin Krauss). Note for the record: NONE of these issues actually affected the CASMI contest. It was an inadvertent upload of files that were extracted but eliminated during quality control for the contest. But we need to fix the database now ;-)

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM872102 This is a spectrum of Exemestane (identical SPLASH), please update the compound information in SM872102 to match the compound information of Exemestane in this record: https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM873802&dsn=CASMI_2016

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM871901 This is a spectrum of Trenbolone (identical SPLASH), please update the compound information in SM871901 to match the compound information of Trenbolone in this record: https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM874601&dsn=CASMI_2016

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM840901 This should be simazine, please take the compound information from SM841901. The analytical information is correct.

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM841901 This should be Desethylterbutylazine, please take the compound information from SM840901. The analytical information is correct.

The other ones we need to correct are indicated above, i.e. SM858801 is diphenhydramine => please take compound information from the current https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM858902&dsn=CASMI_2016 The analytical information is correct.

SM858902 and SM858951 are Acetyl-sulfamethoxazole => please take compound information from the current https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM859002&dsn=CASMI_2016 The analytical information is correct.

SM859002 is citalopram => please take compound information from an existing record, for instance: https://massbank.eu/MassBank/RecordDisplay.jsp?id=EA290112&dsn=Eawag The analytical information is correct.

With compound information I'm referring to the CH$ entries, i.e.
CH$NAME: Diphenhydramine
CH$NAME: 2-benzhydryloxy-N,N-dimethylethanamine
CH$COMPOUND_CLASS: N/A; Environmental Standard
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
CH$SMILES: CN(C)CCOC(c1ccccc1)c1ccccc1
CH$IUPAC: InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3
CH$LINK: CAS 58-73-1
CH$LINK: CHEBI 4636
CH$LINK: KEGG D00300
CH$LINK: PUBCHEM CID:3100
CH$LINK: INCHIKEY ZZVUWRFHKOJYTH-UHFFFAOYSA-N
CH$LINK: CHEMSPIDER 2989
CH$LINK: COMPTOX DTXSID4022949
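The update described above (keep the accession and analytical information, take the CH$ block from another record, leave a trace in COMMENT) could be scripted roughly as follows. This is only a sketch: the function name, the wording of the comment, and the placement of the COMMENT line directly before the CH$ block are assumptions, and the two records are abbreviated to the fields quoted in this thread rather than being complete MassBank records.

```python
def swap_compound_info(target_record, donor_record, note):
    """Replace the CH$ lines of target_record with those of donor_record.

    All other lines (accession, analytical information, peaks) are kept as they
    are, and a COMMENT line with `note` is added just before the new CH$ block.
    """
    donor_ch = [ln for ln in donor_record.splitlines() if ln.startswith("CH$")]
    out = []
    replaced = False
    for ln in target_record.splitlines():
        if ln.startswith("CH$"):
            if not replaced:
                out.append("COMMENT: " + note)
                out.extend(donor_ch)
                replaced = True
            continue  # drop the old (wrong) compound information
        out.append(ln)
    if not replaced:
        raise ValueError("target record has no CH$ lines to replace")
    return "\n".join(out) + "\n"

# Abbreviated records, using only fields quoted in this thread.
target = (
    "ACCESSION: SM858801\n"
    "RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+\n"
    "CH$FORMULA: C23H36N2O2\n"
    "CH$EXACT_MASS: 372.27768\n"
    "MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696\n"
)
donor = (
    "ACCESSION: SM858902\n"
    "CH$NAME: Diphenhydramine\n"
    "CH$FORMULA: C17H21NO\n"
    "CH$EXACT_MASS: 255.16231\n"
)

# The comment text is a hypothetical example.
fixed = swap_compound_info(target, donor, "Compound information corrected, see MassBank-data issue #9")
print(fixed)
```

Note that this sketch leaves RECORD_TITLE untouched, and the title also carries the compound name, so it would still need to be corrected separately.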

tsufz commented 5 years ago

@schymane Who should curate this data?

schymane commented 5 years ago

I hoped @meier-rene could do this but if not someone just needs to update the files, all the info is there ...

tsufz commented 5 years ago

@schymane Come on, you generated them, so why don't you curate them yourself? Or were they copied from, for example, UFZ records?

schymane commented 5 years ago

At one point Rene said he'd do things centrally. This one is tough and I see why he didn't update it, I'll do it when I have a chance but I currently don't have time. Likely during Biohackathon. If you get to it first I'll be overjoyed. If not I'll do it when I get the chance ..

tsufz commented 5 years ago

Okay, first come, first served.

schymane commented 5 years ago

So, the move to the dev branch after I had forked the MassBank-data repo has caused a lot of unexpected issues. @meier-rene is walking me through fixing this before we can change anything. I've had to delete the whole repo and hope that starting from scratch will fix things. Still cloning ..