MassBank / MassBank-data

Official repository of open data MassBank records

Two different chemical names appear in the file and point to different molecular entities #255

Open LiuLime opened 4 months ago

LiuLime commented 4 months ago

Hello,

While cleaning the MassBank data, I found the following:

(1) Confusing name records

In the datasets provided by the MSBNK-ACES_SU institute, there are suspicious records. Some files give two different chemical names that represent totally different molecules.

[Screenshot 2024-02-22 12 09 26]

Based on the exact mass and formula in the example, I judge that the first name is correct; the second one may, for some reason, have been uploaded by mistake.

Here are a few suspicious records I found (more may exist):

MSBNK-ACES_SU-AS000181
MSBNK-ACES_SU-AS000133
MSBNK-ACES_SU-AS000121
MSBNK-ACES_SU-AS000110
MSBNK-ACES_SU-AS000089
MSBNK-ACES_SU-AS000004
MSBNK-ACES_SU-AS000160
MSBNK-ACES_SU-AS000201

(2) In the MASSBANK_Athens_Univ records, the CAS number may refer to a different form of the molecule than the one the author intended to upload.

[Screenshot 2024-02-22 13 03 46]

The blue arrow indicates the search result for the uploaded CAS number, and the red arrow indicates the CAS number I think is correct.

Because CAS numbers are very useful for characterizing molecules precisely, especially for distinguishing isomers (better than InChI or InChIKey, which sometimes cannot distinguish isomers with different conformations), it would be helpful if they were correct :)

Thank you very much!

schymane commented 4 months ago

Thank you for the feedback about these records...

@meier-rene I can confirm that it seems MSBNK-Athens_Univ-AU113805 and MSBNK-Athens_Univ-AU113804 should have the CAS number corrected to 20574-50-9. The CAS number appears to be correct in MSBNK-Athens_Univ-AU113801. Do you wish to do this update on your side?

For some reason these records are not appearing in PubChem, I have reported this separately, but the CAS is aligned with the CAS on PubChem where the equivalent MoNA records appear as well.

I will need more time to check the name issue.

schymane commented 4 months ago

Regarding the names in the ACESx spectra, some look OK to me, some look like they need fixing. If you disagree with my assessment can you please provide more exact details in your report so we know what exact issue you mean?

Records that you flagged that seem OK to me

Records that you flagged that seem to have errors that need fixing

@meier-rene how do you wish to coordinate this?

schymane commented 4 months ago

Oh and re this:

For some reason these records are not appearing in PubChem, I have reported this separately.

...Jeff found the issue and is reparsing our data, should be fixed soon.

LiuLime commented 4 months ago

Thank you very much for your quick response. Now I understand some of the rules for uploaded data.

Here is how I did the comparison: to confirm that the InChI in a record is correct, I compared the InChI recorded in the original file with the InChI recorded in a database.

The database InChI is obtained by searching with an identifier (the priority is CAS > PubChem CID > ChEBI ID > KEGG ID > name). If the originally recorded InChI differs from the database InChI, I mark the record as an "inconsistent object".
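The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the analysis: the function names, the `resolve` callback and the toy lookup table are all hypothetical, and a real run would plug in an actual database query in place of `toy_resolve`.

```python
from typing import Callable, Optional

# Identifier priority described above: CAS > PubChem CID > ChEBI ID > KEGG ID > name.
# The dictionary keys used here are hypothetical labels chosen for this sketch.
PRIORITY = ["CAS", "PUBCHEM", "CHEBI", "KEGG", "NAME"]


def pick_identifier(identifiers: dict) -> Optional[tuple]:
    """Return the (kind, value) pair with the highest priority, or None."""
    for kind in PRIORITY:
        value = identifiers.get(kind)
        if value:
            return kind, value
    return None


def check_record(recorded_inchi: str, identifiers: dict,
                 resolve: Callable[[str, str], Optional[str]]) -> str:
    """Compare the InChI stored in a record with the InChI a database returns."""
    choice = pick_identifier(identifiers)
    if choice is None:
        return "no identifier"
    database_inchi = resolve(*choice)
    if database_inchi is None:
        return "not found"
    if database_inchi == recorded_inchi:
        return "consistent"
    return "inconsistent object"


# Toy stand-in for a real database lookup (illustration only).
TOY_DB = {("CAS", "7732-18-5"): "InChI=1S/H2O/h1H2"}


def toy_resolve(kind: str, value: str) -> Optional[str]:
    return TOY_DB.get((kind, value))


print(check_record("InChI=1S/H2O/h1H2", {"CAS": "7732-18-5", "NAME": "water"}, toy_resolve))
print(check_record("InChI=1S/CH4/h1H4", {"CAS": "7732-18-5"}, toy_resolve))
```

Note that an "inconsistent object" result only says that the two sources disagree; it does not say which of them is wrong.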

I think the problem is caused by two points:

Please see the results I collated in this Excel file: ans.xlsx

Next, I will re-analyze the data using only PubChem, instead of NIH, to search for identifiers. Thank you very much for your kind explanation.

schymane commented 4 months ago

Hi @LiuLime please note that as a structure-oriented database (since the mass spectra are connected to the structures) our MassBank validation procedures differ slightly from yours. Our highest priority goes to SMILES (the displayed structure comes directly from the SMILES), from there we check mass, formula, InChI and InChIKeys, and the database identifiers are secondary information and often retrieved and provided by contributors in a variety of ways. Some we can validate, but not all. Since CAS numbers are not public and the database requires a license, we cannot validate these automatically from the original source, so these should be taken with caution. Likewise, we are unable to validate ChemSpider identifiers as they do not provide an unlimited API. Often we have for instance people who provide the CAS of the standard (sometimes a salt form), but the structure itself is the neutral molecule as this is what the spectrum corresponds to. Hence the SMILES (and corresponding InChI) should be your reference where possible. For the NAME field, you should give priority to the first entry, not subsequent ones, as this is also our display name and we give priority to this first field. Please note we have extensive documentation and the validation code is here. You can also refer to our Record Specification to see the priority we give to entries (compulsory vs optional).
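As a rough illustration of the priorities described above (first NAME entry as display name, SMILES as the structural reference, database links as secondary information), a record could be read like this. It is a minimal sketch and not the MassBank validation code: `parse_record` is a hypothetical helper, and the sample text is a hand-written toy record with a made-up accession, not an actual MassBank entry.

```python
def parse_record(text: str) -> dict:
    """Collect the fields discussed above from MassBank-style record text."""
    names, links, fields = [], {}, {}
    for line in text.splitlines():
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # not a "TAG: value" line
        value = value.strip()
        if tag == "CH$NAME":
            names.append(value)  # keep the order in which names appear
        elif tag == "CH$LINK":
            database, _, identifier = value.partition(" ")
            links[database] = identifier
        elif tag in ("CH$SMILES", "CH$IUPAC", "CH$FORMULA", "CH$EXACT_MASS"):
            fields[tag] = value
    return {
        "display_name": names[0] if names else None,  # first NAME entry wins
        "other_names": names[1:],
        "smiles": fields.get("CH$SMILES"),            # structural reference
        "inchi": fields.get("CH$IUPAC"),
        "links": links,                               # secondary information
    }


# Hand-written toy record (hypothetical accession, for illustration only).
SAMPLE = """ACCESSION: MSBNK-Example-EX000001
CH$NAME: Aspirin
CH$NAME: Acetylsalicylic acid
CH$FORMULA: C9H8O4
CH$EXACT_MASS: 180.04226
CH$SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
CH$LINK: CAS 50-78-2
CH$LINK: PUBCHEM CID:2244
"""

info = parse_record(SAMPLE)
print(info["display_name"], info["smiles"], info["links"])
```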

As an update, ACES have provided us with some fixes to their records and we are discussing how to implement - thanks for identifying the issues!

tsufz commented 4 months ago

@LiuLime, thanks so much for reporting this issue. Just to add to @schymane's comments: searching by CAS is sometimes tricky, as different CAS numbers can be "true". Some databases provide the currently active CAS number, e.g. CASfinder (of course) and the US EPA Chemical Dashboard. Others may show other CAS numbers, which are not wrong, but may be outdated.

@schymane, this is an interesting topic. Do you know which CAS is provided by a PUG query?

(I cannot check for the given examples as MassBank is down).

schymane commented 4 months ago

PubChem cannot return CAS directly; they can only provide CAS numbers that are given as synonyms, and these are depositor-contributed (this is one of their examples).

The "Full Record Retrieval" does not appear to retrieve headings in the records to me, I guess this would have to be done via annotations somehow. @PaulThiessen please correct me if I am wrong ...

PaulThiessen commented 4 months ago

Yeah it's complicated... Some CAS in PubChem come from depositor-supplied synonyms as Emma mentioned above. And some come from totally separate 3rd party annotation streams, e.g.

https://pubchem.ncbi.nlm.nih.gov/compound/2244#section=CAS

We don't have a way to get CAS-as-synonym specifically out of the list of depositor synonyms (though it's something we're thinking about how to do). You can get it from annotations like this, although it's a structured response...

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2244/JSON?heading=CAS

Note that PubChem's coverage of CAS is far from complete, and we aren't authoritative - there could be errors or conflicting results.
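A minimal sketch of working with the annotation URL above: it builds the same PUG View URL for a CID and pulls CAS-like strings out of a decoded JSON response. The helper names are hypothetical, and `sample_response` is a hand-written stand-in, so its nesting is an assumption rather than a copy of PubChem's output; the walker therefore does not depend on any particular nesting. The check-digit test only catches malformed or mistyped numbers; it cannot tell whether a CAS number belongs to the compound in question.

```python
import re

CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")


def pug_view_cas_url(cid: int) -> str:
    """Build the PUG View URL that requests the CAS heading for one CID."""
    return ("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"
            f"{cid}/JSON?heading=CAS")


def cas_check_digit_ok(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    if not CAS_PATTERN.match(cas):
        return False
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def collect_cas(node) -> list:
    """Walk a decoded JSON structure and collect unique, well-formed CAS strings."""
    found = []

    def walk(item):
        if isinstance(item, dict):
            for value in item.values():
                walk(value)
        elif isinstance(item, list):
            for value in item:
                walk(value)
        elif isinstance(item, str) and cas_check_digit_ok(item) and item not in found:
            found.append(item)

    walk(node)
    return found


# Hand-written stand-in for a decoded response (assumed shape, illustration only).
sample_response = {
    "Record": {
        "RecordNumber": 2244,
        "Section": [{
            "TOCHeading": "CAS",
            "Information": [
                {"Value": {"StringWithMarkup": [{"String": "50-78-2"}]}},
                {"Value": {"StringWithMarkup": [{"String": "50-78-2"}]}},
            ],
        }],
    }
}

print(pug_view_cas_url(2244))
print(collect_cas(sample_response))
```

Since different annotation sources can disagree, a list with more than one entry is a reason to look at the record by hand rather than to pick one automatically.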

LiuLime commented 4 months ago

@schymane Thank you so much for the explanation and for providing these links; they are helpful. @tsufz @PaulThiessen Thank you so much for the discussion about CAS, which clears up my confusion about the CAS numbers in records. I had assumed that CAS numbers in records came from commercial standards, whose vendors would provide accurate CAS numbers to depositors, so it was quite unexpected to find unusual examples.

It seems that for MassBank data, it's better to use isomeric SMILES instead of these database identifiers.


Another tentative idea: how about using the NCI/CADD Chemical Identifier Resolver to search for CAS numbers?

Results can be obtained by constructing a query URL. Example
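Constructing such a query URL might look like the sketch below. It assumes the resolver's `/chemical/structure/<identifier>/<representation>` URL scheme and the `cas` and `stdinchi` representation names; `cir_url` is a hypothetical helper, and the code only builds the URL without sending a request.

```python
from urllib.parse import quote

# Base URL of the NCI/CADD Chemical Identifier Resolver (assumed URL scheme).
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier: str, representation: str) -> str:
    """Build a resolver URL asking for one representation of one identifier."""
    # Percent-encode the identifier so SMILES characters such as '=' or '#'
    # survive being placed in a URL path.
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"


print(cir_url("aspirin", "cas"))
print(cir_url("CC(=O)O", "stdinchi"))
```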

The potential issue is that NIH may differ from PubChem in some instances, so it also makes sense to use only PubChem for every identifier query to keep things consistent.

May I have your thoughts?

Have a great day!

schymane commented 4 months ago

Indeed, for MassBank it's better to rely on the structural data, as this is our focus. Our records are provided by a variety of contributors, who each use different ways to find and supply CAS numbers: some take them from the standards, others from services like PubChem and CACTVS, as you point out. One of our workflows, RMassBank, uses a combination of these (see here). Both CACTVS and PubChem are from NIH, but from different sections (NCI vs NCBI), and both have the same issue with CAS: the only authoritative source is the CAS Registry, which requires a license. This makes it difficult to verify any CAS numbers in a fully open science workflow such as MassBank. The CAS Common Chemistry set exists, but it contains only 500K compounds and does not cover all the compounds in MassBank. It is not clear to me exactly what you wish to do in your workflow, or why, for example, you want to rely on CAS rather than structural identifiers. We give CAS numbers lower priority than other identifiers for the reasons above (but display them because users find them useful) and collaborate with PubChem and other resources to provide the best alternatives we can (more thoughts on identifiers in DOI: 10.1186/s13321-021-00520-4). For our integration with PubChem we rely on SMILES, InChI and InChIKeys, and thus have PubChem CIDs corresponding to all our records (barring a few special cases). See DOI: 10.1039/D3EM00181D and the MassBank EU Data Source.