MassBank / MassBank-data

Official repository of open data MassBank records

External report: issues with conflicting stereochemistry in identifiers #70

Open schymane opened 5 years ago

schymane commented 5 years ago

Copy-paste from email received; @meier-rene are you able to follow-up? Thx!

Comparing data from different databases, I found some discrepancies in your data. For the following entry in your database (https://massbank.eu/MassBank/RecordDisplay.jsp?id=OUF00136), the chemical structure indicates that the configuration of the double bond is not defined. Other databases define this configuration, with InChIKey CWVRJTMFETXNAD-NCZKRNLISA-N:

See:

PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/9476
ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:95271
ChEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3186431/
EPA: https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID3024786

Could you please check whether your entry is defined correctly, that is, whether the chemical structure is the correct one or whether the structural identifiers are wrong?

The problem is the same for other entries such as FIO00619, JP000136, FIO00623..., where the chemical structure does not match the stereoconfiguration encoded by InChIKey CWVRJTMFETXNAD-JUHZACGLSA-N. This InChIKey requires the definition of the 4 chiral carbons on the ring. Please see:

ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16112
ChEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL284616/

tsufz commented 5 years ago

Well, another good example of why MassBank metadata needs curation. People now approach us frequently, which is a good sign that the community is interested in MassBank. However, if errors are not handled, people will lose trust in the data. We are on a good path.

schymane commented 5 years ago

Next answer, plus a list of affected identifiers. @egonw is following this up on the Wikidata side. @meier-rene @Treutler we will need to follow this up on the MassBank side to address the immediate issue, plus add some ideas on how to catch these cases in the validator. I think we may be able to do this by checking identifiers for consistency and flagging clashes? https://github.com/MassBank/MassBank-web/issues/158

As we agree about the problem with the chemical structure and the structural identifiers (mainly InChIKey and InChI), I can provide a full list of MassBank entries to check. I am curating chemical entries in Wikidata and I found that somebody uploaded all MassBank entries with InChIKey = CWVRJTMFETXNAD-JUHZACGLSA-N to the wrong item. I did not check all entries mentioned on that page https://www.wikidata.org/wiki/Q27167119 (scroll down to find the MassBank identifiers), but I think that most of the entries have no defined chiral centers and, according to the chemical structure, should have the InChIKey = CWVRJTMFETXNAD-UHFFFAOYSA-N.

The list:

JP000136 FIO00618 FIO00619 FIO00620 FIO00621 FIO00622 FIO00623 FIO00624 FIO00625 FIO00626 FIO00627 PB005541 PB006181 PB006182 KO000466 KO000467 KO000468 KO000469 KO000470 KO002577 KO002578 KO002579 KO002580 KO002581 KO008922 KO008923 OUF00135 OUF00136
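The three InChIKeys quoted in this thread share the first block and differ only after the first hyphen. Below is a minimal Python sketch of how such keys could be compared block by block. The helper names `split_inchikey` and `compare_keys` are made up for illustration and are not part of MassBank or its validator. The sketch relies only on the standard InChIKey layout: a 14-character connectivity block, a 10-character block whose first 8 characters hash the remaining layers (stereochemistry, isotopes), and a final protonation character.

```python
from typing import NamedTuple

# Second-block hash of a standard InChIKey whose InChI has no stereo
# or isotopic layers.
NO_STEREO_HASH = "UHFFFAOY"


class KeyParts(NamedTuple):
    connectivity: str  # 14 characters, hash of the skeleton layers
    other_layers: str  # 8 characters, hash of stereo/isotope layers
    flags: str         # 2 characters: standard flag and version
    protonation: str   # 1 character


def split_inchikey(key: str) -> KeyParts:
    """Split an InChIKey into its blocks, checking only the basic layout."""
    blocks = key.strip().split("-")
    if len(blocks) != 3 or [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return KeyParts(blocks[0], blocks[1][:8], blocks[1][8:], blocks[2])


def compare_keys(a: str, b: str) -> str:
    """Classify how two InChIKeys relate (hypothetical helper)."""
    pa, pb = split_inchikey(a), split_inchikey(b)
    if pa == pb:
        return "identical"
    if pa.connectivity != pb.connectivity:
        return "different connectivity"
    if pa.other_layers != pb.other_layers:
        return "same connectivity, different stereo/isotope layers"
    return "same connectivity, different flags or protonation"


if __name__ == "__main__":
    keys = [
        "CWVRJTMFETXNAD-NCZKRNLISA-N",
        "CWVRJTMFETXNAD-JUHZACGLSA-N",
        "CWVRJTMFETXNAD-UHFFFAOYSA-N",
    ]
    for key in keys:
        parts = split_inchikey(key)
        has_extra = parts.other_layers != NO_STEREO_HASH
        print(key, "has stereo/isotope layers" if has_extra else "no stereo layer")
    print(compare_keys(keys[1], keys[2]))
```

A check like this only says that two keys disagree in the stereo part. It cannot say which of the two is the right one for a given record.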

egonw commented 5 years ago

I want to stress that this is caused neither by our data import into Wikidata nor by MassBank. This example is caused by a merger of two Wikidata items with different InChIKeys. I'm still exploring how this happened, as the person who did it is an experienced chemist. These things do happen because of inconsistencies in Wikipedia, and if you clean them up, it can have downstream effects that are not always easy to detect (without automated, regular tests).

schymane commented 5 years ago

So, if this is not caused by problems on the MassBank side, we just need to double-check that these records have structural identifiers that are consistent within themselves (https://github.com/MassBank/MassBank-web/issues/158#issuecomment-494272585), and if so, we close the issue on our side. Do I understand that correctly?

meier-rene commented 5 years ago

I don't fully understand the Wikidata part, but I understand that the current MassBank data might produce inconsistencies in external repositories because it's already inconsistent within MassBank. In this particular case the image of the structure is inconsistent with the structure in the InChI. The image is drawn from the SMILES field, which does not define the double bond configuration, so no trans double bond is depicted. The InChI, on the other hand, defines a trans double bond.

Summary: We have two sources of chemical structures, InChI and SMILES, in our record files and they are not always consistent. I have code for the validator (#158), but it's not activated because we currently have 10026 records with this kind of inconsistency. I cannot think of an automatic procedure to fix this at the moment.

schymane commented 5 years ago

How many unique InChIs are associated with the 10026 records? The useful breakdown would be (1) how many unique InChIKeys and (2) how many unique InChIKey first blocks ... because the number 10,026 sounds incredibly large, but there are surely at least an order of magnitude (hopefully two) fewer chemicals associated with this number of records? For my own curiosity it would also be useful to know which databases are the main sources of these errors, to see if we have anything systematic ...
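The breakdown asked for here could be computed along these lines. This is only a sketch: `summarize_keys` is a made-up name, and the input is assumed to be the plain list of InChIKeys taken from the flagged records, which is not how the validator necessarily stores them.

```python
from collections import Counter


def summarize_keys(inchikeys):
    """Count records, unique InChIKeys and unique first blocks.

    `inchikeys` is assumed to hold one InChIKey per flagged record.
    """
    cleaned = [k.strip().upper() for k in inchikeys if k.strip()]
    unique_keys = set(cleaned)
    # Count how many distinct full keys share each 14-character first block.
    first_blocks = Counter(k.split("-", 1)[0] for k in unique_keys)
    return {
        "records": len(cleaned),
        "unique_inchikeys": len(unique_keys),
        "unique_first_blocks": len(first_blocks),
        "first_blocks_with_several_keys": sorted(
            block for block, n in first_blocks.items() if n > 1
        ),
    }


if __name__ == "__main__":
    # Toy input: the keys quoted in this thread plus one unrelated key.
    example = [
        "CWVRJTMFETXNAD-JUHZACGLSA-N",
        "CWVRJTMFETXNAD-JUHZACGLSA-N",
        "CWVRJTMFETXNAD-UHFFFAOYSA-N",
        "CWVRJTMFETXNAD-NCZKRNLISA-N",
        "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    ]
    print(summarize_keys(example))
```

First blocks that appear with several full keys are the interesting ones, since they point to one skeleton being recorded with conflicting stereochemistry.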

meier-rene commented 5 years ago

I have fixed the inconsistencies for this particular compound. Numbers for all inconsistencies will follow.

meier-rene commented 5 years ago

Here are some numbers: among the records with inconsistencies we have 3351 unique InChIKeys and 2964 unique InChIKey first blocks.

And here is a listing of inconsistencies by database:

202 BS
3 Boise_State_Univ
174 Kyoto_Univ
225 MPI_for_Chemical_Ecology
75 Univ_Connecticut
349 Eawag
239 PFOS_research_group
199 Fiocruz
193 Fukuyama_Univ
41 GL_Sciences_Inc
14 JEOL_Ltd
2021 Fac_Eng_Univ_Tokyo
167 NAIST
1039 Keio_Univ
62 Kazusa
5 Osaka_MCHRI
31 MSSJ
70 Metabolon
35 NaToxAq
4 RIKEN_NPDepo
459 Nihon_Univ
147 Osaka_Univ
179 IPB_Halle
742 RIKEN
6 CASMI_2012
12 Tottori_Univ
171 Univ_Toyama
26 UOEH
3 UPAO
2312 Chubu_Univ
793 Waters

The main source of inconsistency is the use of SMILES without stereochemistry.
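A crude way to spot this pattern without a cheminformatics toolkit is to look at the strings themselves. In SMILES, `/` and `\` mark double bond configuration and `@` marks tetrahedral centers. In an InChI, the `/b` layer holds double bond stereo and the `/t` layer holds tetrahedral stereo. The sketch below only compares whether each kind of stereo is present on both sides. It is not the validator code from #158, the function names are made up, and the but-2-ene strings are illustration values, not taken from a MassBank record.

```python
from typing import List, Set


def inchi_stereo_layers(inchi: str) -> Set[str]:
    """Return the stereo layer prefixes ('b', 't') present in an InChI."""
    # Skip the "InChI=1S" prefix and the formula layer.
    layers = inchi.strip().split("/")[2:]
    return {seg[0] for seg in layers if seg and seg[0] in "bt"}


def stereo_disagreements(smiles: str, inchi: str) -> List[str]:
    """List the kinds of stereo that only one of the two strings defines."""
    layers = inchi_stereo_layers(inchi)
    smiles_double_bond = "/" in smiles or "\\" in smiles
    smiles_tetrahedral = "@" in smiles
    problems = []
    if ("b" in layers) != smiles_double_bond:
        problems.append("double-bond")
    if ("t" in layers) != smiles_tetrahedral:
        problems.append("tetrahedral")
    return problems


if __name__ == "__main__":
    # Illustration only: (E)-but-2-ene InChI against SMILES with and
    # without bond direction markers.
    e_butene = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"
    print(stereo_disagreements("CC=CC", e_butene))
    print(stereo_disagreements("C/C=C/C", e_butene))
```

This only detects that stereo information is missing on one side. It does not check that two fully specified strings describe the same configuration; that would need keys recalculated from both fields.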

meier-rene commented 5 years ago

And one last fact: we don't have any inconsistencies in the connection table. Only the stereochemical information differs between SMILES and InChI.

schymane commented 5 years ago

Now I'm confused. Can we get a table of MassBank Accession ID, CH$NAME, SMILES, InChI and InChIKey fields in the records, as well as the corresponding InChIKeys calculated from the SMILES and from the InChI fields (as well as the key in the records)? I do not quite understand how this happens e.g. for the Eawag records where the InChIs should be systematically calculated from the SMILES within RMassBank ...
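As a starting point for such a table, the fields could be pulled out of the record files roughly as follows. This is a sketch under assumptions: it reads the `ACCESSION`, `CH$NAME`, `CH$SMILES`, `CH$IUPAC` and `CH$LINK: INCHIKEY` lines of a record, the function names are made up, and the toy record (ethanol with a hypothetical accession) is not a real MassBank entry. The two recalculated InChIKey columns are left out because they need a cheminformatics toolkit.

```python
import csv
import io

COLUMNS = ["ACCESSION", "CH$NAME", "CH$SMILES", "CH$IUPAC", "INCHIKEY"]


def parse_record(text):
    """Extract the identifier fields from one record file's text."""
    row = {col: "" for col in COLUMNS}
    names = []
    for line in text.splitlines():
        tag, sep, value = line.partition(": ")
        if not sep:
            continue
        value = value.strip()
        if tag == "CH$NAME":
            names.append(value)  # a record may list several names
        elif tag == "CH$LINK":
            database, _, identifier = value.partition(" ")
            if database == "INCHIKEY":
                row["INCHIKEY"] = identifier.strip()
        elif tag in ("ACCESSION", "CH$SMILES", "CH$IUPAC"):
            row[tag] = value
    row["CH$NAME"] = "; ".join(names)
    return row


def to_csv(rows):
    """Render parsed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Toy record with a hypothetical accession, for illustration only.
    toy_record = "\n".join([
        "ACCESSION: XX000001",
        "CH$NAME: Ethanol",
        "CH$NAME: Ethyl alcohol",
        "CH$SMILES: CCO",
        "CH$IUPAC: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "CH$LINK: INCHIKEY LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    ])
    print(to_csv([parse_record(toy_record)]))
```

With the keys calculated from the SMILES and from the InChI added as two further columns, a mismatch against the `INCHIKEY` column would show which of the fields a record's key was derived from.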