MassBank / MassBank-data

Official repository of open data MassBank records

Add DTXSIDs to all MassBank records with InChIKey match #66

Closed schymane closed 5 years ago

schymane commented 5 years ago

@meier-rene @Treutler The EPA have set up a basic service that should allow retrieval of DTXSIDs by InChIKey. Can you look into implementing this on the database end, to add DTXSIDs to all records with matching entries for now? I will post a separate issue to get this into RMassBank and linked up in MassBank-web. It's already in our record format as CH$LINK: COMPTOX DTXSID50274017 (https://github.com/MassBank/MassBank-web/blob/master/Documentation/MassBankRecordFormat.md).

https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve.json?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve.xml?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
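
A minimal sketch of how the lookup URL could be built from an InChIKey, and how a returned DTXSID could be turned into a record line. The function names `resolve_url` and `comptox_link_line` are hypothetical. The response format of the service is not described here, so the sketch stops at building the URL and formatting the CH$LINK line and sends no request.

```python
import re
from urllib.parse import urlencode

# Base URL taken from the links above; the service may change or be retired.
RESOLVE_BASE = "https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve"

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def resolve_url(inchikey, fmt="json"):
    """Build the resolver URL for one InChIKey (hypothetical helper)."""
    if not INCHIKEY_RE.match(inchikey):
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    return f"{RESOLVE_BASE}.{fmt}?{urlencode({'identifier': inchikey})}"


def comptox_link_line(dtxsid):
    """Format a DTXSID as a MassBank CH$LINK line (hypothetical helper)."""
    if not re.fullmatch(r"DTXSID\d+", dtxsid):
        raise ValueError(f"not a DTXSID: {dtxsid!r}")
    return f"CH$LINK: COMPTOX {dtxsid}"
```

Validating the InChIKey before building the URL keeps malformed keys from ever reaching the service.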

Any feedback re service to @ChemConnector

Thanks!

meier-rene commented 5 years ago

I will take care of this.

And I would like to give a short update about a related topic: I curated all records with any structural information available so that they contain proper InChI and InChI-Keys. There are just 900 records left which don't have structural information, just chemical names.

schymane commented 5 years ago

Great! Can you post a list somewhere of the 900, with basic details like name, accession etc? Some of them are "tentative", but I am not sure we have that many ... I would be curious ... Thanks!

meier-rene commented 5 years ago

noStructure.txt The list of all records without a Structure given.

schymane commented 5 years ago

Oh interesting ... so the EawagAdditional are ones that almost certainly don't have a structure because they are tentative records ... but I see a lot from BS, Fac_Eng_Univ_Tokyo (major culprit) and even IPB Halle! @sneumann should be able to comment about the latter ... do you see a systematic issue (one critical identifier missing that we could fill in with other information available) with BS and Fac_Eng_Univ_Tokyo?

meier-rene commented 5 years ago

There are roughly 60 records with other database identifiers, like CAS, which I could use to retrieve proper chemical information. The remaining records have only chemical names. These need manual lookup, which might be unsuccessful in some cases. This will take some time...

Different topic: Could someone please explain the difference between DTXCID and DTXSID? The code for adding the COMPTOX IDs is nearly finished.

schymane commented 5 years ago

C = compound/chemical and S = substance. The "C" entries are the unique chemicals (roughly the "MS-ready" forms, put simply) and the "S" entries are the official database entries. Effectively we should always use and link via the substance identifier, the DTXSID.

Check out infoboxes here (@ChemConnector note inconsistencies in the DTXCID!) https://comptox.epa.gov/dashboard/dsstoxdb/batch_search

meier-rene commented 5 years ago

Sorry, I didn't understand this concept.

On PubChem we have the SID, which is something like the label on a bottle of chemicals and could potentially be a mixture, and we have the CID, which is a unique compound represented by exactly one formula (like you would draw on paper).

Hence more questions: Does this mean that there might be several DTXSIDs for one InChI-Key? Is there a 1 to n relation between DTXCID and DTXSID like in PubChem?

schymane commented 5 years ago

As far as I'm aware it's one DTXSID per InChIKey. The service should return us one DTXSID for one InChIKey request, and this is what @ChemConnector asked us to do: use InChIKey to DTXSID to add these identifiers to MassBank (therefore I'm assuming this is the most robust way in his opinion, and from my experience I'd agree).

One DTXSID may have multiple DTXCIDs associated with it. It's a bit different to the PubChem construct. imho we should not yet try mapping on DTXCIDs, as they don't have the full functionality associated with them that the DTXSIDs have; until recently they were hidden entirely.

Some examples:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=nicotine
https://comptox.epa.gov/dashboard/dsstoxdb/ms_ready_mixture?cid=28128

https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID10858175
This one has two DTXCIDs associated with it.
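
The "one DTXSID per InChIKey" expectation could be checked on whatever the resolver returns before anything is written to records. A minimal sketch, assuming the results are available as (InChIKey, DTXSID) pairs; the function name `ambiguous_inchikeys` is hypothetical:

```python
from collections import defaultdict


def ambiguous_inchikeys(pairs):
    """Return {inchikey: sorted DTXSIDs} for keys mapped to more than one DTXSID.

    `pairs` is an iterable of (inchikey, dtxsid) tuples. An empty result
    means the mapping is single-valued, as expected in this thread.
    """
    seen = defaultdict(set)
    for inchikey, dtxsid in pairs:
        seen[inchikey].add(dtxsid)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}
```

Any InChIKey reported here would need a manual look before a link is added.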

meier-rene commented 5 years ago

I have created a program which can add these identifiers with the help of the InChI-Key to DTXSID resolver, and I have processed all records. We now have 39962 outlinks in place. This program can be executed on all new records and also on a regular basis on the existing records. I think this one can be closed.
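
The program itself is not shown in this thread. As a rough sketch of the record-editing step only, assuming the new line goes after the last existing CH$LINK line (or after the last CH$ line if there is none), something like the following would do it. The function name `add_comptox_link` and the placement rule are assumptions, not the actual implementation:

```python
def add_comptox_link(record_text, dtxsid):
    """Return record_text with a CH$LINK: COMPTOX line for dtxsid.

    Assumption: the new line is placed after the last CH$LINK line, or
    after the last CH$ line when the record has no CH$LINK yet.
    """
    new_line = f"CH$LINK: COMPTOX {dtxsid}"
    lines = record_text.splitlines()

    # Records that already carry a COMPTOX link are returned unchanged,
    # so the function can be rerun on the whole collection.
    if any(line.startswith("CH$LINK: COMPTOX ") for line in lines):
        return record_text

    last_link = last_ch = None
    for i, line in enumerate(lines):
        if line.startswith("CH$LINK:"):
            last_link = i
        if line.startswith("CH$"):
            last_ch = i

    anchor = last_link if last_link is not None else last_ch
    if anchor is None:
        raise ValueError("record has no CH$ section")

    lines.insert(anchor + 1, new_line)
    return "\n".join(lines) + "\n"
```

Skipping records that already have a COMPTOX link is what makes a regular rerun on existing records safe.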

meier-rene commented 5 years ago

Reopening until #68 is solved.

schymane commented 5 years ago

@ChemConnector has added additional services that might be of interest. NOTE that these actor-based web services will be switched off next year and replaced with CompTox ones once they are up and running.

Data Source: dsstox v02

https://ni.epa.gov/actorws/dsstox/v02/msready?identifier=80-05-7
https://ni.epa.gov/actorws/dsstox/v02/msready.json?identifier=80-05-7
https://ni.epa.gov/actorws/dsstox/v02/msready.xml?identifier=80-05-7

https://ni.epa.gov/actorws/dsstox/v02/msready?identifier=DTXCID60513
https://ni.epa.gov/actorws/dsstox/v02/msready.json?identifier=DTXCID60513
https://ni.epa.gov/actorws/dsstox/v02/msready.xml?identifier=DTXCID60513

https://ni.epa.gov/actorws/dsstox/v02/msready?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N
https://ni.epa.gov/actorws/dsstox/v02/msready.json?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N
https://ni.epa.gov/actorws/dsstox/v02/msready.xml?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N

https://ni.epa.gov/actorws/dsstox/v02/qsar?identifier=80-05-7
https://ni.epa.gov/actorws/dsstox/v02/qsar.json?identifier=80-05-7
https://ni.epa.gov/actorws/dsstox/v02/qsar.xml?identifier=80-05-7

https://ni.epa.gov/actorws/dsstox/v02/qsar?identifier=DTXCID60513
https://ni.epa.gov/actorws/dsstox/v02/qsar.json?identifier=DTXCID60513
https://ni.epa.gov/actorws/dsstox/v02/qsar.xml?identifier=DTXCID60513

https://ni.epa.gov/actorws/dsstox/v02/qsar?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N
https://ni.epa.gov/actorws/dsstox/v02/qsar.json?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N
https://ni.epa.gov/actorws/dsstox/v02/qsar.xml?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N

The hyperlinks to MS Ready and QSAR Ready forms are added to the resolver service.

schymane commented 5 years ago

Note that if the cause of the problem is that the web services also return entries up to Level 6, then, if the "curation level" were included in the data retrieved, we could proactively fix our end by only including DTXSIDs if the level is 5 or lower. I can't see that this information is included yet though, just following the links above, although I thought this was part of the plan, @ChemConnector?
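
If the curation level does become part of the response, the filter on our end could be as small as the sketch below. The key name `curationLevel` and the function name `keep_dtxsid` are hypothetical, since the services linked above do not expose the level yet. Skipping entries without a level is a conservative choice, not something agreed in this thread.

```python
MAX_CURATION_LEVEL = 5  # threshold proposed above: keep level 5 or lower


def keep_dtxsid(entry, max_level=MAX_CURATION_LEVEL):
    """Decide whether a resolver result should be written to a record.

    `entry` is a dict with the hypothetical keys "dtxsid" and
    "curationLevel"; the real service may name them differently.
    """
    level = entry.get("curationLevel")
    if level is None:
        # No level reported: skip the link rather than risk a Level 6 entry.
        return False
    return int(level) <= max_level
```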