biothings / mychem.info

MyChem.info: A BioThings API for chemical/drug annotations
http://mychem.info
Apache License 2.0

add drug description to mychem.info from PubChem #159

Closed: newgene closed this issue 10 months ago

newgene commented 1 year ago

PubChem appears to provide informative descriptions on its website:

https://pubchem.ncbi.nlm.nih.gov/compound/16051951

Description

Clindamycin hydrochloride is a S-glycosyl compound. (Source: ChEBI)

Clindamycin Hydrochloride is the hydrochloride salt form of clindamycin, a semi-synthetic, chlorinated broad spectrum antibiotic produced by chemical modification of lincomycin. Clindamycin hydrochloride is used as a solid in capsules. (Source: NCI Thesaurus (NCIt))

An antibacterial agent that is a semisynthetic analog of LINCOMYCIN. (Source: Medical Subject Headings (MeSH))

Let's see if we can include these descriptions (or one of them) from PubChem or other sources in MyChem.info.

https://mychem.info/v1/query?q=16051951

colleenXu commented 1 year ago

perhaps related to the annotation service effort? https://github.com/biothings/biothings_explorer/issues/344#issuecomment-1583034829

erikyao commented 1 year ago

Description Text

The description text shown in the Title and Summary section of a Compound Summary page is actually the description of the mapped entities.

E.g. on the CID:16051951 page, the description "Clindamycin hydrochloride is a S-glycosyl compound." actually comes from its mapped CHEBI:176915 page.

The same pattern applies to mapped NCIt and MeSH.

Mapping

Therefore if we can find all the CID-ChEBI, CID-NCIt, and CID-MeSH mappings, we can fetch all the description texts.

Currently our plugin uses the XML files from pubchem/Compound/CURRENT-Full/XML folder; however, those XML files do not contain the mappings.

In the pubchem/Compound/Extras folder, there SHOULD be a SID-Map.gz file. According to the README:

This is a listing of all (live) SIDs with their source names and registry identifiers, and the standardized CID if present. It is a gzipped text file where each line contains at least three columns: SID, tab, source name, tab, registry identifier; then a fourth column of tab, CID if there is a standardized CID for the given SID.

However, this file is missing in the folder. Maybe we can ask NIH to provide it.
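If the file does become available, the column layout quoted from the README could be read with something like the sketch below. This is only an illustration under assumptions: the names `SidRecord`, `parse_sid_map_line`, `iter_sid_map`, and `cids_by_source` are hypothetical, and the exact source-name strings used in the file (e.g. for ChEBI or NCIt) would need to be checked against the real data.

```python
import gzip
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Set, Tuple


class SidRecord(NamedTuple):
    """One SID-Map line: SID, source name, registry identifier, optional CID."""
    sid: int
    source: str
    registry_id: str
    cid: Optional[int]


def parse_sid_map_line(line: str) -> SidRecord:
    """Parse one tab-separated line with three or four columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(cols)}")
    # The fourth column (CID) is only present when a standardized CID exists.
    cid = int(cols[3]) if len(cols) > 3 and cols[3] else None
    return SidRecord(int(cols[0]), cols[1], cols[2], cid)


def iter_sid_map(path: str) -> Iterator[SidRecord]:
    """Stream records from a gzipped SID-Map file without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_sid_map_line(line)


def cids_by_source(
    records: Iterable[SidRecord], wanted_sources: Set[str]
) -> Dict[int, List[Tuple[str, str]]]:
    """Group (source, registry_id) pairs by CID for the wanted sources."""
    mapping: Dict[int, List[Tuple[str, str]]] = {}
    for rec in records:
        if rec.cid is not None and rec.source in wanted_sources:
            mapping.setdefault(rec.cid, []).append((rec.source, rec.registry_id))
    return mapping
```

Streaming line by line keeps memory use flat, which matters because the file lists every live SID.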

Also note that the CID-MeSH file does not provide MeSH IDs.

Source XML data for CID:16051951

Just FYI:

Computed properties recovered from the PubChem Compound XML record for CID 16051951 (release 2021.05.07). The XML markup was lost in extraction, and the atom, bond, and coordinate tables are omitted because they cannot be recovered without it.

```text
CID: 16051951
Software: Cactvs 3.4.8.18 (Xemistry GmbH); Lexichem TK 2.7.0 and OEChem 2.3.0 (OpenEye Scientific Software); InChI 1.0.6 (iupac.org); PubChem 2.1 (ncbi.nlm.nih.gov)

Compound Canonicalized: 1
Compound Complexity: 502
Count Hydrogen Bond Acceptor: 7
Count Hydrogen Bond Donor: 5
Count Rotatable Bond: 7
Fingerprint SubStructure Keys: 00000371F07B3800460000000000000000000000000160000000240000000000000000000000001E06100800000D3FE5C046820003C00608080001101000000000000010000081880200035012218020574000071600930001F8D9A38E00000000000000000000000000000000000000000000
IUPAC Name Allowed: (2S,4R)-N-[(1S,2S)-2-chloro-1-[(2R,3R,4S,5R,6R)-3,4,5-trihydroxy-6-methylsulfanyl-tetrahydropyran-2-yl]propyl]-1-methyl-4-propyl-pyrrolidine-2-carboxamide;hydrochloride
IUPAC Name CAS-like Style: (2S,4R)-N-[(1S,2S)-2-chloro-1-[(2R,3R,4S,5R,6R)-3,4,5-trihydroxy-6-(methylthio)-2-oxanyl]propyl]-1-methyl-4-propyl-2-pyrrolidinecarboxamide;hydrochloride
IUPAC Name Markup: (2<I>S</I>,4<I>R</I>)-<I>N</I>-[(1<I>S</I>,2<I>S</I>)-2-chloro-1-[(2<I>R</I>,3<I>R</I>,4<I>S</I>,5<I>R</I>,6<I>R</I>)-3,4,5-trihydroxy-6-methylsulfanyloxan-2-yl]propyl]-1-methyl-4-propylpyrrolidine-2-carboxamide;hydrochloride
IUPAC Name Preferred: (2S,4R)-N-[(1S,2S)-2-chloro-1-[(2R,3R,4S,5R,6R)-3,4,5-trihydroxy-6-methylsulfanyloxan-2-yl]propyl]-1-methyl-4-propylpyrrolidine-2-carboxamide;hydrochloride
IUPAC Name Systematic: (2S,4R)-N-[(1S,2S)-2-chloranyl-1-[(2R,3R,4S,5R,6R)-6-methylsulfanyl-3,4,5-tris(oxidanyl)oxan-2-yl]propyl]-1-methyl-4-propyl-pyrrolidine-2-carboxamide;hydrochloride
IUPAC Name Traditional: (2S,4R)-N-[(1S,2S)-2-chloro-1-[(2R,3R,4S,5R,6R)-3,4,5-trihydroxy-6-(methylthio)tetrahydropyran-2-yl]propyl]-1-methyl-4-propyl-pyrrolidine-2-carboxamide;hydrochloride
InChI Standard: InChI=1S/C18H33ClN2O5S.ClH/c1-5-6-10-7-11(21(3)8-10)17(25)20-12(9(2)19)16-14(23)13(22)15(24)18(26-16)27-4;/h9-16,18,22-24H,5-8H2,1-4H3,(H,20,25);1H/t9-,10+,11-,12+,13-,14+,15+,16+,18+;/m0./s1
InChIKey Standard: AUODDLQVRAJAJM-XJQDNNTCSA-N
Mass Exact: 460.1565488
Molecular Formula: C18H34Cl2N2O5S
Molecular Weight: 461.4
SMILES Canonical: CCCC1CC(N(C1)C)C(=O)NC(C2C(C(C(C(O2)SC)O)O)O)C(C)Cl.Cl
SMILES Isomeric: CCC[C@@H]1C[C@H](N(C1)C)C(=O)N[C@@H]([C@@H]2[C@@H]([C@@H]([C@H]([C@H](O2)SC)O)O)O)[C@H](C)Cl.Cl
Topological Polar Surface Area: 128
Weight MonoIsotopic: 460.1565488
Counts (element labels lost): 28 9 9 0 0 0 0 0 2 -1
```

This section can be pulled with the following command:

zcat Compound_016000001_016500000.xml.gz | sed -n '49819264,49820357p;49820358q'
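
The same line-range extraction can be done from Python without shelling out. The sketch below is an assumed equivalent of the `zcat | sed` pipeline above, not part of the plugin; the helper name `extract_lines` is hypothetical.

```python
import gzip
from itertools import islice
from typing import List


def extract_lines(path: str, first: int, last: int) -> List[str]:
    """Return lines first..last (1-indexed, inclusive) of a gzipped text file.

    islice stops reading once `last` is reached, which mirrors the
    early-quit behavior of the `q` command in the sed expression.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(islice(handle, first - 1, last))
```

For the record above, the call would be `extract_lines("Compound_016000001_016500000.xml.gz", 49819264, 49820357)`.
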
newgene commented 12 months ago

We might not need any extra ID mappings after all, since we already have the mappings to CHEBI, NCIT and UMLS.

For the particular case above:

https://mychem.info/v1/chem/AUODDLQVRAJAJM-XJQDNNTCSA-N?fields=unii.ncit,chebi.id,umls.mesh

We don't have a MeSH ID for this drug/chemical, but some objects do have one, such as "Hydromorphone":

https://mychem.info/v1/chem/WVLOADHCBXTIJK-YNHQPCIGSA-N?fields=unii.ncit,chebi.id,umls.mesh

Then we can get their descriptions from our CHEBI and NCIT APIs:

https://biothings.ncats.io/chebi/chemical/CHEBI:176915?fields=def (I checked: this entry has no def field in the latest CHEBI obo file)

https://biothings.ncats.io/chebi/chemical/CHEBI:5790?fields=def (this one does have the def field)

https://biothings.ncats.io/ncit/node/NCIT:C47977?fields=def

https://biothings.ncats.io/ncit/node/NCIT:C62034?fields=def

I'm not sure whether we can easily get a MeSH description from the MeSH ID, but we can go with CHEBI and NCIT first.
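
The lookups described above could be chained roughly as follows. This is a minimal sketch, not the Annotator Service implementation: the function names are hypothetical, the URL patterns are copied from the links in this comment, and it assumes each API returns a JSON object whose `def` field (when present) holds the description text.

```python
import json
from typing import Callable, Iterable, Optional
from urllib.parse import quote
from urllib.request import urlopen

MYCHEM_CHEM = "https://mychem.info/v1/chem/"
CHEBI_API = "https://biothings.ncats.io/chebi/chemical/"
NCIT_API = "https://biothings.ncats.io/ncit/node/"


def mychem_ids_url(inchikey: str) -> str:
    """URL that asks MyChem.info for the mapped NCIT, CHEBI and MeSH IDs."""
    return f"{MYCHEM_CHEM}{quote(inchikey)}?fields=unii.ncit,chebi.id,umls.mesh"


def chebi_def_url(chebi_id: str) -> str:
    """URL for the def field of a CHEBI entry, e.g. 'CHEBI:5790'."""
    return f"{CHEBI_API}{quote(chebi_id, safe=':')}?fields=def"


def ncit_def_url(ncit_id: str) -> str:
    """URL for the def field of an NCIT node, e.g. 'NCIT:C47977'."""
    return f"{NCIT_API}{quote(ncit_id, safe=':')}?fields=def"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def first_definition(
    urls: Iterable[str], fetch: Callable[[str], dict] = fetch_json
) -> Optional[str]:
    """Try each URL in order and return the first non-empty 'def' value.

    Assumption: a missing description shows up as an absent or empty
    'def' key, as observed for CHEBI:176915.
    """
    for url in urls:
        try:
            doc = fetch(url)
        except OSError:  # urllib's URLError/HTTPError derive from OSError
            continue
        definition = doc.get("def")
        if definition:
            return definition
    return None
```

Trying CHEBI first and falling back to NCIT covers the case where the CHEBI entry has no `def` field; the `fetch` parameter is injectable so the fallback logic can be tested without network access.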

newgene commented 12 months ago

This MyChem.info query lists all hits that contain all three IDs:

https://mychem.info/v1/query?q=_exists_:umls.mesh%20AND%20_exists_:chebi.id%20AND%20_exists_:unii.ncit&fields=umls,chebi,unii
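
For reference, a query URL of this shape can be assembled programmatically. The helper below is a hypothetical sketch that reproduces the URL above from lists of field names; it only builds the string and does not call the API.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

MYCHEM_QUERY = "https://mychem.info/v1/query"


def exists_query_url(required: Sequence[str], returned: Sequence[str]) -> str:
    """Build a MyChem.info query for documents where every field in `required` exists."""
    q = " AND ".join(f"_exists_:{field}" for field in required)
    params = {"q": q, "fields": ",".join(returned)}
    # quote_via=quote encodes spaces as %20; ':' and ',' are left readable.
    return f"{MYCHEM_QUERY}?{urlencode(params, quote_via=quote, safe=':,')}"
```

Calling `exists_query_url(["umls.mesh", "chebi.id", "unii.ncit"], ["umls", "chebi", "unii"])` yields the query shown above.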

newgene commented 12 months ago

As the next step, we probably don't need to do anything yet on the MyChem.info side. We can implement the logic on the Translator Annotator Service side, using the existing mapped CHEBI and NCIT IDs to retrieve their descriptions. We will then evaluate how good the descriptions are, whether we need to improve the mapping at MyChem.info (e.g. using PubChem's extra mapping file), and how we can get the MESH description.

Later, it will still be good to include these descriptions directly in MyChem.info.

newgene commented 10 months ago

Drug/Chemical description:

Here are a few examples from the Translator Annotator service, which uses multiple BioThings APIs, including MyChem.info, to annotate chemicals/drugs:

Closing this issue now, since we don't need to include additional chemical/drug description fields for now.