ebi-chebi / ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
https://www.ebi.ac.uk/chebi
Creative Commons Attribution 4.0 International

Possible synonym formatting error for CHEBI:195584? #4428

Closed: christabone closed this issue 1 year ago

christabone commented 1 year ago

Hi ChEBI folks,

One of our ETL pipelines at alliancegenome.org recently started failing, and we traced the failure to an unusual synonym format in the Sept 1st release of the ChEBI ontology. More specifically, this term:

[Term]
id: CHEBI:195584
name: Arginylthreonine
subset: 2_STAR
synonym: "\"(2S,3R)-2-[[(2S)-2-amino-5-(diaminomethylideneamino)pentanoyl]amino]-3-hydroxybutanoic acid\"" EXACT IUPAC_NAME [SUBMITTER]
property_value: http://purl.obolibrary.org/obo/chebi/charge "0" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/monoisotopicmass "275.15935" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/mass "275.309" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/formula "C10H21N5O4" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/inchi "InChI=1S/C10H21N5O4/c1-5(16)7(9(18)19)15-8(17)6(11)3-2-4-14-10(12)13/h5-7,16H,2-4,11H2,1H3,(H,15,17)(H,18,19)(H4,12,13,14)/t5-,6+,7+/m1/s1" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/smiles "O[C@@H]([C@H](NC(=O)[C@@H](N)CCCN=C(N)N)C(O)=O)C" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/inchikey "XNSKSTRGQIPTSE-VQVTYTSYSA-N" xsd:string
xref: Chemspider:128430686 
xref: HMDB:HMDB0028719 
is_a: CHEBI:16670

The synonym text starts with a quote-backslash-quote sequence ("\"), which we think might be an error. Would anyone at ChEBI have a moment to check whether this is the case?

If it's a legitimate synonym we will work on our end to fix our parser.
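
For readers hitting the same failure, here is a minimal sketch of how a parser might read such a line, assuming OBO-style escaping in which a backslash inside the quoted text escapes the next character. The names `parse_synonym` and `unescape` are hypothetical and not part of any ChEBI or Alliance tooling. Only the synonym text and scope are extracted.

```python
import re

# A double-quoted string in which a backslash escapes the following character.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')
ESCAPE = re.compile(r'\\(.)')

# Escapes that stand for whitespace; any other escaped character stands for itself.
_SPECIAL = {"n": "\n", "t": "\t", "W": " "}


def unescape(raw):
    """Turn backslash escapes in the quoted text back into plain characters."""
    return ESCAPE.sub(lambda m: _SPECIAL.get(m.group(1), m.group(1)), raw)


def parse_synonym(line):
    """Return (text, scope) for a single 'synonym:' line (hypothetical helper)."""
    tag, _, value = line.partition(": ")
    if tag != "synonym":
        raise ValueError("not a synonym line")
    match = QUOTED.match(value)
    if match is None:
        raise ValueError("missing quoted synonym text")
    text = unescape(match.group(1))
    trailing = value[match.end():].split()
    scope = trailing[0] if trailing else None
    return text, scope


line = r'synonym: "\"pyridine-2,3-diamine\"" EXACT IUPAC_NAME [SUBMITTER]'
text, scope = parse_synonym(line)
print(repr(text), scope)
```

Read this way, the line is well-formed: the synonym text simply begins and ends with a literal quotation mark, so whether it is an error is a question about the data rather than the syntax.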

Thanks for your time!

christabone commented 1 year ago

It looks like there are several similar instances of synonyms in the file. Is this a recent change? Curious as to why we haven't had parsing issues with these entries in the past...

grep 'synonym: "\\"' chebi.obo 
synonym: "\"(2S,3R)-2-[[(2S)-2-amino-5-(diaminomethylideneamino)pentanoyl]amino]-3-hydroxybutanoic acid\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"(2S)-2,6-diaminononanedioic acid\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"dimethyl (2R)-pyrrolidine-1,2-dicarboxylate\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"phosphono (2S)-2,6-diaminohexanoate\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"(5R)-5-amino-4,8-dioxo-1,3,2-dioxazocane-2-carboxamide\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"pyridine-2,3-diamine\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"(3S)-3,17-dihydroxy-3-[(trimethylazaniumyl)methyl]heptadeca-4,6-dienoate\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"13-(3,4-dimethyl-5-pentyluran-2-yl)tridecanoic acid\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"(9R,10S)-9,10,16-trihydroxyhexadecanoic acid\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"(Z)-2-cyano-3-(3,4-dihydroxy-5-nitrophenyl)-N,N-diethylprop-2-enamide\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"[(2R,3R,4S,5S,6R)-2-[(1R,2Z,3S,4R,5S)-2-(cyanomethylidene)-3-hydroxy-4,5-dimethoxycyclohexyl]oxy-4,5-dihydroxy-6-(hydroxymethyl)oxan-3-yl] (Z)-3-(4-hydroxy-3-methoxyphenyl)prop-2-enoate\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"N-[(E,2S,3R)-1,3-dihydroxyoctadec-4-en-2-yl]ormamide\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"(4aS,5aS,12aR)-7-chloro-4-(dimethylamino)-1,6,10,11,12a-pentahydroxy-3,12-dioxo-4a,5,5a,6-tetrahydro-4H-tetracene-2-carboxamide\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"2-(5-hydroxy-4a-methyl-8-methylidene-1,2,3,4,5,8a-hexahydronaphthalen-2-yl)prop-2-enoic acid\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"[2-[2-(3,4-dihydroxyphenyl)-5,7-dihydroxy-4-oxochromen-8-yl]-4,5-dihydroxy-6-(hydroxymethyl)oxan-3-yl] acetate\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"5-(2-chloroethyl)-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]pyrimidine-2,4-dione\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"4-amino-5-chloro-1-[(2R,4S,5R)-4-luoro-5-(hydroxymethyl)oxolan-2-yl]pyrimidin-2-one\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"2-(ethylamino)-9-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-1H-purin-6-one\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"(2R,3S,5R)-5-(6-aminopurin-9-yl)-2-methyloxolan-3-ol\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"2-amino-7-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-3H-pyrrolo[2,3-d]pyrimidin-4-one\"" EXACT IUPAC_NAME [SUBMITTER]
synonym: "\"4-amino-1-[(2R,3R,4S,5R)-5-azido-3,4-dihydroxy-5-(hydroxymethyl)oxolan-2-yl]pyrimidin-2-one\"" EXACT IUPAC_NAME [SUBMITTER]
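
The grep output above shows the affected lines but not the terms they belong to. As a rough sketch, the same search can be done in Python while tracking the most recent `id:` line, assuming each stanza lists its `id:` before its synonyms, as in the term quoted earlier. The function name `find_quoted_synonyms` is hypothetical.

```python
def find_quoted_synonyms(lines):
    """Yield (term_id, line) for synonyms whose text starts with an escaped quote."""
    term_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "[Term]":
            # New stanza: forget the previous term's id.
            term_id = None
        elif line.startswith("id: "):
            term_id = line[4:].strip()
        elif line.startswith('synonym: "\\"'):
            # Same pattern as the grep: opening quote followed by backslash-quote.
            yield term_id, line


sample = [
    "[Term]\n",
    "id: CHEBI:195584\n",
    "name: Arginylthreonine\n",
    'synonym: "\\"example name\\"" EXACT IUPAC_NAME [SUBMITTER]\n',
]
for term_id, line in find_quoted_synonyms(sample):
    print(term_id, line)
```

To run it against the real file, pass an open file handle instead of the sample list, for example `find_quoted_synonyms(open("chebi.obo", encoding="utf-8"))`.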

amalik01 commented 1 year ago

All of these entries were deposited into ChEBI by the MetaboLights database. Something must have gone wrong when they submitted these entries. The quotation marks ("") should not be present in any of the synonyms. I will try to fix these entries so that the issue does not arise in next month's release.
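
Until a corrected release is available, a downstream consumer could work around the wrapped synonyms locally. The following is only a sketch of one possible workaround, not how ChEBI fixes the entries. It assumes the affected lines all have the shape shown in the grep output, and `unwrap_synonym_line` is a hypothetical name.

```python
import re

# Matches: synonym: "\"<text>\"" <rest of line>
# Group 1 keeps the opening of the line, group 2 the text, group 3 the closing
# quote plus scope, type, and xrefs. The two escaped quotes are dropped.
WRAPPED = re.compile(r'^(synonym: ")\\"(.*)\\"(" .*)$')


def unwrap_synonym_line(line):
    """Remove one pair of escaped quotes wrapping the synonym text, if present."""
    return WRAPPED.sub(r"\1\2\3", line)


bad = r'synonym: "\"pyridine-2,3-diamine\"" EXACT IUPAC_NAME [SUBMITTER]'
print(unwrap_synonym_line(bad))
```

Lines whose text does not start with an escaped quote, such as a synonym with a quote in the middle of the name, do not match the pattern and are returned unchanged.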

christabone commented 1 year ago

Thanks @amalik01 !

I also came across another instance of an escaped quote in a synonym. Not sure if this is intentional but just wanted to let you know. I'm not sure how many of these might exist:

[Term]
id: CHEBI:76100
name: 1-O-[6-O-(4-pyridylcarbamoyl)-alpha-D-galactopyranosyl]-N-hexacosanoylphytosphingosine
subset: 3_STAR
def: "A glycophytoceramide having a 6-O-(4-pyridylcarbamoyl)-alpha-D-galactopyranosyl residue at the O-1 position and an hexacosanoyl group attached to the nitrogen." []
synonym: "N-{(2S,3S,4R)-3,4-dihydroxy-1-[6-O-(pyridin-4-ylcarbamoyl)-alpha-D-galactopyranosyloxy]octadecan-2-yl}hexacosanamide" EXACT IUPAC_NAME [IUPAC]
synonym: "alpha-GalCer-6\"-(4-pyridyl)carbamate" RELATED [ChEBI]
synonym: "alpha-GalCer-6\"-(pyridin-4-yl)carbamate" RELATED [ChEBI]
synonym: "PyrC-alpha-GalCer" RELATED [ChEBI]
property_value: http://purl.obolibrary.org/obo/chebi/mass "978.43110" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/formula "C56H103N3O10" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/monoisotopicmass "977.76435" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/charge "0" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/inchikey "GONJMTFPPNECAU-VEDNRHISSA-N" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/smiles "CCCCCCCCCCCCCCCCCCCCCCCCCC(=O)N[C@@H](CO[C@H]1O[C@H](COC(=O)Nc2ccncc2)[C@H](O)[C@H](O)[C@H]1O)[C@H](O)[C@H](O)CCCCCCCCCCCCCC" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/inchi "InChI=1S/C56H103N3O10/c1-3-5-7-9-11-13-15-17-18-19-20-21-22-23-24-25-26-27-29-31-33-35-37-39-50(61)59-47(51(62)48(60)38-36-34-32-30-28-16-14-12-10-8-6-4-2)44-67-55-54(65)53(64)52(63)49(69-55)45-68-56(66)58-46-40-42-57-43-41-46/h40-43,47-49,51-55,60,62-65H,3-39,44-45H2,1-2H3,(H,59,61)(H,57,58,66)/t47-,48+,49+,51-,52-,53-,54+,55-/m0/s1" xsd:string
xref: PMID:23960235 {source="Europe PMC"}
xref: PDBeChem:1LA 
is_a: CHEBI:59389

amalik01 commented 1 year ago

Hi @christabone

I have now fixed most of these issues. The changes will be visible in next month's release.

Regarding the synonym (α-GalCer-6"-(4-pyridyl)carbamate) in CHEBI:76100: unfortunately, at the moment we do not have a special character for the double prime symbol (https://en.wikipedia.org/wiki/Prime_(symbol)) in ChEBI, and therefore have to use quotation marks to represent this symbol in the synonyms. This will need fixing at some point in the near future.
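
For consumers who want a typographic double prime in their own copy of the data, one possible approach is to replace a quotation mark that directly follows a digit (a locant such as 6") with U+2033 DOUBLE PRIME. This is a hypothetical heuristic, not ChEBI's behaviour. It operates on the already unescaped synonym text, and `quotes_to_double_prime` is an invented name.

```python
import re

DOUBLE_PRIME = "\u2033"  # Unicode DOUBLE PRIME


def quotes_to_double_prime(text):
    """Replace a quotation mark that immediately follows a digit with U+2033."""
    # Assumption: a quote right after a digit marks a double-primed locant.
    return re.sub(r'(?<=\d)"', DOUBLE_PRIME, text)


print(quotes_to_double_prime('alpha-GalCer-6"-(4-pyridyl)carbamate'))
```

The heuristic would misfire on a wrapped synonym whose name happens to end in a digit, so any wrapping quotes should be removed before it is applied.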