Massive description and unknown category for one node

dkoslicki commented 3 years ago

In the result https://arax.ncats.io/api/arax/v1.0/response/4963, there is a node:

"CHEBI:16108": {
          "attributes": [
            {
              "name": "description",
              "source": null,
              "type": "biolink:Unknown",  #<--- note the unknown type
              "url": null,
              "value": "Dihydroxyacetone phosphate, also known as 3-phosphate, dihydroxyacetone or 3-hydroxy-2-oxopropyl phosphate, belongs to the class of organic compounds known as monosaccharide phosphates. These are monosaccharides comprising a phosphated group linked to the carbohydrate unit. Dihydroxyacetone phosphate is soluble (in water) and a moderately acidic compound (based on its pKa). Dihydroxyacetone phosphate has been detected in multiple biofluids, such as saliva and blood. Within the cell, dihydroxyacetone phosphate is primarily located in the peroxisome, mitochondria and cytoplasm. Dihydroxyacetone phosphate exists in all living organisms, ranging from bacteria to humans. In humans, dihydroxyacetone phosphate is involved in cardiolipin biosynthesis CL(i-13:0/i-21:0/a-17:0/i-14:0) pathway, cardiolipin biosynthesis CL(i-14:0/a-13:0/i-19:0/a-25:0) pathway, cardiolipin biosynthesis CL(i-12:0/i-13:0/i-17:0/i-12:0) pathway, and cardiolipin biosynthesis CL(a-13:0/18:2(9Z,11Z)/i-20:0/i-22:0) pathway. Dihydroxyacetone phosphate is also involved in several metabolic disorders, some of which include de novo triacylglycerol biosynthesis TG(8:0/a-21:0/13:0) pathway, de novo triacylglycerol biosynthesis TG(16:0/20:5(5Z,8Z,11Z,14Z,17Z)/20:3(5Z,8Z,11Z)) pathway, de novo triacylglycerol biosynthesis TG(i-20:0/i-21:0/19:0) pathway, and de novo triacylglycerol biosynthesis TG(i-22:0/17:0/i-14:0) pathway. Outside of the human body, dihydroxyacetone phosphate can be found in a number of food items such as towel gourd, boysenberry, jujube, and prunus (cherry, plum). This makes dihydroxyacetone phosphate a potential biomarker for the consumption of these food products. 
Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis.; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis.; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis.; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis.; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis.; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis.; Dihydroxyacetone phosphate, also known as 3-phosphate, dihydroxyacetone or 3-hydroxy-2-oxopropyl phosphate, belongs to the class of organic compounds known as monosaccharide phosphates. These are monosaccharides comprising a phosphated group linked to the carbohydrate unit. Dihydroxyacetone phosphate is soluble (in water) and a moderately acidic compound (based on its pKa). Dihydroxyacetone phosphate has been detected in multiple biofluids, such as saliva and blood. Within the cell, dihydroxyacetone phosphate is primarily located in the peroxisome, mitochondria and cytoplasm. Dihydroxyacetone phosphate exists in all living organisms, ranging from bacteria to humans. In humans, dihydroxyacetone phosphate is involved in cardiolipin biosynthesis CL(i-13:0/i-21:0/a-17:0/i-14:0) pathway, cardiolipin biosynthesis CL(i-14:0/a-13:0/i-19:0/a-25:0) pathway, cardiolipin biosynthesis CL(i-12:0/i-13:0/i-17:0/i-12:0) pathway, and cardiolipin biosynthesis CL(a-13:0/18:2(9Z,11Z)/i-20:0/i-22: <snip>

I have snipped the value because the description goes on for pages and pages. Curiously, I can't find this node in KG2C (the Neo4j query is still running after ~10 min). Any idea what is going on with this node?

saramsey commented 3 years ago

FWIW, I am not seeing this issue in KG2.5.2:

saramsey commented 3 years ago

Hi @kvarforl can you please check KG2C?

Also if a single-node lookup query is taking more than 10 minutes, I wonder if that Neo4j server either needs to be restarted or is missing a node index?

kvarforl commented 3 years ago

Hmm, interesting. In KG2.5.2C, I found that CHEBI curie in the equivalent curies list of MESH:D004099:

match (n) where "CHEBI:16108" in n.equivalent_curies return n.name, n.id, n.equivalent_curies, n.category

n.name	n.id	n.equivalent_curies	n.category
"Dihydroxyacetone Phosphate"	"MESH:D004099"	["CHEBI:16108", "CHEMBL.COMPOUND:CHEMBL1161998", "DRUGBANK:DB04326", "HMDB:HMDB0001473", "KEGG:C00111", "MESH:D004099", "PathWhiz.Compound:1134", "PathWhiz.Compound:42631", "UMLS:C0012324"]	"biolink:ChemicalSubstance"

which has the same massive description. I'm not sure what KG2C does with descriptions of the nodes it clusters: does it aggregate them? That could explain the length.

As for the unknown type, I think that the result you've included is showing the type for the description attribute as opposed to the node type itself. I believe the node type is biolink:ChemicalSubstance :)

You just have to scroll past the ridiculously long description, or view the result in its JSON form, to see the type of the node. (In my screenshot, the highlighted field is the ridiculously long description, collapsed.)
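To make the distinction concrete, here is a minimal Python sketch that reads the node's own category separately from the types of its attributes. The attribute fields mirror the snippet at the top of this issue; the top-level "category" key and the helper name summarize_node are assumptions for illustration, not taken from the ARAX code.

```python
# Hypothetical, trimmed-down copy of the node from the response above.
# The "category" key on the node itself is an assumption for illustration.
node = {
    "category": "biolink:ChemicalSubstance",
    "name": "Dihydroxyacetone Phosphate",
    "attributes": [
        {
            "name": "description",
            "source": None,
            "type": "biolink:Unknown",  # type of the attribute, not of the node
            "url": None,
            "value": "Dihydroxyacetone phosphate, also known as ... <snip>",
        }
    ],
}


def summarize_node(node):
    """Return the node's category alongside the type of each attribute."""
    attribute_types = {
        attribute["name"]: attribute["type"]
        for attribute in node.get("attributes", [])
    }
    return {
        "node_category": node.get("category"),
        "attribute_types": attribute_types,
    }


print(summarize_node(node))
```

Here "biolink:Unknown" belongs to the description attribute, while the node itself carries its own category.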

dkoslicki commented 3 years ago

Ah, thanks @kvarforl ! Maybe @amykglen or @edeutsch can comment about how the descriptions are coalesced.

amykglen commented 3 years ago

thanks, @kvarforl! yes, KG2c basically concatenates the 5 longest descriptions of the coalesced nodes (a silly heuristic - could be replaced with something smarter if we knew which upstream sources produce "better" descriptions). and in KG2c.5.2, this number has been reduced to the top 3...
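For readers following along, the heuristic described above could look roughly like the Python sketch below. This is not the actual KG2c code: the function name, the "; " separator, and the longest-first ordering are assumptions for illustration.

```python
def concatenate_longest_descriptions(descriptions, top_n=5, separator="; "):
    """Join the top_n longest non-empty descriptions, longest first.

    Hypothetical helper: the name, the separator, and the ordering are
    assumptions; only the "keep the N longest" idea comes from the thread.
    """
    non_empty = [description for description in descriptions if description]
    longest = sorted(non_empty, key=len, reverse=True)[:top_n]
    return separator.join(longest)


# One enormous upstream description dominates the merged result.
merged = concatenate_longest_descriptions(
    ["short", None, "x" * 1000, "medium length text"], top_n=2
)
print(len(merged))
```

Because selection is by length, a single oversized upstream description is always picked and passes straight through to the merged node.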

it may be worth checking if any of the individual coalesced nodes have super long descriptions in the regular KG2. e.g., run this query on kg2-3-4 (the curies listed are the equivalent curies for "CHEBI:16108" in KG2c.3.4):

match (n) where n.id in ["UMLS:C0012324", "DRUGBANK:DB04326", "MESH:D004099", "CHEBI:16108", "CHEMBL.COMPOUND:CHEMBL1161998", "PathWhiz.Compound:1134", "PathWhiz.Compound:42631", "HMDB:HMDB0001473", "KEGG:C00111"] return n.id, n.name, n.description

(I'm trying to do this but it's been running for a few minutes - unusual...)

edeutsch commented 3 years ago

I suppose I would just suggest using the exact description of the leader of the group (i.e. the preferred CURIE). If that is null, then maybe go searching for the longest one. Although the longest ones are often stuffed with unsightly parsable metadata.

Your information explains why some of these descriptions are enormous! I don't think concatenating multiple descriptions is useful at this stage.

dkoslicki commented 3 years ago

Or perhaps concatenating up to some threshold. E.g. if the description is longer than X characters, just discard the rest (no one is going to read it anyway)
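A minimal Python sketch of this thresholding idea, assuming a "; " separator and a hard character cut; the function name and the default limit of 1000 are placeholders standing in for "X", not values from the thread:

```python
def concatenate_up_to(descriptions, max_chars=1000, separator="; "):
    """Concatenate non-empty descriptions, then discard everything past max_chars.

    Hypothetical helper: max_chars stands in for the unspecified threshold X.
    """
    combined = separator.join(
        description for description in descriptions if description
    )
    return combined[:max_chars]


truncated = concatenate_up_to(["a" * 800, "b" * 800], max_chars=1000)
print(len(truncated))
```

A hard cut can end mid-word or mid-sentence; trimming back to the last sentence boundary would be a possible refinement.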

amykglen commented 3 years ago

cool - I think that makes sense, @edeutsch. I just made that change in 4a20008e3110d5f925fbd12698600859a175b033 so it'll be included in the rebuild of KG2c.5.2 (which is now underway). I made it use the leader's description, but if that's null it takes the longest description that's not over 10,000 characters (in an attempt to save us from massive descriptions)
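The selection rule described above can be sketched in Python as follows. This is an illustration of the stated rule, not the code from the commit; the function and parameter names are assumptions, and "null" is read here as Python None.

```python
from typing import Iterable, Optional


def choose_description(
    leader_description: Optional[str],
    all_descriptions: Iterable[Optional[str]],
    max_length: int = 10_000,
) -> Optional[str]:
    """Prefer the leader's description; otherwise fall back to the longest
    description that does not exceed max_length characters.

    Hypothetical sketch: names are illustrative, not taken from KG2c.
    """
    if leader_description is not None:
        return leader_description
    candidates = [
        description
        for description in all_descriptions
        if description and len(description) <= max_length
    ]
    return max(candidates, key=len) if candidates else None


# Lengths borrowed from the size(n.description) column in this thread.
descriptions = ["x" * 136, "x" * 267, "x" * 3_000_000, "x" * 2504]
chosen = choose_description(None, descriptions)
print(len(chosen))
```

With a null leader, the 3,000,000-character entry is skipped and the 2,504-character one is chosen.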

I'm still suspicious that there's a ridiculously huge description in KG2 for one of the nodes mentioned above... this query (in kg2-3-4) shows it seems to be a PathWhiz node:

match (n) where n.id in ["UMLS:C0012324", "DRUGBANK:DB04326", "MESH:D004099", "CHEBI:16108", "CHEMBL.COMPOUND:CHEMBL1161998", "PathWhiz.Compound:1134", "PathWhiz.Compound:42631", "HMDB:HMDB0001473", "KEGG:C00111"] return n.id, n.name, size(n.description)

n.id	n.name	size(n.description)
"UMLS:C0012324"	"Dihydroxyacetone Phosphate"	136
"DRUGBANK:DB04326"	"Dihydroxyacetone phosphate"	267
"MESH:D004099"	"Dihydroxyacetone Phosphate"	136
"CHEBI:16108"	"dihydroxyacetone phosphate"	110
"CHEMBL.COMPOUND:CHEMBL1161998"	"3-hydroxy-2-oxopropyl hydrogen phosphate"	83
"PathWhiz.Compound:1134"	"Dihydroxyacetone phosphate"	3000000
"PathWhiz.Compound:42631"	"Glycerone phosphate"	325
"HMDB:HMDB0001473"	"Dihydroxyacetone phosphate"	220
"KEGG:C00111"	"Glycerone phosphate"	2504

3,000,000 characters seems quite large.

amykglen commented 3 years ago

whoops, didn't see @dkoslicki's comment there! true, we could concatenate up to a point. maybe we should go with that approach if the new one I put in place seems to be a bit sparse.

amykglen commented 3 years ago

well the description for the CHEBI:16108 node certainly is smaller in the new KG2c.5.2:

this strategy of using the leader's description unless it's null might require some revision though - looks like we're ending up with a lot of descriptions with only some UMLS semantic types in them. for example, here's acetaminophen:

but luckily this tweak can be done quite easily at any point (doesn't need to hold up rolling out KG2c.5.2 or anything)

edeutsch commented 3 years ago

one concern here is that RXNORM is the leader, when CHEMBL.COMPOUND:CHEMBL112 should be. that seems like a bug

amykglen commented 3 years ago

hm, interesting. although it seems CHEMBL nodes also don't have very helpful descriptions (a random selection from KG2c here):

n.id	n.name	n.description
"CHEMBL.COMPOUND:CHEMBL2108130"	"PLANTAGO SEED"	"PLANTAGO SEED; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL543875"	"(1H-Indol-2-ylmethyl)-(1-naphthalen-1-yl-ethyl)-amine hydrochloride"	"(1H-Indol-2-ylmethyl)-(1-naphthalen-1-yl-ethyl)-amine hydrochloride; FULL_MW:336.87; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL2003746"	"SID437154"	"SID437154; FULL_MW:574.49; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL407710"	"XERANTOLIDE"	"XERANTOLIDE; FULL_MW:246.31; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL2108098"	"SOMAGREBOVE"	"SOMAGREBOVE; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL2108158"	"WHITE LOTION"	"WHITE LOTION; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL2111077"	"SINTROPIUM"	"SINTROPIUM; FULL_MW:310.50; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL225986"	"(R)-2-hydroxysuccinic acid"	"(R)-2-hydroxysuccinic acid; FULL_MW:134.09; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL2146144"	"ATROPINE OXIDE HYDROCHLORIDE"	"ATROPINE OXIDE HYDROCHLORIDE; FULL_MW:341.83; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL194881"	"4-Dodecyl-phenol"	"4-Dodecyl-phenol; FULL_MW:262.44; MAX_FDA_APPROVAL_PHASE: 0"
"CHEMBL.COMPOUND:CHEMBL2108199"	"ERSOFERMIN"	"ERSOFERMIN; MAX_FDA_APPROVAL_PHASE: 0"

DRUGBANK preferred curies appear better though:

n.id	n.name	n.description
"DRUGBANK:DB02382"	"Namn"	"Nicotinic acid mononucleotide, also known as beta-nicotinate D-ribonucleotide or deamido-NMN, belongs to the class of organic compounds known as pentose phosphates. These are carbohydrate derivatives containing a pentose substituted by one or more phosphate groups. Nicotinic acid mononucleotide is slightly soluble (in water) and a moderately acidic compound (based on its pKa). Within the cell, nicotinic acid mononucleotide is primarily located in the cytoplasm, mitochondria and nucleus. Nicotinic acid mononucleotide exists in all eukaryotes, ranging from yeast to humans. Nicotinic acid mononucleotide participates in a number of enzymatic reactions. In particular, Nicotinic acid mononucleotide can be biosynthesized from nicotinate D-ribonucleoside through the action of the enzyme nicotinamide riboside kinase 1. Furthermore, Nicotinic acid mononucleotide can be converted into nicotinate D-ribonucleoside; which is catalyzed by the enzyme cytosolic purine 5'-nucleotidase. Furthermore, Nicotinic acid mononucleotide can be biosynthesized from quinolinic acid and phosphoribosyl pyrophosphate through its interaction with the enzyme nicotinate-nucleotide pyrophosphorylase [carboxylating]. Finally, Nicotinic acid mononucleotide can be converted into nicotinic acid adenine dinucleotide through the action of the enzyme nicotinamide/nicotinic acid mononucleotide adenylyltransferase. In humans, nicotinic acid mononucleotide is involved in the nicotinate and nicotinamide metabolism pathway."
"DRUGBANK:DB14960"	"Somatrogon"	"UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T125; UMLS Semantic Type: UMLS_STY:T109; Somatrogon is under investigation in clinical trial NCT02500316 (Long Term Follow up Study of Long-acting hGH (MOD-4023) in Growth Hormone Deficient Children)."
"DRUGBANK:DB10703"	"Trichoderma harzianum"	"UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T129; Trichoderma harzianum is a fungus which can provoke allergic reactions. Trichoderma harzianum extract is used in allergenic testing."
"DRUGBANK:DB10379"	"Dactylis glomerata pollen"	"UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109; Dactylis glomerata pollen is the pollen of the Dactylis glomerata plant. Dactylis glomerata pollen is mainly used in allergenic testing."
"DRUGBANK:DB06812"	"Povidone-iodine"	"UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109; Povidone-iodine is a stable chemical complex of polyvinylpyrrolidone (povidone, PVP) and elemental iodine. It contains from 9.0% to 12.0% available iodine, calculated on a dry basis. This unique complex was discovered in 1955 at the Industrial Toxicology Laboratories in Philadelphia by H. A. Shelanski and M. V. Shelanski. During in vitro testing to demonstrate anti-bacterial activity it was found that the complex was less toxic in mice than tincture of iodine. Human clinical trials showed the product to be superior to other iodine formulations. Povidone-iodine was immediately marketed, and has since become the universally preferred iodine antiseptic."
"DRUGBANK:DB11305"	"Quaternium-18"	"Quaternium-18 is a mixture of quaternary ammonium chloride salts. Quaternium-18 Hectorite and Bentonite are the reaction products of Quaternium-18 with clays. These compounds are poorly absorbed through the skin. Acute oral and percutaneous toxicity tests in animals indicate that they exhibit little or no systemic toxic effects."
"DRUGBANK:DB10357"	"Zea mays pollen"	"UMLS Semantic Type: UMLS_STY:T130; UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T129; Zea mays pollen is the pollen of the Zea mays plant. Zea mays pollen is mainly used in allergenic testing."
"DRUGBANK:DB04023"	"GDP-alpha-D-mannuronic acid"	"A nucleotide-sugar oxoanion obtained by deprotonation of the diphosphate OH groups of GDP-D-mannuronic acid; major species at pH 7.3."

edeutsch commented 3 years ago

so I'm seeing some things I'm thinking of fixing in the NodeSynonymizer output. @amykglen how does that affect what you're currently doing?

amykglen commented 3 years ago

so I believe the only pieces left for #1233 are addressing the failing FET/COHD/DTD tests (which I believe @chunyuma is looking into).

if you were to change the synonymizer, it wouldn't be a big deal from my end to regenerate KG2c again (don't think it should break anything in the kg2.5integration branch). although if preferred curies and/or preferred categories change, I suppose that may affect @chunyuma's rebuilding of DTD/COHD, which I believe are underway already(?)

edeutsch commented 3 years ago

@amykglen I just found and squashed a very large bug related to CHEMBL112 not being the expected leader. I plan on regenerating a new NodeSyn database. Depending on where you are, you can switch to it or not. It should not affect how many of each kind of element there are, but it will change a lot of leaders to what was originally intended.

amykglen commented 3 years ago

confirmed CHEBI:16108 no longer has a massive description on production: https://arax.ncats.io/?r=5558

good to close?

dkoslicki commented 3 years ago

Yes, good to close (the issue of more descriptive descriptions is moving to #1316).

RTXteam / RTX

Massive description and unknown category for one node #1306