Closed chunyuma closed 3 years ago
Related to #1300
And related to #1316
@chunyuma - I agree it's not ideal that the KG2c descriptions aren't always as informative as they could be. picking the longest one is an option (though I would revise that slightly and say it should pick the longest one that's not over some limit - say, 10,000 characters - in order to avoid what happened in #1306). alternatively I think some more refined ideas have been thrown out there in #1316. I guess at this point it's just a matter of picking a new method.
@amykglen, ah, I see. Thanks! I'm interested in these informative descriptions because I think they might help with the node embeddings for the DTD model. We can use NLP techniques to convert the informative descriptions into node embeddings so that the model can better understand what each node means.
Regarding the 10,000-character limit, I'm curious what causes it. Is it neo4j or the DSL query?
As for a new method to pick the "informative" or human-readable description, I did try a few methods yesterday.
1). Utilized a tokenizer from the BioBERT model. The idea is that, because the model was pre-trained on a large amount of biomedical text, its tokenizer can split a text into known words. Then we can add some rules (drop undetected words; require the token length to be >2) to filter out invalid tokens. My assumption is that the most informative human-readable description should be the one with the most detected human-readable words. I used the texts you posted in #1316 for testing; here is the result:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('dmis-lab/biobert-v1.1')
In [4]: [token for token in tokenizer.tokenize("3-hydroxy-2-oxopropyl hydrogen phosphate; FULL_MW:170.06; MAX_FDA_APPROVAL_PHASE: 0") if not '#' in token and token.isalpha() and len(token)>2]
Out[4]: ['hydrogen', 'phosphate', 'FDA']
In [6]: [token for token in tokenizer.tokenize("UMLS Semantic Type: UMLS_STY:T123; UMLS Semantic Type: UMLS_STY:T109; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis. Dihydroxyacetone phosphate has been investigated for the treatment of Lymphoma, Large-Cell, Diffuse.") if not '#' in token and token.isalpha() and len(token)>2]
Out[6]:
['Type', 'Type', 'phosphate', 'important', 'intermediate', 'lip', 'bio', 'and', 'phosphate', 'has', 'been', 'investigated', 'for', 'the', 'treatment', 'Large', 'Cell']
Clearly, the second description yields more detected words than the first one.
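The filtering rule used in the session above can be pulled out into a small helper. This is only a sketch: `readable_tokens` is a hypothetical name, and the token list in the usage line is hand-written for illustration rather than real tokenizer output. In practice the tokens would come from `tokenizer.tokenize(text)` as shown above.

```python
def readable_tokens(tokens, min_len=3):
    """Keep tokens that look like whole, human-readable words.

    Mirrors the filter in the session above: drop WordPiece
    continuation pieces (they contain '#'), drop anything that is not
    purely alphabetic, and drop tokens shorter than min_len.
    """
    return [
        tok for tok in tokens
        if "#" not in tok and tok.isalpha() and len(tok) >= min_len
    ]


# Hand-written example tokens (an assumption, not actual BioBERT output).
example = ["hydrogen", "##yl", "3", "-", "FDA", "of", "phosphate"]
print(readable_tokens(example))
```

Counting the length of the returned list gives the "number of detected words" score that the comparison above relies on.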
2). I also tried one of the methods proposed by @dkoslicki, which uses scores to evaluate how readable a text is. I used the Python library textstat for testing here. It seems like this method might also work. Here is an example (a lower grade means the text is easier for a human to read):
In [1]: import textstat
In [2]: test = '3-hydroxy-2-oxopropyl hydrogen phosphate; FULL_MW:170.06; MAX_FDA_APPROVAL_PHASE: 0'
In [3]: textstat.text_standard(test)
Out[3]: '42nd and 43rd grade'
In [4]: test = ''
In [5]: test = 'UMLS Semantic Type: UMLS_STY:T123; UMLS Semantic Type: UMLS_STY:T109; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis. Dihydroxyacetone phosphate has been investigated for the treatment of Lymphoma, Large-Cell, Diffuse.'
In [6]: textstat.text_standard(test)
Out[6]: '12th and 13th grade'
In [7]: test = 'UMLS Semantic Type: UMLS_STY:T123; UMLS Semantic Type: UMLS_STY:T109; Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis. Dihydroxyacetone phosphate has been investigated for the treatment of Lymphoma, Large-Cell, Diffuse.'
In [8]: textstat.text_standard(test)
Out[8]: '12th and 13th grade'
In [9]: test = 'Dihydroxyacetone phosphate is an important intermediate in lipid biosynthesis and in glycolysis. Dihydroxyacetone phosphate has been investigated for the treatment of Lymphoma, Large-Cell, Diffuse.'
In [10]: textstat.text_standard(test)
Out[10]: '12th and 13th grade'
In [11]: test = '-!- FUNCTION: Initiates the extrinsic pathway of blood coagulation. Serine protease that circulates in the blood in a zymogen form. Factor VII
...: is converted to factor VIIa by factor Xa, factor XIIa, factor IXa, or thrombin by minor proteolysis. In the presence of tissue factor and calcium ions,
...: factor VIIa then converts factor X to factor Xa by limited proteolysis. Factor VIIa will also convert factor IX to factor IXa in the presence of tissu
...: e factor and calcium. -!- CATALYTIC ACTIVITY: Reaction=Selective cleavage of Arg-|-Ile bond in factor X to form factor Xa.; EC=3.4.21.21; -!- SUBUNIT:
...: Heterodimer of a light chain and a heavy chain linked by a disulfide bond. {ECO:0000269|PubMed:8598903, ECO:0000269|PubMed:9925787}. -!- INTERACTION: P
...: 08709; P13726: F3; NbExp=7; IntAct=EBI-355972, EBI-1040727; -!- SUBCELLULAR LOCATION: Secreted. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; N
...: amed isoforms=2; Name=A; IsoId=P08709-1; Sequence=Displayed; Name=B; IsoId=P08709-2; Sequence=VSP_005387; -!- TISSUE SPECIFICITY: Plasma. -!- PTM: The
...: vitamin K-dependent, enzymatic carboxylation of some glutamate residues allows the modified protein to bind calcium. -!- PTM: The iron and 2-oxoglutara
...: te dependent 3-hydroxylation of aspartate and asparagine is (R) stereospecific within EGF domains. {ECO:0000269|PubMed:3264725}. -!- PTM: O- and N-glyc
...: osylated. N-glycosylation at Asn-205 occurs cotranslationally and is mediated by STT3A-containing complexes, while glycosylation at Asn-382 is post-tra
...: nslational and is mediated STT3B-containing complexes before folding. O-fucosylated by POFUT1 on a conserved serine or threonine residue found in the c
...: onsensus sequence C2-X(4,5)-[S/T]-C3 of EGF domains, where C2 and C3 are the second and third conserved cysteines. {ECO:0000269|PubMed:1904059, ECO:000
...: 0269|PubMed:19167329, ECO:0000269|PubMed:21949356, ECO:0000269|PubMed:3264725, ECO:0000269|PubMed:9023546}. -!- PTM: Can be either O-glucosylated or O-
...: xylosylated at Ser-112 by POGLUT1 in vitro. -!- DISEASE: Factor VII deficiency (FA7D) [MIM:227500]: A hemorrhagic disease with variable presentation. T
...: he clinical picture can be very severe, with the early occurrence of intracerebral hemorrhages or repeated hemarthroses, or, in contrast, moderate with
...: cutaneous-mucosal hemorrhages (epistaxis, menorrhagia) or hemorrhages provoked by a surgical intervention. Finally, numerous subjects are completely a
...: symptomatic despite very low factor VII levels. {ECO:0000269|PubMed:10862079, ECO:0000269|PubMed:11091194, ECO:0000269|PubMed:11129332, ECO:0000269|Pub
...: Med:12472587, ECO:0000269|PubMed:14717781, ECO:0000269|PubMed:1634227, ECO:0000269|PubMed:18976247, ECO:0000269|PubMed:19432927, ECO:0000269|PubMed:197
...: 51712, ECO:0000269|PubMed:2070047, ECO:0000269|PubMed:21206266, ECO:0000269|PubMed:21372693, ECO:0000269|PubMed:26761581, ECO:0000269|PubMed:7974346, E
...: CO:0000269|PubMed:7981691, ECO:0000269|PubMed:8043443, ECO:0000269|PubMed:8204879, ECO:0000269|PubMed:8242057, ECO:0000269|PubMed:8364544, ECO:0000269|
...: PubMed:8652821, ECO:0000269|PubMed:8844208, ECO:0000269|PubMed:8883260, ECO:0000269|PubMed:8940045, ECO:0000269|PubMed:9414278, ECO:0000269|PubMed:9452
...: 082, ECO:0000269|PubMed:9576180}. Note=The disease is caused by variants affecting the gene represented in this entry. -!- PHARMACEUTICAL: Available un
...: der the names Niastase or Novoseven (Novo Nordisk). Used for the treatment of bleeding episodes in hemophilia A or B patients with antibodies to coagul
...: ation factors VIII or IX. -!- SIMILARITY: Belongs to the peptidase S1 family. {ECO:0000255|PROSITE-ProRule:PRU00274}. -!- WEB RESOURCE: Name=Wikipedia;
...: Note=Factor VII entry; URL="https://en.wikipedia.org/wiki/Factor_VII"; -!- WEB RESOURCE: Name=SeattleSNPs; URL="http://pga.gs.washington.edu/data/f7/"
...: ; -!- WEB RESOURCE: Name=SHMPD; Note=The Singapore human mutation and polymorphism database; URL="http://shmpd.bii.a-star.edu.sg/gene.php?genestart=A&g
...: enename=F7"; ; Short=SPCA; AltName: INN=Eptacog alfaEvidence Codes from Name:'
In [12]: textstat.text_standard(test)
Out[12]: '22nd and 23rd grade'
In [13]: test = 'Initiates the extrinsic pathway of blood coagulation. Serine protease that circulates in the blood in a zymogen form. Factor VII is converted to factor VIIa by factor Xa, factor XIIa, factor IXa, or thrombin by minor proteolysis. In the presence of tissue factor and calcium ions, factor VIIa then converts factor X to factor Xa by limited proteolysis. Factor VIIa will also convert factor IX to factor IXa in the presence of tissue factor and calcium.'
In [14]: textstat.text_standard(test)
Out[14]: '11th and 12th grade'
very cool! I think that'd be great for the KG2c build process to use such a tool... is it something we could add a method for somewhere in the KG2c build code? (like, maybe the method could take in a list of descriptions and return the 'best' one?) and do they seem reasonably performant?
the 10,000 character limit is just an arbitrary limit I chose to prevent what happened in #1306 - basically, some of the KG2 descriptions are super long (for example, #1310) and that can crash the UI or otherwise really bog things down for KG2c/ARAX (#1306). so I put this limit in place to prevent super long descriptions from being chosen.
is it something we could add a method for somewhere in the KG2c build code? (like, maybe the method could take in a list of descriptions and return the 'best' one?)
Yes, I don't think we can depend on only one tool or method. To make the result more accurate, we might need a function (e.g., named 'select_best_description') that combines multiple tools or rules to choose the 'best' one.
Here is another example:
In [1]: import textstat
In [2]: test = 'UMLS Semantic Type: UMLS_STY:T047; UMLS Semantic Type: UMLS_STY:T033; UMLS Semantic Type: UMLS_STY:T028'
In [3]: textstat.text_standard(test)
Out[3]: '7th and 8th grade'
In [4]: test = 'Cytochrome P450 1A2 (515 aa, ~58 kDa) is encoded by the human CYP1A2 gene. This protein is involved in the hydroxylation of fatty acids, steroids and xenobiotics.; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116'
In [5]: textstat.text_standard(test)
Out[5]: '8th and 9th grade'
In [6]: test = 'A protein that is a translation product of the human CYP1A2 gene or a 1:1 ortholog thereof. // COMMENTS: Category=gene.'
In [7]: textstat.text_standard(test)
Out[7]: '11th and 12th grade'
As you can see, textstat considers 'UMLS Semantic Type: UMLS_STY:T047; UMLS Semantic Type: UMLS_STY:T033; UMLS Semantic Type: UMLS_STY:T028' easier to read because it has a lower grade, but it is actually uninformative. Perhaps we need to design a few rules, such as: 1) whether it is the longest description, subject to the 10,000-character limit; 2) whether it has sufficient human-readable words detected by the BioBERT model; 3) whether it has a lower grade. Based on these rules, I think we might be able to select the 'best' one.
I like the multiple method approach (takes care of those annoying edge cases like you point out @chunyuma )
nice, sounds great to me! and yeah, adding a function like select_best_description() sounds fantastic.
@amykglen, if that's OK with you, I can be responsible for writing such a function and sending it to you to add somewhere in the KG2c build process.
sounds great! thanks, @chunyuma!
Hi @amykglen, I've already added a class named 'select_best_description' to this file in the master branch for selecting the best description in the KG2c build process.
Here is an example showing how to use it (you might need to update your packages via requirements.txt):
import os
import sys
pathlist = os.path.realpath(__file__).split(os.path.sep)
RTXindex = pathlist.index("RTX")
sys.path.append(os.path.sep.join([*pathlist[:(RTXindex + 1)], 'code', 'kg2', 'canonicalized']))
from utils import select_best_description
#### initialize an object
selector = select_best_description(description_list)
#### To get a dict with descriptions and their corresponding final score
selector.get_final_score
#### To get the best description with the lowest score
selector.get_best_description
Please note that if a description is empty, it should be set to None. Also, I didn't apply the 10,000-character limit within this function, so if you want that limit, you have to apply it before using this function.
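The two preconditions in the note above (empty descriptions become None, and the 10,000-character limit is applied beforehand) could be handled by a small preprocessing step. The following is a sketch under assumptions: `prepare_descriptions` is a hypothetical helper that is not part of the class described here, and replacing over-limit descriptions with None is just one possible way to apply the limit.

```python
def prepare_descriptions(raw_descriptions, max_len=10_000):
    """Normalize a description list before handing it to a selector.

    Empty or whitespace-only strings become None, and descriptions
    longer than max_len are also replaced by None so that they cannot
    be chosen (an assumed way of applying the limit).
    """
    prepared = []
    for desc in raw_descriptions:
        if desc is None or not desc.strip() or len(desc) > max_len:
            prepared.append(None)
        else:
            prepared.append(desc)
    return prepared


raw = ["", "   ", None, "A readable description.", "x" * 10_001]
print(prepare_descriptions(raw))
```

Keeping the list the same length (rather than dropping entries) means positions still line up with the equivalent curies they came from.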
Here are a few examples that I tested based on kg2.5.2c:
All descriptions of this preferred curie and its equivalent_curies:
['UMLS Semantic Type: UMLS_STY:T001', None]
Best description:
selector.get_best_description
'UMLS Semantic Type: UMLS_STY:T001'
All descriptions of this preferred curie and its equivalent_curies:
['UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109',
'An organic bromide salt of distigmine. It is an anticholinesterase drug used for the treatment of myasthenia gravis and postoperative urinary retention.',
'A carbamate ester resulting from the formal condensation of both carboxy groups of hexane-1,6-diylbis(methylcarbamic acid) with the hydroxy group of 3-hydroxy-1-methylpyridinium.',
'DISTIGMINE BROMIDE; FULL_MW:576.33; MAX_FDA_APPROVAL_PHASE: 4',
'DISTIGMINE; FULL_MW:416.52; MAX_FDA_APPROVAL_PHASE: 4',
'Distigmine is a parasympathomimetic agent with a longer duration of action and enhanced drug accumulation compared to [DB00545] and [DB01400]. It is an anticholinergic drug and long-acting reversible carbamate cholinesterase inhibitor that binds directly and competitively to the agonist binding sites of muscurinic receptors. Distigmine is available in several countries as a treatment of underactive detrusor and voiding dysfunction in the urinary tract where the active ingredient is distigmine bromide. It improves detrusor function thereby restoring normal voiding patterns in patients suffering from detrusor underactivity [A27176].',
'None',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T109',
None,
None,
None,
None]
Best description:
selector.get_best_description
'Distigmine is a parasympathomimetic agent with a longer duration of action and enhanced drug accumulation compared to [DB00545] and [DB01400]. It is an anticholinergic drug and long-acting reversible carbamate cholinesterase inhibitor that binds directly and competitively to the agonist binding sites of muscurinic receptors. Distigmine is available in several countries as a treatment of underactive detrusor and voiding dysfunction in the urinary tract where the active ingredient is distigmine bromide. It improves detrusor function thereby restoring normal voiding patterns in patients suffering from detrusor underactivity [A27176].'
All descriptions of this preferred curie and its equivalent_curies:
['UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'Coagulation factor XIII; TARGET_TYPE: SINGLE PROTEIN',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116']
Best description:
selector.get_best_description
'Coagulation factor XIII; TARGET_TYPE: SINGLE PROTEIN'
All descriptions of this preferred curie and its equivalent_curies:
['UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'Lysozyme C; TARGET_TYPE: SINGLE PROTEIN',
'Lysozyme C; TARGET_TYPE: SINGLE PROTEIN',
'Lysozyme; TARGET_TYPE: SINGLE PROTEIN',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
None,
None,
'Food additive; technological purpose(s): preservative. // COMMENTS: LanguaL curation note: See "food additive" comments.',
'A protein coding gene LYZ in human. // COMMENTS: Category=external.; UMLS Semantic Type: UMLS_STY:T028',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'UMLS Semantic Type: UMLS_STY:T059',
'UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'Type:protein-coding; Locus:12q15; NameStatus:official',
'The determination of the lysozyme present in a sample.; UMLS Semantic Type: UMLS_STY:T059',
'Lysozyme C (148 aa, ~17 kDa) is encoded by the human LYZ gene. This protein is involved in the proteolytic degradation of bacterial peptidoglycans.; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'UMLS Semantic Type: UMLS_STY:T047; UMLS Semantic Type: UMLS_STY:T028',
None,
'A protein that is a translation product of the human LYZ gene or a 1:1 ortholog thereof. // COMMENTS: Category=gene.',
'A lysozyme C that is encoded in the genome of human. // COMMENTS: Category=organism-gene.',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'UMLS Semantic Type: UMLS_STY:T059',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'Lysozyme C (148 aa, ~17 kDa) is encoded by the human LYZ gene. This protein is involved in the proteolytic degradation of bacterial peptidoglycans.; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'The determination of the lysozyme present in a sample.; UMLS Semantic Type: UMLS_STY:T059',
'UMLS Semantic Type: UMLS_STY:T121; UMLS Semantic Type: UMLS_STY:T126; UMLS Semantic Type: UMLS_STY:T116',
'-!- FUNCTION: Lysozymes have primarily a bacteriolytic function; those in tissues and body fluids are associated with the monocyte-macrophage system and enhance the activity of immunoagents. -!- CATALYTIC ACTIVITY: Reaction=Hydrolysis of (1->4)-beta-linkages between N-acetylmuramic acid and N-acetyl-D-glucosamine residues in a peptidoglycan and between N-acetyl-D-glucosamine residues in chitodextrins.; EC=3.2.1.17; -!- SUBUNIT: Monomer. -!- INTERACTION: P61626; P61626: LYZ; NbExp=3; IntAct=EBI-355360, EBI-355360; -!- SUBCELLULAR LOCATION: Secreted. -!- DISEASE: Amyloidosis 8 (AMYL8) [MIM:105200]: A form of hereditary generalized amyloidosis. Clinical features include extensive visceral amyloid deposits, renal amyloidosis resulting in nephrotic syndrome, arterial hypertension, hepatosplenomegaly, cholestasis, petechial skin rash. There is no involvement of the nervous system. {ECO:0000269|PubMed:8464497}. Note=The disease is caused by variants affecting the gene represented in this entry. -!- MISCELLANEOUS: Lysozyme C is capable of both hydrolysis and transglycosylation; it shows also a slight esterase activity. It acts rapidly on both peptide-substituted and unsubstituted peptidoglycan, and slowly on chitin oligosaccharides. -!- SIMILARITY: Belongs to the glycosyl hydrolase 22 family. {ECO:0000255|PROSITE-ProRule:PRU00680}. -!- SEQUENCE CAUTION: Sequence=CAA32175.1; Type=Erroneous initiation; Evidence={ECO:0000305}; -!- WEB RESOURCE: Name=Wikipedia; Note=Lysozyme entry; URL="https://en.wikipedia.org/wiki/Lysozyme"; Evidence Codes from Name: SEQUENCE 148 AA; 16537 MW; 8ECFD276BEB2678A CRC64MKALIVLGLV LLSVTVQGKV FERCELARTL KRLGMDGYRG ISLANWMCLA KWESGYNTRATNYNAGDRST DYGIFQINSR YWCNDGKTPG AVNACHLSCS ALLQDNIADA VACAKRVVRDPQGIRAWVAW RNRCQNRDVR QYVQGCGV; This gene encodes human lysozyme, whose natural substrate is the bacterial cell wall peptidoglycan (cleaving the beta[1-4]glycosidic linkages between N-acetylmuramic acid and N-acetylglucosamine). 
Lysozyme is one of the antimicrobial agents found in human milk, and is also present in spleen, lung, kidney, white blood cells, plasma, saliva, and tears. The protein has antibacterial activity against a number of bacterial species. Missense mutations in this gene have been identified in heritable renal amyloidosis. [provided by RefSeq, Oct 2014].']
Best description:
selector.get_best_description
'-!- FUNCTION: Lysozymes have primarily a bacteriolytic function; those in tissues and body fluids are associated with the monocyte-macrophage system and enhance the activity of immunoagents. -!- CATALYTIC ACTIVITY: Reaction=Hydrolysis of (1->4)-beta-linkages between N-acetylmuramic acid and N-acetyl-D-glucosamine residues in a peptidoglycan and between N-acetyl-D-glucosamine residues in chitodextrins.; EC=3.2.1.17; -!- SUBUNIT: Monomer. -!- INTERACTION: P61626; P61626: LYZ; NbExp=3; IntAct=EBI-355360, EBI-355360; -!- SUBCELLULAR LOCATION: Secreted. -!- DISEASE: Amyloidosis 8 (AMYL8) [MIM:105200]: A form of hereditary generalized amyloidosis. Clinical features include extensive visceral amyloid deposits, renal amyloidosis resulting in nephrotic syndrome, arterial hypertension, hepatosplenomegaly, cholestasis, petechial skin rash. There is no involvement of the nervous system. {ECO:0000269|PubMed:8464497}. Note=The disease is caused by variants affecting the gene represented in this entry. -!- MISCELLANEOUS: Lysozyme C is capable of both hydrolysis and transglycosylation; it shows also a slight esterase activity. It acts rapidly on both peptide-substituted and unsubstituted peptidoglycan, and slowly on chitin oligosaccharides. -!- SIMILARITY: Belongs to the glycosyl hydrolase 22 family. {ECO:0000255|PROSITE-ProRule:PRU00680}. -!- SEQUENCE CAUTION: Sequence=CAA32175.1; Type=Erroneous initiation; Evidence={ECO:0000305}; -!- WEB RESOURCE: Name=Wikipedia; Note=Lysozyme entry; URL="https://en.wikipedia.org/wiki/Lysozyme"; Evidence Codes from Name: SEQUENCE 148 AA; 16537 MW; 8ECFD276BEB2678A CRC64MKALIVLGLV LLSVTVQGKV FERCELARTL KRLGMDGYRG ISLANWMCLA KWESGYNTRATNYNAGDRST DYGIFQINSR YWCNDGKTPG AVNACHLSCS ALLQDNIADA VACAKRVVRDPQGIRAWVAW RNRCQNRDVR QYVQGCGV; This gene encodes human lysozyme, whose natural substrate is the bacterial cell wall peptidoglycan (cleaving the beta[1-4]glycosidic linkages between N-acetylmuramic acid and N-acetylglucosamine). 
Lysozyme is one of the antimicrobial agents found in human milk, and is also present in spleen, lung, kidney, white blood cells, plasma, saliva, and tears. The protein has antibacterial activity against a number of bacterial species. Missense mutations in this gene have been identified in heritable renal amyloidosis. [provided by RefSeq, Oct 2014].'
You're welcome to run more tests, and if you feel a result is not reasonable, please let me know.
This is looking great @chunyuma ! And thanks for updating the requirements.txt file too.
awesome, thanks @chunyuma! it seems to me like it's doing a good job picking descriptions in testing!
the only issue I'm seeing is that I think it will take 40 hours to choose the best descriptions for all the nodes in KG2c (it took about 4 minutes to do 10,000 nodes, and there are about 6,000,000 nodes in KG2c).
so, this would bump up the KG2c build time from about 5 hours to 45 hours. is there any way to speed it up? I'm not sure it's worth it to add 40 hours to KG2c's build time..
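The estimate above is a straight linear extrapolation from the 10,000-node sample. As a quick check of the arithmetic (the function name is hypothetical, and the calculation assumes the per-node cost stays constant across all of KG2c):

```python
def estimated_hours(total_nodes, sample_nodes, sample_minutes):
    """Linearly extrapolate a timed sample to the full node count."""
    return total_nodes / sample_nodes * sample_minutes / 60


extra_hours = estimated_hours(6_000_000, 10_000, 4)
print(extra_hours)      # extra time for description selection, in hours
print(5 + extra_hours)  # added to the roughly 5-hour build
```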
Thanks for reporting this issue @amykglen, I can investigate it.
also, interestingly, I just compared your method to a method of choosing the longest description under 10,000 characters, and they chose the same description for 99.8% of the 10,000 nodes I tested
so worst case if the code can't be sped up, perhaps that simple rule would be a solid alternative (super cool to be able to verify it using your method)
@chunyuma It might be worth looking into if the long descriptions are "worth it." Currently, it seems that the bias is towards long descriptions. Perhaps a bit of tuning of the weightings might help
@dkoslicki, I don't think the bias toward long descriptions is that serious. If you've seen the description of my method here, it's based on three rules (the length of the description, the number of human-readable words detected with the BioBERT model, and the estimated school grade level returned by textstat.text_standard), which currently have equal weight. Unless the long description ranks at the top among all candidate descriptions in all three parts, it will not be assigned a 'good' score. So @amykglen's investigation might mean that the long description tends to win, or at least rank near the top, in all three parts.
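To make the equal-weight idea concrete, here is a rough sketch of combining three per-description ranks into one score where the lowest total wins. It is not the actual select_best_description implementation, which is not shown in this thread. All function names are hypothetical, and the two scorers are deliberately simple stand-ins: a regex word count in place of the BioBERT tokenizer, and mean words per sentence in place of textstat.text_standard.

```python
import re


def rank_positions(values, higher_is_better):
    """Return each value's rank (0 = best); equal values share a rank."""
    ordered = sorted(set(values), reverse=higher_is_better)
    position = {v: i for i, v in enumerate(ordered)}
    return [position[v] for v in values]


def word_count(text):
    # Stand-in for the BioBERT step: alphabetic runs of 3+ letters.
    return len(re.findall(r"[A-Za-z]{3,}", text))


def avg_sentence_words(text):
    # Stand-in for a readability grade: mean words per sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text.split()) / max(len(sentences), 1)


def pick_by_rank_sum(descriptions):
    """Pick the description with the lowest sum of three equal-weight ranks."""
    candidates = [d for d in descriptions if d]
    if not candidates:
        return None
    length_rank = rank_positions([len(d) for d in candidates], True)
    word_rank = rank_positions([word_count(d) for d in candidates], True)
    grade_rank = rank_positions(
        [avg_sentence_words(d) for d in candidates], False
    )
    totals = [a + b + c for a, b, c in zip(length_rank, word_rank, grade_rank)]
    return candidates[totals.index(min(totals))]


descs = [
    "Short note.",
    "A longer description of a protein. It has two sentences.",
    None,
]
print(pick_by_rank_sum(descs))
```

In this sketch two of the three ranks are computed from raw counts that naturally grow with text length, which is one way a combined score could end up agreeing with "take the longest" most of the time.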
I guess what I was trying to say was: what advantage does the 3-pronged approach have over just taking the longest description below 10,000 characters? Agreement of over 99% between your method and the longest description seems to suggest bias. Given Amy's analysis, it seems odd that BioBERT "readable" plus textstat low grade scores would correlate so highly with long descriptions. I'm wondering if a) the BioBERT count might be unnormalized (so long descriptions would allow for more recognizable words to appear) or b) long descriptions happen to have many "simple" words (e.g., a, an, the, or, but) throwing off textstat. I quite like your approach, but perhaps it might need a little bit of tweaking to ensure it's not duplicating "just take the longest".
@dkoslicki, for the BioBERT "readable" words, my method also requires the length of a token (effective word) to be at least 3, so I think this effectively avoids counting many of the "simple" words you point out (e.g., a, an, or). If you are worried about this, perhaps we can raise the length limit to 4 or even higher, which would remove more of those "simple" words from consideration. I'm quite curious why you think the long description might not be an appropriate description. If you could provide an example where the long description is not appropriate, I think it would help me optimize my method.
Hi @amykglen, have you ever seen cases where the longest description is not an appropriate description? If you have, could you show me some of them? I think this might help us decide whether we can use the longest-description method in place of my method if I ultimately find that my method can't be sped up. Thanks!
@amykglen, I investigated the running time of my method. I ran it on 10,000 nodes randomly picked from kg2.5.2c and calculated the average time per node. I did this ~100 times; the figure below shows the distribution of the average time over these 102 tests. The x axis is the average time of a single run in seconds. As you can see, most of the time a single node needs only around 0.0025 seconds, so running 10,000 nodes needs about 0.0025 × 10,000 = 25 seconds. But in some cases it did need 0.02 seconds per node on average, which gives (0.02 × 10,000)/60 ≈ 3.33 minutes. So perhaps the case that you reported last night might be a special case (if you have time, could you please help me test a few more times on your machine?). Most of the time, it should be fast. I can't think of a way to further optimize the running time. If it is still slow, perhaps we can try running it in parallel.
Also, I'm not sure whether the running time is affected by the machine configuration and environment. The machine I'm running on has 36 virtual CPUs and 1 GPU, but I'm pretty sure the GPU is not used for this test. Here is the RAM and CPU info for running a script that tests this method on 10,000 nodes (the script includes some I/O operations, so it might take longer than 25 seconds).
Command being timed: "python test.py"
User time (seconds): 44.05
System time (seconds): 1.78
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.55
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3320088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1601526
Voluntary context switches: 40
Involuntary context switches: 170
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
thanks for all the info! I'll do some more testing on my end... I'm seeing about half a second for a list of 3 to 5 descriptions on my machine, but maybe the instance we build KG2c on could get through these faster... will do some experimenting.
when you say you 'ran this method for 10,000 nodes randomly picked from kg2.5.2c', does that mean for each of the 10,000 nodes, you grabbed the descriptions for all of their equivalent curies in KG2.5.2, and then fed that list into your get_best_description function?
when you say your 'ran this method for 10,000 nodes randomly picked from kg2.5.2c', does that mean for each of the 10,000 nodes, you grabbed the descriptions for all of their equivalent curies in KG2.5.2, and then fed that list into your get_best_description function?
Here is how I did it. First, I grabbed all curies and their descriptions from KG2.5.2 and saved them in a dictionary. Then I grabbed all preferred curies and their equivalent curies from KG2.5.2c. Based on this information, I constructed a dictionary with the preferred node as the key and the list of descriptions of its equivalent curies as the value. Then I randomly picked 10,000 keys for a test.
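The dictionary construction described above might look roughly like the following. This is a sketch with made-up data: the function names, the `EX:` curies, and the two input dictionaries are all hypothetical, and how the real mappings are loaded from KG2.5.2 and KG2.5.2c is not shown in this thread.

```python
import random


def descriptions_by_preferred_curie(curie_to_description, preferred_to_equivalents):
    """Map each preferred curie to the descriptions of its equivalent curies.

    Curies with no known description contribute None.
    """
    return {
        preferred: [curie_to_description.get(c) for c in equivalents]
        for preferred, equivalents in preferred_to_equivalents.items()
    }


def sample_keys(mapping, k, seed=None):
    """Randomly pick up to k keys from the mapping."""
    rng = random.Random(seed)
    keys = list(mapping)
    return rng.sample(keys, min(k, len(keys)))


# Toy inputs standing in for the KG2 and KG2c lookups.
curie_to_description = {"EX:1": "first", "EX:2": None, "EX:3": "third"}
preferred_to_equivalents = {"EX:1": ["EX:1", "EX:2"], "EX:3": ["EX:3", "EX:9"]}

grouped = descriptions_by_preferred_curie(curie_to_description, preferred_to_equivalents)
test_keys = sample_keys(grouped, 10_000, seed=0)
print(grouped)
print(test_keys)
```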
ok - I switched to using @chunyuma's method in parallel, and on buildkg2c.rtx.ai (which has 16 cpus), my experiments suggest it should take just under 4 hours for all of the nodes in KG2c, which seems pretty reasonable to me.
Hi @amykglen, have you ever seen some cases that the longest description might not be an appropriate description? If you have, is it possible to show me some of these cases?
because we're excluding descriptions over 10,000 characters, I think it is generally true that the longer descriptions will be better. the best example I know of where the longest description isn't super readable is with some UniProtKB nodes - but even then, I think the longest is still typically better than the other options.
I did do a little more experimentation comparing @chunyuma's method to just taking the longest description under 10,000 characters, and when I preprocess the descriptions and strip all of the "UMLS Semantic Type: UMLS_STY:XXXX;" bits out, the two methods chose the same descriptions for 47,972 / 48,039 nodes (99.9%).
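A comparison like the one above could be set up along these lines. This is only a sketch: the regular expression, the helper names, and the trailing-separator cleanup are assumptions rather than the code that was actually run, and the baseline treats "under 10,000 characters" as a strict less-than.

```python
import re

# Matches one "UMLS Semantic Type: UMLS_STY:XXXX;" fragment (assumed pattern).
UMLS_TAG = re.compile(r"UMLS Semantic Type: UMLS_STY:\w+;?\s*")


def strip_umls_tags(description):
    """Remove UMLS semantic-type fragments; return None if nothing is left."""
    if description is None:
        return None
    cleaned = UMLS_TAG.sub("", description)
    # Drop any separator left dangling at the end by the removal.
    cleaned = cleaned.strip().rstrip("; ")
    return cleaned or None


def longest_under_limit(descriptions, max_len=10_000):
    """Baseline rule: the longest description shorter than max_len."""
    candidates = [d for d in descriptions if d and len(d) < max_len]
    return max(candidates, key=len) if candidates else None


def agreement_rate(choices_a, choices_b):
    """Fraction of positions where two methods chose the same description."""
    matches = sum(a == b for a, b in zip(choices_a, choices_b))
    return matches / len(choices_a)


text = "The determination of the lysozyme present in a sample.; UMLS Semantic Type: UMLS_STY:T059"
print(strip_umls_tags(text))
print(f"{47_972 / 48_039:.1%}")  # the agreement figure reported above
```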
@amykglen, thanks a lot for running these experiments. It is very interesting. So perhaps we have two options for choosing the 'best' descriptions: one is this multiple-approach method, and the other is just taking the longest description.
I did do a little more experimentation comparing @chunyuma's method to just taking the longest description under 10,000 characters, and when I preprocess the descriptions and strip all of the "UMLS Semantic Type: UMLS_STY:XXXX;" bits out, the two methods chose the same descriptions for 47,972 / 48,039 nodes (99.9%).
@amykglen, could you also help figure out which descriptions the multiple-approach method chooses for the 0.1% of nodes that differ? Given the assumption that longer might be better, I'm curious why my method didn't take the longest one under 10,000 characters for those nodes. This might help evaluate whether my method can indeed get the 'best' one rather than always the longest one.
These might be two good examples for comparing the multiple-approach-combined method with the longest-description method and determining which might be better:
All descriptions:
['-!- FUNCTION: Possesses tyrosine phosphatase activity. {ECO:0000269|PubMed:19167335}. -!- CATALYTIC ACTIVITY: Reaction=H2O + O-phospho-L-tyrosyl-[protein] = L-tyrosyl-[protein] + phosphate; Xref=Rhea:RHEA:10684, Rhea:RHEA-COMP:10136, Rhea:RHEA-COMP:10137, ChEBI:CHEBI:15377, ChEBI:CHEBI:43474, ChEBI:CHEBI:46858, ChEBI:CHEBI:82620; EC=3.1.3.48; Evidence={ECO:0000255|PROSITE-ProRule:PRU10044, ECO:0000269|PubMed:19167335}; -!- SUBUNIT: Monomer; active form. Homodimer; inactive form (Probable). Interacts with CNTN3, CNTN4, CNTN5 and CNTN6. {ECO:0000269|PubMed:19167335, ECO:0000269|PubMed:20133774, ECO:0000305}. -!- INTERACTION: P23470; P35222: CTNNB1; NbExp=2; IntAct=EBI-2258115, EBI-491549; P23470; P00533: EGFR; NbExp=3; IntAct=EBI-2258115, EBI-297353; -!- SUBCELLULAR LOCATION: Membrane {ECO:0000305}; Single-pass type I membrane protein {ECO:0000305}. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=P23470-1; Sequence=Displayed; Name=2; IsoId=P23470-2; Sequence=VSP_024353; -!- TISSUE SPECIFICITY: Found in a variety of tissues. -!- SIMILARITY: Belongs to the protein-tyrosine phosphatase family. Receptor class 5 subfamily. {ECO:0000305}. 
-!- SEQUENCE CAUTION: Sequence=BAD93108.1; Type=Erroneous initiation; Evidence={ECO:0000305}; -!- WEB RESOURCE: Name=Atlas of Genetics and Cytogenetics in Oncology and Haematology; URL="http://atlasgeneticsoncology.org/Genes/PTPRGID41930ch3p21.html"; ; Short=Protein-tyrosine phosphatase gamma; Short=R-PTP-gammaEvidence Codes from Name: SEQUENCE 1445 AA; 162003 MW; A48A007BA14082BC CRC64MRRLLEPCWW ILFLKITSSV LHYVVCFPAL TEGYVGALHE NRHGSAVQIR RRKASGDPYWAYSGAYGPEH WVTSSVSCGG RHQSPIDILD QYARVGEEYQ ELQLDGFDNE SSNKTWMKNTGKTVAILLKD DYFVSGAGLP GRFKAEKVEF HWGHSNGSAG SEHSINGRRF PVEMQIFFYNPDDFDSFQTA ISENRIIGAM AIFFQVSPRD NSALDPIIHG LKGVVHHEKE TFLDPFVLRDLLPASLGSYY RYTGSLTTPP CSEIVEWIVF RRPVPISYHQ LEAFYSIFTT EQQDHVKSVEYLRNNFRPQQ RLHDRVVSKS AVRDSWNHDM TDFLENPLGT EASKVCSSPP IHMKVQPLNQTALQVSWSQP ETIYHPPIMN YMISYSWTKN EDEKEKTFTK DSDKDLKATI SHVSPDSLYLFRVQAVCRND MRSDFSQTML FQANTTRIFQ GTRIVKTGVP TASPASSADM APISSGSSTWTSSGIPFSFV SMATGMGPSS SGSQATVASV VTSTLLAGLG FGGGGISSFP STVWPTRLPTAASASKQAAR PVLATTEALA SPGPDGDSSP TKDGEGTEEG EKDEKSESED GEREHEEDGEKDSEKKEKSG VTHAAEERNQ TEPSPTPSSP NRTAEGGHQT IPGHEQDHTA VPTDQTGGRRDAGPGLDPDM VTSTQVPPTA TEEQYAGSDP KRPEMPSKKP MSRGDRFSED SRFITVNPAEKNTSGMISRP APGRMEWIIP LIVVSALTFV CLILLIAVLV YWRGCNKIKS KGFPRRFREVPSSGERGEKG SRKCFQTAHF YVEDSSSPRV VPNESIPIIP IPDDMEAIPV KQFVKHIGELYSNNQHGFSE DFEEVQRCTA DMNITAEHSN HPENKHKNRY INILAYDHSR VKLRPLPGKDSKHSDYINAN YVDGYNKAKA YIATQGPLKS TFEDFWRMIW EQNTGIIVMI TNLVEKGRRKCDQYWPTENS EEYGNIIVTL KSTKIHACYT VRRFSIRNTK VKKGQKGNPK GRQNERVVIQYHYTQWPDMG VPEYALPVLT FVRRSSAARM PETGPVLVHC SAGVGRTGTY IVIDSMLQQIKDKSTVNVLG FLKHIRTQRN YLVQTEEQYI FIHDALLEAI LGKETEVSSN QLHSYVNSILIPGVGGKTRL EKQFKLVTQC NAKYVECFSA QKECNKEKNR NSSVVPSERA RVGLAPLPGMKGTDYINASY IMGYYRSNEF IITQHPLPHT TKDFWRMIWD HNAQIIVMLP DNQSLAEDEFVYWPSREESM NCEAFTVTLI SKDRLCLSNE EQIIIHDFIL EATQDDYVLE VRHFQCPKWPNPDAPISSTF ELINVIKEEA LTRDGPTIVH DEYGAVSAGM LCALTTLSQQ LENENAVDVFQVAKMINLMR PGVFTDIEQY QFIYKAMLSL VSTKENGNGP MTVDKNGAVL IADESDPAESMESLV; The protein encoded by this gene is a 
member of the protein tyrosine phosphatase (PTP) family. PTPs are known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation. This PTP possesses an extracellular region, a single transmembrane region, and two tandem intracytoplasmic catalytic domains, and thus represents a receptor-type PTP. The extracellular region of this PTP contains a carbonic anhydrase-like (CAH) domain, which is also found in the extracellular region of PTPRBETA/ZETA. This gene is located in a chromosomal region that is frequently deleted in renal cell carcinoma and lung carcinoma, thus is thought to be a candidate tumor suppressor gene. [provided by RefSeq, Jul 2008].',
'Receptor-type tyrosine-protein phosphatase gamma (1445 aa, ~162 kDa) is encoded by the human PTPRG gene. This protein is involved in both protein dephosphorylation and signal transduction.',
'A receptor-type tyrosine-protein phosphatase gamma that is encoded in the genome of human. // COMMENTS: Category=organism-gene.',
'A protein that is a translation product of the human PTPRG gene or a 1:1 ortholog thereof. // COMMENTS: Category=gene.',
'Receptor-type tyrosine-protein phosphatase gamma; TARGET_TYPE: SINGLE PROTEIN',
'A protein coding gene PTPRG in human. // COMMENTS: Category=external.',
'Type:protein-coding; Locus:3p14.2; NameStatus:official']
the longest one:
'-!- FUNCTION: Possesses tyrosine phosphatase activity. {ECO:0000269|PubMed:19167335}. -!- CATALYTIC ACTIVITY: Reaction=H2O + O-phospho-L-tyrosyl-[protein] = L-tyrosyl-[protein] + phosphate; Xref=Rhea:RHEA:10684, Rhea:RHEA-COMP:10136, Rhea:RHEA-COMP:10137, ChEBI:CHEBI:15377, ChEBI:CHEBI:43474, ChEBI:CHEBI:46858, ChEBI:CHEBI:82620; EC=3.1.3.48; Evidence={ECO:0000255|PROSITE-ProRule:PRU10044, ECO:0000269|PubMed:19167335}; -!- SUBUNIT: Monomer; active form. Homodimer; inactive form (Probable). Interacts with CNTN3, CNTN4, CNTN5 and CNTN6. {ECO:0000269|PubMed:19167335, ECO:0000269|PubMed:20133774, ECO:0000305}. -!- INTERACTION: P23470; P35222: CTNNB1; NbExp=2; IntAct=EBI-2258115, EBI-491549; P23470; P00533: EGFR; NbExp=3; IntAct=EBI-2258115, EBI-297353; -!- SUBCELLULAR LOCATION: Membrane {ECO:0000305}; Single-pass type I membrane protein {ECO:0000305}. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=P23470-1; Sequence=Displayed; Name=2; IsoId=P23470-2; Sequence=VSP_024353; -!- TISSUE SPECIFICITY: Found in a variety of tissues. -!- SIMILARITY: Belongs to the protein-tyrosine phosphatase family. Receptor class 5 subfamily. {ECO:0000305}. 
-!- SEQUENCE CAUTION: Sequence=BAD93108.1; Type=Erroneous initiation; Evidence={ECO:0000305}; -!- WEB RESOURCE: Name=Atlas of Genetics and Cytogenetics in Oncology and Haematology; URL="http://atlasgeneticsoncology.org/Genes/PTPRGID41930ch3p21.html"; ; Short=Protein-tyrosine phosphatase gamma; Short=R-PTP-gammaEvidence Codes from Name: SEQUENCE 1445 AA; 162003 MW; A48A007BA14082BC CRC64MRRLLEPCWW ILFLKITSSV LHYVVCFPAL TEGYVGALHE NRHGSAVQIR RRKASGDPYWAYSGAYGPEH WVTSSVSCGG RHQSPIDILD QYARVGEEYQ ELQLDGFDNE SSNKTWMKNTGKTVAILLKD DYFVSGAGLP GRFKAEKVEF HWGHSNGSAG SEHSINGRRF PVEMQIFFYNPDDFDSFQTA ISENRIIGAM AIFFQVSPRD NSALDPIIHG LKGVVHHEKE TFLDPFVLRDLLPASLGSYY RYTGSLTTPP CSEIVEWIVF RRPVPISYHQ LEAFYSIFTT EQQDHVKSVEYLRNNFRPQQ RLHDRVVSKS AVRDSWNHDM TDFLENPLGT EASKVCSSPP IHMKVQPLNQTALQVSWSQP ETIYHPPIMN YMISYSWTKN EDEKEKTFTK DSDKDLKATI SHVSPDSLYLFRVQAVCRND MRSDFSQTML FQANTTRIFQ GTRIVKTGVP TASPASSADM APISSGSSTWTSSGIPFSFV SMATGMGPSS SGSQATVASV VTSTLLAGLG FGGGGISSFP STVWPTRLPTAASASKQAAR PVLATTEALA SPGPDGDSSP TKDGEGTEEG EKDEKSESED GEREHEEDGEKDSEKKEKSG VTHAAEERNQ TEPSPTPSSP NRTAEGGHQT IPGHEQDHTA VPTDQTGGRRDAGPGLDPDM VTSTQVPPTA TEEQYAGSDP KRPEMPSKKP MSRGDRFSED SRFITVNPAEKNTSGMISRP APGRMEWIIP LIVVSALTFV CLILLIAVLV YWRGCNKIKS KGFPRRFREVPSSGERGEKG SRKCFQTAHF YVEDSSSPRV VPNESIPIIP IPDDMEAIPV KQFVKHIGELYSNNQHGFSE DFEEVQRCTA DMNITAEHSN HPENKHKNRY INILAYDHSR VKLRPLPGKDSKHSDYINAN YVDGYNKAKA YIATQGPLKS TFEDFWRMIW EQNTGIIVMI TNLVEKGRRKCDQYWPTENS EEYGNIIVTL KSTKIHACYT VRRFSIRNTK VKKGQKGNPK GRQNERVVIQYHYTQWPDMG VPEYALPVLT FVRRSSAARM PETGPVLVHC SAGVGRTGTY IVIDSMLQQIKDKSTVNVLG FLKHIRTQRN YLVQTEEQYI FIHDALLEAI LGKETEVSSN QLHSYVNSILIPGVGGKTRL EKQFKLVTQC NAKYVECFSA QKECNKEKNR NSSVVPSERA RVGLAPLPGMKGTDYINASY IMGYYRSNEF IITQHPLPHT TKDFWRMIWD HNAQIIVMLP DNQSLAEDEFVYWPSREESM NCEAFTVTLI SKDRLCLSNE EQIIIHDFIL EATQDDYVLE VRHFQCPKWPNPDAPISSTF ELINVIKEEA LTRDGPTIVH DEYGAVSAGM LCALTTLSQQ LENENAVDVFQVAKMINLMR PGVFTDIEQY QFIYKAMLSL VSTKENGNGP MTVDKNGAVL IADESDPAESMESLV; The protein encoded by this gene is a 
member of the protein tyrosine phosphatase (PTP) family. PTPs are known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation. This PTP possesses an extracellular region, a single transmembrane region, and two tandem intracytoplasmic catalytic domains, and thus represents a receptor-type PTP. The extracellular region of this PTP contains a carbonic anhydrase-like (CAH) domain, which is also found in the extracellular region of PTPRBETA/ZETA. This gene is located in a chromosomal region that is frequently deleted in renal cell carcinoma and lung carcinoma, thus is thought to be a candidate tumor suppressor gene. [provided by RefSeq, Jul 2008].'
the one picked by the multiple-approach-combined method:
'Receptor-type tyrosine-protein phosphatase gamma (1445 aa, ~162 kDa) is encoded by the human PTPRG gene. This protein is involved in both protein dephosphorylation and signal transduction.'
All descriptions:
['The flat triangle-shaped bone that connects the humerus with the clavicle in the back of the shoulder.; Also called the shoulder blade, it is a flat triangular bone, a pair of which form the back part of the shoulder girdle.',
'Also called the shoulder blade, it is a flat triangular bone, a pair of which form the back part of the shoulder girdle.',
'Endochondral bone that is dorsoventrally compressed and provides attachment site for muscles of the pectoral appendage.',
'The flat triangle-shaped bone that connects the humerus with the clavicle in the back of the shoulder.']
the longest one:
'The flat triangle-shaped bone that connects the humerus with the clavicle in the back of the shoulder.; Also called the shoulder blade, it is a flat triangular bone, a pair of which form the back part of the shoulder girdle.'
the one picked by the multiple-approach-combined method:
'Also called the shoulder blade, it is a flat triangular bone, a pair of which form the back part of the shoulder girdle.'
Based on these two cases, we can see that in example 1 the longest description is not considered the best one because it contains too much messy information, which gives it a low reading-ease score. In example 2, however, the reading-ease rule judges that one part of the longest description is easier to read than the longest description as a whole.
So if we don't care about messy information in the descriptions, using the longest description might be better than using the multiple-approach-combined method. Picking the longest description also runs faster.
Example 1 definitely shows that the more sophisticated approach gives better answers, while the second example is kind of a toss-up: either would be acceptable. Given it’s only ~4 hours of extra time to do the more sophisticated approach, I’d lean towards just keeping that. Who knows, it might serve to help us in the future if some gnarly KS is ingested with ridiculously long (or uninformative) descriptions.
sounds good to me! good point about future knowledge sources potentially having gnarly descriptions.
I made Chunyu's approach the one that the KG2c build uses - it'll be reflected in the KG2.6.x build of KG2c (hopefully happening sometime soon, pending #1432/#1423).
are we good to close this issue? KG2.6.7c contains descriptions chosen by @chunyuma's approach. they look good to me!
I think we can close this issue. I will close this issue now. Thanks @amykglen!
Hi @amykglen,
I'm wondering if we can keep the longest description from one of the preferred curie's synonyms in KG2C. Here is an example:
For the curie CHEMBL.COMPOUND:CHEMBL1199307 in KG2C (endpoint: http://kg2c-5-2.rtx.ai:7474/browser/), the current description is "DISTIGMINE; FULL_MW:416.52; MAX_FDA_APPROVAL_PHASE: 4". But based on its synonyms, I searched for their original descriptions in KG2 (endpoint: http://kg2endpoint-kg2-5-2.rtx.ai:7474/browser/) with a Cypher query:
It seems we could have a more complete description for this preferred curie. It looks like right now we only use the original description of the preferred curie ID rather than the longest description among its synonyms. I just want to see whether it is possible to keep the longest description in KG2C. Would this affect anything?
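To make the proposal concrete, here is a rough sketch of picking the longest description across a preferred curie and its synonyms. It is not how KG2c currently works. The function name, the 10,000-character cap carried over from the earlier discussion, and the tie-break in favor of the preferred curie are all assumptions, and the synonym entry in the example data is a made-up placeholder.

```python
def longest_cluster_description(preferred_curie, descriptions_by_curie, max_length=10_000):
    """Pick the longest description under max_length among a curie and its synonyms.

    descriptions_by_curie maps each curie in the synonym cluster to its
    original description (which may be None or empty).
    On a length tie, the preferred curie's own description wins.
    """
    candidates = [
        (len(text), curie == preferred_curie, text)
        for curie, text in descriptions_by_curie.items()
        if text and len(text) < max_length
    ]
    if not candidates:
        return ""
    # Tuples compare by length first, then by the preferred-curie flag.
    return max(candidates)[2]


# Example cluster: the first description is the one quoted above; the second
# curie and its text are hypothetical placeholders, not real KG2 data.
example_cluster = {
    "CHEMBL.COMPOUND:CHEMBL1199307": "DISTIGMINE; FULL_MW:416.52; MAX_FDA_APPROVAL_PHASE: 4",
    "EXAMPLE:SYNONYM_CURIE": "A hypothetical longer free-text description taken from a synonym node.",
}
chosen = longest_cluster_description("CHEMBL.COMPOUND:CHEMBL1199307", example_cluster)
```

With this example data, `chosen` would be the placeholder synonym text, since it is longer than the preferred curie's own description.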