EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Synonym and disease gaps in EFO #1241

Closed d0choa closed 3 years ago

d0choa commented 3 years ago

As part of an Open Targets project in collaboration with EPMC, we have processed the entire corpus to detect diseases or phenotypes using NER. In a subsequent step, we ground each label to its corresponding term in EFO using all available names and synonyms.
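The grounding step described above can be sketched roughly as follows. This is a minimal illustration, not the actual Open Targets/EPMC pipeline: the function names (`build_index`, `find_gaps`), the case-insensitive exact-match strategy, and the tiny term dictionary are all assumptions made for the example. The term IDs and names, and the T2DM and UC counts, come from the table in this issue; the synonym list and the count for "ulcerative colitis" are placeholders.

```python
from collections import Counter, defaultdict

# Toy slice of an ontology: term ID -> (name, synonyms).
# IDs and names are from this issue; the synonym list is illustrative only.
TERMS = {
    "EFO_0001360": ("type II diabetes mellitus", ["type 2 diabetes"]),
    "EFO_0000729": ("ulcerative colitis", []),
}


def build_index(terms):
    """Map every case-folded name or synonym to the set of term IDs using it."""
    index = defaultdict(set)
    for term_id, (name, synonyms) in terms.items():
        for text in [name, *synonyms]:
            index[text.casefold()].add(term_id)
    return index


def find_gaps(label_counts, index):
    """Return the labels (with their counts) that match no name or synonym."""
    gaps = Counter()
    for label, occurrences in label_counts.items():
        if label.casefold() not in index:
            gaps[label] = occurrences
    return gaps


INDEX = build_index(TERMS)

# NER label -> number of occurrences in the corpus.
# The "ulcerative colitis" count is a made-up placeholder.
LABEL_COUNTS = {"T2DM": 281184, "ulcerative colitis": 1000, "UC": 166389}

GAPS = find_gaps(LABEL_COUNTS, INDEX)
print(GAPS.most_common())
```

Labels that fall through the index, ranked by frequency, are the candidates listed in the two groups below.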

This process has identified several gaps in EFO: labels that are frequently used in the literature but are not present in EFO, either as terms or as synonyms.

I have curated a first set of highly frequent labels for your consideration and divided them into the following two groups:

  1. Labels that match an existing EFO term and could be added to it as synonyms - 98 terms accounting for 4.7M literature references in up to 897k publications.

| Label occurrences | Distinct PMIDs | Label | EFO name | EFO ID |
|---|---|---|---|---|
| 281184 | 29040 | T2DM | type II diabetes mellitus | EFO_0001360 |
| 214329 | 23081 | BC | breast cancer | MONDO_0007254 |
| 213784 | 19137 | NAFLD | non-alcoholic fatty liver disease | EFO_0003095 |
| 168751 | 15452 | T2D | type II diabetes mellitus | EFO_0001360 |
| 166389 | 21443 | UC | ulcerative colitis | EFO_0000729 |
| 153989 | 10065 | MetS | metabolic syndrome | EFO_0000195 |
| 140109 | 21681 | AMI | acute myocardial infarction | EFO_0008583 |
| 137843 | 16523 | MDD | major depressive disorder | MONDO_0002009 |
| 126968 | 43577 | HIV | HIV infection | EFO_0000764 |
| 122979 | 9840 | TNBC | Triple-negative breast cancer | EFO_0005537 |
| 115053 | 12851 | VTE | venous thromboembolism | EFO_0004286 |
| 107246 | 8324 | PDAC | pancreatic ductal adenocarcinoma | EFO_0002517 |
| 104390 | 23693 | IR | insulin resistance | EFO_0002614 |
| 90575 | 9256 | T1D | type I diabetes mellitus | EFO_0001359 |
| 88229 | 8424 | OSCC | oral squamous cell carcinoma | EFO_0000199 |
| 85396 | 20998 | SCC | squamous cell carcinoma | EFO_0000707 |
| 70477 | 38258 | autoimmunity | autoimmune disease | EFO_0005140 |
| 65469 | 8023 | OS | osteosarcoma | EFO_0000637 |
| 64799 | 11736 | rabies | Rhabdoviridae infectious disease | EFO_0007469 |
| 60836 | 10617 | CIN | cervical intraepithelial neoplasia | MONDO_0022394 |
| 58922 | 16418 | HBV infection | hepatitis B virus infection | EFO_0004197 |
| 57311 | 10697 | AR | allergic rhinitis | EFO_0005854 |
| 52704 | 6803 | T1DM | type I diabetes mellitus | EFO_0001359 |
| 51090 | 4985 | CDI | clostridium difficile infection | EFO_0009130 |
| 50716 | 8927 | OC | ovarian cancer | MONDO_0008170 |
| 48762 | 12803 | H. pylori infection | Helicobacter pylori infectious disease | EFO_1000961 |
| 44065 | 6972 | PTC | papillary thyroid carcinoma | EFO_0000641 |
| 43628 | 4463 | EOC | ovarian carcinoma | EFO_0001075 |
| 42817 | 8579 | CS | Cowden syndrome | Orphanet_201 |
| 42669 | 4719 | VL | visceral Leishmaniasis | EFO_0005045 |
| 42328 | 10782 | cholera | Vibrio infectious disease | EFO_1001235 |
| 42163 | 6352 | CHB | Congenital heart block | Orphanet_60041 |
| 42163 | 6352 | CHB | chronic hepatitis B virus infection | EFO_0004239 |
| 41825 | 10433 | SAH | subarachnoid hemorrhage | EFO_0000713 |
| 41807 | 12310 | HPV infection | human papilloma virus infection | EFO_0001668 |
| 41596 | 28921 | neurodegenerative disorders | neurodegenerative disease | EFO_0005772 |
| 39432 | 11808 | CoV-2 infection | COVID-19 | MONDO_0100096 |
| 38651 | 14778 | iron deficiency | Iron deficiency anemia | HP_0001891 |
| 37339 | 22242 | hepatitis C | hepatitis C virus infection | EFO_0003047 |
| 36992 | 6792 | HS | hidradenitis suppurativa | EFO_1000710 |
| 36992 | 6792 | HS | hippocampal sclerosis of aging | EFO_0005678 |
| 36992 | 6792 | HS | histiocytic sarcoma | MONDO_0019479 |
| 36913 | 4844 | PTB | pulmonary tuberculosis | EFO_1000049 |
| 35904 | 20295 | cognitive dysfunction | cognitive disorder | EFO_1001457 |
| 35279 | 4599 | GD | Gaucher disease | Orphanet_355 |
| 35117 | 9776 | DVT | deep vein thrombosis | EFO_0003907 |
| 34963 | 5317 | LBP | Low back pain | HP_0003419 |
| 34809 | 6760 | NET | neuroendocrine neoplasm | EFO_1001901 |
| 33812 | 4431 | AOM | Otitis media | EFO_0004992 |
| 32514 | 2831 | IgAN | IGA glomerulonephritis | EFO_0004194 |
| 30655 | 9762 | SCI | Spinal cord injury | EFO_1001919 |
| 29510 | 17290 | renal dysfunction | impaired renal function disease | MONDO_0001343 |
| 29066 | 11106 | HCV infection | hepatitis C virus infection | EFO_0003047 |
| 28706 | 13395 | hyperglycaemia | Hyperglycemia | HP_0003074 |
| 25718 | 5123 | HH | hypogonadotropic hypogonadism | MONDO_0018555 |
| 25538 | 4379 | IE | infective endocarditis | MONDO_0000565 |
| 22675 | 3172 | OAB | overactive bladder | EFO_1000781 |
| 22648 | 3292 | EAC | Esophageal adenocarcinoma | EFO_0000478 |
| 22168 | 5702 | EA | Esophageal atresia | HP_0002032 |
| 22096 | 4304 | VAP | Ventilator-associated pneumonia | EFO_1001865 |
| 22092 | 13932 | angina | angina | EFO_0003913 |
| 21712 | 2897 | FM | fibromyalgia | EFO_0005687 |
| 21028 | 2332 | HAT | human african trypanosomiasis | EFO_0005225 |
| 19964 | 4562 | TLE | temporal lobe epilepsy | EFO_0000773 |
| 19886 | 2609 | MCC | Merkel cell skin cancer | EFO_1001471 |
| 19709 | 3635 | SUD | substance-related disorder | MONDO_0002494 |
| 19667 | 2671 | UI | Urinary incontinence | HP_0000020 |
| 19502 | 1920 | SCZ | schizophrenia | EFO_0000692 |
| 19167 | 6692 | malaria infection | malaria | EFO_0001068 |
| 18310 | 1482 | DKD | diabetic nephropathy | EFO_0000401 |
| 18284 | 3163 | AUD | alcohol abuse | MONDO_0002046 |
| 17895 | 1616 | GBC | gallbladder cancer | MONDO_0005411 |
| 17857 | 2371 | ALF | Acute hepatic failure | HP_0006554 |
| 17335 | 6266 | cardiac fibrosis | Myocardial fibrosis | HP_0001685 |
| 17257 | 2258 | MPM | malignant pleural mesothelioma | EFO_0000770 |
| 17176 | 12216 | psychiatric illness | psychiatric disorder | MONDO_0002025 |
| 17157 | 10823 | cardiac dysfunction | heart failure | EFO_0003144 |
| 17053 | 14219 | neurodegenerative disorder | neurodegenerative disease | EFO_0005772 |
| 16992 | 4000 | RSV infection | Respiratory Syncytial Virus Infection | EFO_1001413 |
| 16484 | 3218 | BA | asthma | EFO_0000270 |
| 16484 | 3218 | BA | Biliary atresia | Orphanet_30391 |
| 16409 | 1544 | LSCC | laryngeal squamous cell carcinoma | EFO_0006352 |
| 15810 | 2621 | GHD | Growth hormone deficiency | MONDO_0000050 |
| 15410 | 2728 | PKD | polycystic kidney disease | MONDO_0020642 |
| 15283 | 3964 | PsA | psoriatic arthritis | EFO_0003778 |
| 15091 | 9869 | coronavirus disease | coronavirus infectious disease | EFO_0007224 |
| 15026 | 1927 | LUAD | lung adenocarcinoma | EFO_0000571 |
| 14881 | 3878 | ATL | adult T-cell leukemia/lymphoma | MONDO_0019471 |
| 14773 | 1609 | OLP | oral lichen planus | EFO_0008517 |
| 14344 | 1197 | HZ | Herpes Zoster | EFO_0006510 |
| 12809 | 2395 | DENV infection | dengue disease | EFO_0005547 |
| 12746 | 4832 | haemophilia | hemophilia | Orphanet_448 |
| 12681 | 5664 | influenza virus infection | influenza | EFO_0007328 |
| 12661 | 5009 | TB infection | tuberculosis | Orphanet_3389 |
| 12572 | 6780 | reflux | gastroesophageal reflux disease | EFO_0003948 |
| 11836 | 2856 | T. gondii infection | toxoplasmosis | EFO_0007517 |
| 10078 | 1630 | FGR | fetal growth restriction | EFO_0000495 |
| 9843 | 4926 | POP | pelvic organ prolapse | EFO_0004710 |

  2. Labels for which I didn't find an equivalent EFO term and which might require imports or further curation - 35 terms accounting for >1M occurrences in up to 358k publications.

| Label occurrences | Distinct PMIDs | Label | Likely label | Likely term | Note |
|---|---|---|---|---|---|
| 97169 | 14552 | MCI | Mild cognitive impairment | | To import??? Related to HP_0100543 |
| 79385 | 28194 | psychological distress | | | ?? to import. Related to EFO_0009095 |
| 59747 | 19788 | viremia | | | ?? to import -> HP_0020071 |
| 49764 | 28824 | skin lesions | | | ??Import e.g. NCIT_C158524 |
| 40587 | 6834 | CRS | Cytokine Release Syndrome | | ?? Import -> NCIT_C78251 |
| 40587 | 6834 | CRS | cytoreductive surgery | | ?? Import -> NCIT_C132068 |
| 39633 | 4823 | POAG | Primary open angle glaucoma | | ?? Import -> Related to EFO:0004190 and EFO:1001506 |
| 37077 | 18077 | renal impairment | | | ?? import -> NCIT_C114592 |
| 36992 | 6792 | HS | Hemorrhagic Stroke | | ?? import |
| 33985 | 4651 | CRPC | Castration-Resistant Prostate Carcinoma | | ?? import -> Castration-Resistant Prostate Carcinoma - NCIT:C130234 |
| 32325 | 4425 | CIA | Chemotherapy-induced anemia | | ?? import |
| 32274 | 23047 | arterial hypertension | | | ?? import |
| 30684 | 9132 | TBI | traumatic brain injury | | ?? import --> Related to EFO_0011023 |
| 28954 | 3111 | HFpEF | heart failure with preserved ejection fraction | | ?? import |
| 28137 | 3078 | ILI | influenza-like illness | | ?? import related to AEOAE:0000100 and SNOMED:95891005 |
| 26715 | 17171 | impaired glucose tolerance | impaired glucose tolerance | HP_0040270 | ?? import |
| 24548 | 8775 | anaphylaxis | anaphylaxis | NCIT_C107101 | ?? Import |
| 23507 | 15313 | septicemia | septicemia | NCIT_C3364 | ?? import |
| 22234 | 4692 | varicocele | varicocele | MONDO_0001498 | ?? import |
| 21527 | 13705 | nosocomial infections | nosocomial infection | MONDO_0043544 | ?? Import |
| 20607 | 10948 | effusion | effusion | NCIT_C3003 | ?? to import |
| 19929 | 3281 | alexithymia | alexithymia | MONDO_0000661 | ?? import |
| 19723 | 9249 | miscarriage | Spontaneous abortion | HP_0005268 | ?? import |
| 18940 | 9596 | hepatic encephalopathy | Hepatic Encephalopathy | NCIT_C79596 | ?? import |
| 18654 | 5467 | IGT | impaired glucose tolerance | HP_0040270 | ?? import - Mentioned before as separate mention |
| 17820 | 12916 | left ventricular dysfunction | left ventricular dysfunction | NCIT_C50629 | ?? to import |
| 17277 | 8581 | dyslipidaemia | Dyslipidemia | HP_0003611 | ?? to import |
| 17094 | 2837 | CAC | coronary artery calcium measurement | | ?? import SNOMED450734004 |
| 17047 | 10685 | wound infection | wound infection | NCIT_C45234 | ?? import |
| 16412 | 7907 | endotoxemia | Endotoxemia | OMIT_0019490 | ?? import |
| 15793 | 7448 | hemorrhagic stroke | Hemorrhagic Stroke | | ?? import - see more above (HS) |
| 15737 | 12087 | traumatic brain injury | traumatic brain injury | | ?? import see more above TBI --> Related to EFO_0011023 |
| 15214 | 7488 | delusions | Delusions | HP_0000746 | ?? import |
| 15194 | 4733 | SSI | Surgical Site Infection | NCIT_C112019 | ?? import |
| 10289 | 3744 | gonorrhoea | | | ?? import |

Happy to provide more detail about our findings.

paolaroncaglia commented 3 years ago

Great extraction + curation effort! Some thoughts to optimise incorporation of results, all focused on the first list (suggested synonyms):

d0choa commented 3 years ago

Regarding the first point, it's quite common for dictionaries to contain ambiguous synonyms. For example, we deal with gene/protein dictionaries in which some synonyms (e.g. p55) can refer to multiple entities. That's not a problem with the ontology/dictionary, which simply captures a reality of the semantic space.

Not including ambiguous synonyms will prevent users from grounding labels to EFO. Our take is that disambiguating synonyms or acronyms should be handled downstream by specific applications, rather than being a feature of the ontology. This is already a widespread issue in EFO and other dictionaries, so we are setting up systems to handle it. More info.
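To illustrate the point, here is a hedged sketch of what "ambiguity handled downstream" could look like. It is not our production system: the helper names (`build_candidates`, `ground`) and the word-overlap heuristic are assumptions chosen for brevity. The three HS mappings and the PsA mapping are taken from the first table above. The dictionary keeps every candidate for an acronym, and the application picks among them using the surrounding text.

```python
from collections import defaultdict

# (synonym, term ID, term name) rows, taken from the first table in this issue.
SYNONYMS = [
    ("HS", "EFO_1000710", "hidradenitis suppurativa"),
    ("HS", "EFO_0005678", "hippocampal sclerosis of aging"),
    ("HS", "MONDO_0019479", "histiocytic sarcoma"),
    ("PsA", "EFO_0003778", "psoriatic arthritis"),
]


def build_candidates(rows):
    """Keep every term a synonym may refer to, ambiguous or not."""
    index = defaultdict(list)
    for synonym, term_id, name in rows:
        index[synonym].append((term_id, name))
    return index


def ground(label, context, index):
    """Return the candidate IDs for a label, narrowed by context when possible.

    Hypothetical heuristic: score each candidate by how many words of its
    name also occur in the context, and keep the best-scoring candidates.
    Acronym lookup is deliberately case-sensitive.
    """
    candidates = index.get(label, [])
    if len(candidates) <= 1:
        return [term_id for term_id, _ in candidates]
    context_words = set(context.casefold().split())
    scored = [
        (len(context_words & set(name.casefold().split())), term_id)
        for term_id, name in candidates
    ]
    best = max(score for score, _ in scored)
    if best == 0:
        # No evidence either way: leave the ambiguity to the caller.
        return [term_id for term_id, _ in candidates]
    return [term_id for score, term_id in scored if score == best]


CANDIDATES = build_candidates(SYNONYMS)
print(ground("HS", "patients with hidradenitis suppurativa (HS) lesions", CANDIDATES))
print(ground("HS", "HS was observed in the cohort", CANDIDATES))
```

With this split of responsibilities, the ontology only needs to record that HS is used for all three terms; whether and how to resolve it stays an application decision.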

On a separate note, separating acronyms from synonyms would be highly beneficial for us as well. I know this has been mentioned in the past, so I'm just reiterating that we will have an application for it.

zoependlington commented 3 years ago

For now, I have added all EFO/Orphanet acronyms as synonyms, until we find a better way to represent acronyms separately.

I have also imported all HP and Mondo terms in the second table into EFO. However, we are currently not able to import NCIT and OMIT terms, so if those, or any labels without a suggested term, are required, we will need a new term request for them. I can set up a template for a bulk request if needed.