PathwayCommons / grounding-search

A biological entity grounding search service
MIT License

Expand entity support: complexes & families #130

Closed jvwong closed 1 year ago

jvwong commented 1 year ago

Summary

Goal: Extend grounding support to complexes and families.

Why: Complexes and families are the second largest class of Biofactoid entity grounding errors and are ubiquitous in the literature (protein/gene: 56%; family/complex: 17.7%). This work has been tabled for years.

How: Reuse FamPlex, a curated resource for disambiguating (human) complexes and families.

Background

It is common for researchers to refer to complexes and members of a family. These authors may not be concerned or possibly even aware of the precise individual component(s) of a complex or member(s) of a family to which they refer, but rather wish to convey information about a general class of function or structure. The result is that authors name entities using these broader terms, with the implicit assumption that there are individual components/members.

Example: NF-κB

Nuclear factor kappa-light-chain-enhancer of activated B cells (NF-κB) is a protein complex that controls transcription of DNA, cytokine production and cell survival.

There are five proteins in the mammalian NF-κB family:

| Class | Protein | Aliases | Gene |
|---|---|---|---|
| I | NF-κB1 | p105 → p50 | NFKB1 |
| I | NF-κB2 | p100 → p52 | NFKB2 |
| II | RelA | p65 | RELA |
| II | RelB | | RELB |
| II | c-Rel | | REL |

Various NF-κB complexes

Fig. 1. A general model by which different NF-κB dimers contribute to selectivity of the transcriptional response to an NF-κB-inducing stimulus. The model shown is based on published studies of the selective functions of different NF-κB dimers, as discussed in the text. [Smale, S. T. Dimer-specific regulatory mechanisms within the NF-κB family of transcription factors. Immunol Rev 246, 193–204 (2012).]


In Biofactoid

Complexes and families represent the second largest class of errors in entity grounding. See https://github.com/PathwayCommons/factoid/discussions/1003#discussioncomment-4268282.

Example


NF-κB-p62-NRF2 survival signaling is associated with high ROR1 expression in chronic lymphocytic leukemia. Sanchez-Lopez et al. Cell Death Differ. 2020 Jul;27(7):2206-2216


Phosphorylated RB Promotes Cancer Immunity by Inhibiting NF-κB Activation and PD-L1 Expression. Mol Cell. 2019 Jan 3;73(1):22-35.e6.

Implementation

FamPlex is a resource that helps improve named entity recognition, grounding, and relationship resolution. The repository provides several comma-separated files that can be used to populate our grounding resource (Table I).

Table I. Relationship between FamPlex data and grounding-search fields.

| FamPlex file | Description | Count | grounding-search field(s) |
|---|---|---|---|
| entities.csv | FamPlex namespaced entities | 754 | name/id |
| descriptions.csv | Description text | 431 | summary |
| grounding_map.csv | Synonyms | 2163 | (FamPlex) synonyms |
| equivalences.csv | Mappings | 2489 | (FamPlex) xrefs |
| relations.csv | Components, members | 4711 | ?type? |
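
One way to read Table I is as a recipe for assembling one grounding document per FamPlex entity. The following is a minimal Python sketch of that idea, not the actual grounding-search loader. The column layouts, the sample rows, the `EXAMPLE_NS` namespace, and the helper names (`rows`, `build_documents`) are all assumptions made for illustration; check them against the real FamPlex files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only. The column layouts below are assumptions,
# not a verified description of the FamPlex files.
ENTITIES_CSV = "AMPK\nNFkappaB\n"                              # assumed: one FamPlex ID per line
DESCRIPTIONS_CSV = 'AMPK,"Placeholder description text"\n'     # assumed: id, description
GROUNDING_MAP_CSV = "AMPK,FPLX,AMPK\nNF-kB,FPLX,NFkappaB\n"    # assumed: text, namespace, id
EQUIVALENCES_CSV = "EXAMPLE_NS,EX:0001,AMPK\n"                 # assumed: namespace, external id, FamPlex id


def rows(text):
    """Parse CSV text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def build_documents(entities, descriptions, grounding_map, equivalences):
    """Combine the four files into one dict per FamPlex entity (hypothetical shape)."""
    summaries = {row[0]: row[1] for row in rows(descriptions)}

    # Collect synonym strings that ground to a FamPlex entity.
    synonyms = defaultdict(list)
    for text, namespace, entity_id in rows(grounding_map):
        if namespace == "FPLX":
            synonyms[entity_id].append(text)

    # Collect cross-references to other databases.
    xrefs = defaultdict(list)
    for namespace, external_id, entity_id in rows(equivalences):
        xrefs[entity_id].append({"namespace": namespace, "id": external_id})

    documents = []
    for row in rows(entities):
        entity_id = row[0]
        documents.append({
            "id": entity_id,
            "name": entity_id,
            "summary": summaries.get(entity_id, ""),
            "synonyms": synonyms.get(entity_id, []),
            "xrefs": xrefs.get(entity_id, []),
        })
    return documents


documents = build_documents(
    ENTITIES_CSV, DESCRIPTIONS_CSV, GROUNDING_MAP_CSV, EQUIVALENCES_CSV
)
```

Entities without a description or cross-reference simply get an empty `summary` or `xrefs` list, which matches the counts in Table I (fewer descriptions than entities).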

Top entities referenced

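| Rank | Name | Count |
|---|---|---|
| 1 | ERK | 6301 |
| 2 | AKT | 5839 |
| 3 | NFkappaB | 5768 |
| 4 | TGFB | 2877 |
| 5 | PI3K | 2486 |
| 6 | JNK | 2401 |
| 7 | p38 | 2345 |
| 8 | VEGF | 2326 |
| 9 | Cyclin | 2087 |
| 10 | Wnt | 1622 |
| 11 | Integrins | 1498 |
| 12 | RAS | 1402 |
| 13 | Actin | 1299 |
| 14 | PKC | 1234 |
| 15 | PKA | 1058 |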

Caveats

References

jvwong commented 1 year ago

Entity type

When it comes to integrating with the factoid model, it's important to assign some sort of 'type' to a FamPlex entity. To attempt this, we will use the FamPlex-provided relations.csv containing:

A first attempt follows a simple heuristic: a complex (namedComplex) is an entity that has at least one other entity that is partof it. Otherwise, it is a family.


Example below for "AMPK":

AMPK
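
The heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-column layout (child namespace, child id, relation, parent namespace, parent id) and the sample rows are assumed rather than copied from relations.csv, the function names are hypothetical, and the `"family"` label is a placeholder for whatever type name factoid ends up using.

```python
import csv
import io

# Illustrative rows, not copied from the real file. Assumed layout:
# child namespace, child id, relation, parent namespace, parent id.
RELATIONS_CSV = """\
FPLX,AMPK_alpha,partof,FPLX,AMPK
FPLX,AMPK_beta,partof,FPLX,AMPK
FPLX,AMPK_gamma,partof,FPLX,AMPK
HGNC,PRKAA1,isa,FPLX,AMPK_alpha
HGNC,PRKAA2,isa,FPLX,AMPK_alpha
"""


def complexes_from_relations(text):
    """Return the FamPlex IDs that have at least one 'partof' child."""
    complex_ids = set()
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        child_ns, child_id, relation, parent_ns, parent_id = row
        if relation == "partof" and parent_ns == "FPLX":
            complex_ids.add(parent_id)
    return complex_ids


def entity_type(entity_id, complex_ids):
    """Apply the heuristic: complex if something is 'partof' it, else family."""
    # "family" is a placeholder label, not a confirmed factoid type name.
    return "namedComplex" if entity_id in complex_ids else "family"


complex_ids = complexes_from_relations(RELATIONS_CSV)
ampk_type = entity_type("AMPK", complex_ids)
ampk_alpha_type = entity_type("AMPK_alpha", complex_ids)
```

With these sample rows, AMPK is typed as a complex because its subunits are partof it, while AMPK_alpha, which only has isa members, falls through to the family case.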

jvwong commented 1 year ago

Turning this into an itemized issue, to stage the changes and reduce risk (in particular, changes in factoid):