clulab / reach

Reach Biomedical Information Extraction

Prefer human groundings in grounding phase 1 #110

Closed: hickst closed this issue 8 years ago

hickst commented 8 years ago

We need to sort the returned candidates so that human groundings come first, then groundings with an empty species, then all other species.
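The requested ordering can be sketched as a three-tier sort key. This is only an illustration in Python, not Reach's actual code: the `Candidate` type, the `HUMAN_LABELS` set, and the alphabetical tie-break within a tier are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for a grounding candidate (not Reach's own class)."""
    text: str
    namespace: str
    kb_id: str
    species: str  # "" means the KB entry carries no species


# Assumption: these labels identify a human entry.
HUMAN_LABELS = {"human", "homo sapiens"}


def species_rank(species: str) -> int:
    """Return 0 for human, 1 for an empty species, 2 for any other species."""
    label = species.strip().lower()
    if label in HUMAN_LABELS:
        return 0
    if not label:
        return 1
    return 2


def sort_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates by tier; ties are broken by species name, then by ID."""
    return sorted(
        candidates,
        key=lambda c: (species_rank(c.species), c.species.lower(), c.kb_id),
    )


# Placeholder IDs, invented for the example.
example = [
    Candidate("abc", "kb", "ID-MOUSE", "mouse"),
    Candidate("abc", "kb", "ID-NONE", ""),
    Candidate("abc", "kb", "ID-HUMAN", "homo sapiens"),
]
ranked = sort_candidates(example)
```

With this key, the first element of the sorted list is the preferred grounding. If no human or species-less candidate exists, the alphabetically first species wins.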

myedibleenso commented 8 years ago

I'm not sure this is fixed:

Similarly , EHT1864 , a direct inhibitor of RAC but not CDC42 activation , dose-dependently inhibited AKT phosphorylation induced by LPA and S1P ( E ) , but not EGF , PDGF , or insulin ( F ) .

mention text: AKT
List(Gene_or_gene_product, MacroMolecule, BioChemicalEntity, BioEntity, Entity, PossibleController)
    ------------------------------
    Rule => ner-gene_or_gene_product-entities
    Type => CorefTextBoundMention
    ------------------------------
    Protein|List(Gene_or_gene_product, MacroMolecule, BioChemicalEntity, BioEntity, Entity, PossibleController) => AKT
    grounding: KBResolution(akt, uniprot, P54644, dictyostelium discoideum)
    ------------------------------

uniprot entry for P54644: http://www.uniprot.org/uniprot/P54644
desired uniprot entry (P31749): http://www.uniprot.org/uniprot/P31749

Is this the same issue, or is something else causing it?

hickst commented 8 years ago

I believe this issue has been fixed. Uniprot does not list AKT as a human protein. Dicty is alphabetically the first of the candidate groundings from our uniprot protein KB. Candidates for the string AKT are:

KBResolution(akt, uniprot, P54644, dictyostelium discoideum)
KBResolution(akt, uniprot, Q8INB9, drosophila melanogaster)
KBResolution(akt, uniprot, Q8INB9, fruit fly)
KBResolution(akt, uniprot, P31750, mouse)
KBResolution(akt, uniprot, P31750, mus musculus)
KBResolution(akt, uniprot, P54644, slime mold)

Note that your 'desired' entry is for the protein AKT1, which IS in Uniprot as a human protein. Typing 'AKT1' into the ReachShell gives:

mention text: AKT1
List(Gene_or_gene_product, MacroMolecule, BioChemicalEntity, BioEntity, Entity, PossibleController)
Rule => ner-gene_or_gene_product-entities
Type => CorefTextBoundMention
------------------------------
Protein|List(Gene_or_gene_product, MacroMolecule, BioChemicalEntity, BioEntity, Entity, PossibleController) => AKT1
grounding: KBResolution(akt1, uniprot, P31749, homo sapiens)

myedibleenso commented 8 years ago

Oh, I see. My mistake...from reading the description under "Function", it sounds like AKT is used synonymously with AKT1. I'm surprised there isn't such an entry in those we derived from the uniprot entries + listed synonyms.

hickst commented 8 years ago

I think you're encountering what makes these identifications so hard. The papers use many kinds of lexical variation: aliases, nicknames, hyponymy, metonymy, and (probably) meronymy. Our KBs have, at best, some synonym strings for some entries.

bgyori commented 8 years ago

I think the correct answer here would be to ground AKT to a human protein family. Unfortunately, entries in the protein family databases don't usually correspond to what authors think about when they use non-specific protein names like "AKT", "RAF", "MEK" or "ERK". For AKT, the "correct" answer in my view would be to ground it to a protein family that resolves to the isoforms AKT1, AKT2 and AKT3. I know of one structured source that does this:

http://resource.belframework.org/belframework/1.0/resource/protein-families.bel

Here the entry PFH:"AKT Family" resolves to HGNC:AKT1, HGNC:AKT2, HGNC:AKT3.

hickst commented 8 years ago

Thanks Ben!....a very interesting resource. To use it, it seems like we would need the second half: the mapping of the individual proteins from their BEL designations to info about them (and maybe even to their corresponding Uniprot IDs). We will definitely think about how we can incorporate this KB but it may require a bit of new infrastructure to handle it.
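The "second half" mentioned here can be pictured as a two-step lookup: family name to member symbols, then member symbol to a Uniprot ID. The sketch below is in Python and is purely illustrative. The table names and `resolve_family` are hypothetical, and the only symbol-to-ID pair filled in is the one that appears in this thread (AKT1, P31749); the other members are deliberately left unmapped to show the gap.

```python
from typing import Dict, List, Optional

# Family membership as described for the BEL resource in this thread.
FAMILY_MEMBERS: Dict[str, List[str]] = {
    "AKT Family": ["AKT1", "AKT2", "AKT3"],
}

# Hypothetical symbol-to-Uniprot table. Only AKT1 is filled in, because it is
# the only pairing stated in this thread; a real table would come from a KB.
SYMBOL_TO_UNIPROT: Dict[str, str] = {
    "AKT1": "P31749",
}


def resolve_family(family: str) -> Dict[str, Optional[str]]:
    """Map each member of a family to its Uniprot ID, or None if unmapped."""
    members = FAMILY_MEMBERS.get(family, [])
    return {symbol: SYMBOL_TO_UNIPROT.get(symbol) for symbol in members}


resolved = resolve_family("AKT Family")
missing = [symbol for symbol, uid in resolved.items() if uid is None]
```

Members that come back as `None` are exactly the entries the extra infrastructure would have to supply.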

bgyori commented 8 years ago

Great! This could be relevant for many other cases, for instance, RAS. Currently the system grounds RAS to IPR020849, which is correct, but InterPro doesn't really tell you which proteins are members of the family. This makes downstream assembly/analysis difficult. Again, in this OpenBEL resource, it is clear that RAS resolves to the HRAS, KRAS and NRAS isoforms:

p(PFH:"RAS Family") hasMembers list(p(HGNC:HRAS), p(HGNC:KRAS), p(HGNC:NRAS))
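To make the membership explicit for downstream code, a statement of this shape could be parsed into a family-to-members map. The following Python sketch handles only the single `hasMembers list(...)` form quoted in this comment; it is not a general BEL parser, and the function names and regular expression are assumptions made for illustration.

```python
import re
from typing import List, Tuple

Term = Tuple[str, str]  # (namespace, name)

# Matches p(NS:NAME) or p(NS:"NAME WITH SPACES").
TERM_RE = re.compile(r'p\(\s*(\w+)\s*:\s*(?:"([^"]+)"|([\w.-]+))\s*\)')


def parse_terms(text: str) -> List[Term]:
    """Extract every protein term found in the text, in order."""
    return [(m.group(1), m.group(2) or m.group(3)) for m in TERM_RE.finditer(text)]


def parse_has_members(statement: str) -> Tuple[Term, List[Term]]:
    """Split a hasMembers statement into (family term, member terms)."""
    subject, sep, obj = statement.partition(" hasMembers ")
    if not sep:
        raise ValueError("not a hasMembers statement")
    families = parse_terms(subject)
    if len(families) != 1:
        raise ValueError("expected exactly one family term as the subject")
    return families[0], parse_terms(obj)


statement = (
    'p(PFH:"RAS Family") hasMembers '
    "list(p(HGNC:HRAS), p(HGNC:KRAS), p(HGNC:NRAS))"
)
family, members = parse_has_members(statement)
family_to_members = {family[1]: [name for _, name in members]}
```

Here `family_to_members` maps "RAS Family" to its three HGNC symbols, which is the information InterPro's family ID alone does not provide.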

MihaiSurdeanu commented 8 years ago

Thanks Ben! This is very useful! We will try to integrate this soon. Mihai
