clulab / reach

Reach Biomedical Information Extraction

Issues with grounding protein families #40

Closed bgyori closed 8 years ago

bgyori commented 9 years ago

When processing PMC4099524, a paper on Ras/Raf signaling in human cancer, Ras is grounded as Uniprot:A0PDV5, which is Rosmarinate synthase from Plectranthus scutellarioides (Coleus). The confusion might come from the fact that Ras is a protein family (which includes HRAS, KRAS and NRAS) and therefore does not appear in Uniprot in this form. However, there are protein family databases, for instance PFAM, which does have the right entry for Ras (http://pfam.xfam.org/family/PF00071). The same issue arises with Erk (again a protein family), which is grounded to the incorrect Uniprot:P29323. Similarly, Raf, which in this form refers to a protein family, is grounded to the more specific RAF1 (Uniprot:P04049).

hickst commented 9 years ago

Thanks for the report. This issue has come up several times before so we are aware of it and are working to improve the situation. Unfortunately, there are several factors that are currently preventing a truly satisfactory solution. First, the Uniprot KB that we are using does contain several entries for the string "RAS" (and, from your report, it looks like we are resolving to the first one):

RAS	Coleus	A0PDV5
RAS	Colletotrichum trifolii	O42785
ras	Common dab	Q91079
RAS	Cryptococcus neoformans var. neoformans serotype D (strain B-3501A)	P0CQ43
RAS	Cryptococcus neoformans var. neoformans serotype D (strain JEC21 / ATCC MYA-565)	P0CQ42

Secondly, we currently check proteins before protein families, so we resolve to the more specific case. If our resolution process were able to reach the protein families, it would more correctly resolve to the InterPro entry IPR020849: http://www.ebi.ac.uk/interpro/entry/IPR020849
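The effect of the lookup order described above can be illustrated with a minimal sketch. This is not Reach's grounding code: the function `ground`, the row layout, and the case-insensitive match are assumptions made for illustration. The only data used are the UniProt rows and the InterPro entry quoted in this comment.

```python
# Rows copied from the list quoted above: (name, species, UniProt accession).
UNIPROT_ROWS = [
    ("RAS", "Coleus", "A0PDV5"),
    ("RAS", "Colletotrichum trifolii", "O42785"),
    ("ras", "Common dab", "Q91079"),
    ("RAS", "Cryptococcus neoformans var. neoformans serotype D (strain B-3501A)", "P0CQ43"),
    ("RAS", "Cryptococcus neoformans var. neoformans serotype D (strain JEC21 / ATCC MYA-565)", "P0CQ42"),
]
# The single protein family entry mentioned in this comment.
INTERPRO_ROWS = [("Ras", "protein family", "IPR020849")]


def ground(text, knowledge_bases):
    """Return (namespace, id) for the first matching row in the first KB that matches.

    Matching is case-insensitive, which is an assumption of this sketch.
    """
    key = text.lower()
    for namespace, rows in knowledge_bases:
        for name, _description, identifier in rows:
            if name.lower() == key:
                return namespace, identifier
    return None


protein_first = [("uniprot", UNIPROT_ROWS), ("interpro", INTERPRO_ROWS)]
family_first = [("interpro", INTERPRO_ROWS), ("uniprot", UNIPROT_ROWS)]

print(ground("Ras", protein_first))  # ('uniprot', 'A0PDV5')
print(ground("Ras", family_first))   # ('interpro', 'IPR020849')
```

With proteins checked first, the search stops at the first UniProt row (the Coleus entry) and never reaches the family knowledge base.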

Several things that we are actively working on should help this issue. We are working on adding "context" to the system, which includes species information that could be used to do a better job of grounding. We are also currently working on a rewrite of the grounding system, which was originally intended only as a temporary (stop-gap) program. We hope to have all of this integrated for a release in the next few weeks.

MihaiSurdeanu commented 9 years ago

Ben, Thanks for the report. Two questions for you:

  1. Do you think it's preferable to resolve protein families before proteins? As Tom mentioned, our thinking was to attempt to resolve to the more specific knowledge base (Uniprot) first.
  2. Do you know if PFAM is better than InterPro for the resolution of protein families? We are currently using the latter, but it would be easy to change.

bgyori commented 9 years ago

1) Unfortunately, I don't think flipping the priority of protein family vs. protein will completely solve the problem in general, although in the concrete cases I mentioned it would actually work. This is because, for instance, BRAF, MEK1 or HRAS will not match any protein family in InterPro, whereas Raf, Mek or Ras each match multiple out-of-context proteins in Uniprot.

Having said that, the long-term solution will have to be based on collecting "clues" from context. To a human reader, it is obvious in which sense these terms are used in this particular paper, and in principle it should be possible to figure out automatically which grounding to prefer. On top of this, there is one more difficulty, namely choosing which organism to assign a particular protein to. For instance, the human, mouse, chicken, etc. versions of BRAF are all in Uniprot, and again the choice between these has to be informed by context.

2) I think InterPro is perfectly fine, I don't have a strong preference between it and PFAM.
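The organism choice raised in point 1) could be sketched roughly as follows: count the species mentioned in the surrounding text and prefer the candidate entry for the most frequently mentioned one. Everything here is an assumption for illustration, including the function `pick_by_species`, the candidate layout, and the placeholder identifiers, which are not real UniProt accessions.

```python
from collections import Counter

# Hypothetical candidate list: one entry per organism for the same protein name.
# The "id" values are placeholders, not real UniProt accessions.
CANDIDATES = {
    "braf": [
        {"id": "PLACEHOLDER_BRAF_HUMAN", "species": "human"},
        {"id": "PLACEHOLDER_BRAF_MOUSE", "species": "mouse"},
        {"id": "PLACEHOLDER_BRAF_CHICKEN", "species": "chicken"},
    ]
}


def pick_by_species(text, context_species):
    """Prefer the candidate whose species is mentioned most often in the context."""
    candidates = CANDIDATES.get(text.lower(), [])
    if not candidates:
        return None
    counts = Counter(s.lower() for s in context_species)
    # max() keeps the first candidate on ties, so list order acts as the fallback.
    return max(candidates, key=lambda c: counts[c["species"]])


# Species mentions as they might be extracted from the paper (assumed input).
print(pick_by_species("BRAF", ["human", "human", "mouse"])["id"])
```

With no species mentions at all, every candidate scores zero and the first one listed is returned, so the list order would need to encode a sensible default.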

MihaiSurdeanu commented 8 years ago

Thanks Ben!

Can you send us a few examples of such clues that the human reader uses? I'd like to include those in our system, if we can.

bgyori commented 8 years ago

I think the strongest anchor humans use is prior knowledge about pathways. For instance, in the title, it says "Raf/Mek/Erk pathway". We immediately know that those proteins (or rather protein families) form one of the most widely studied signaling cascades. So while Mek may resolve to other proteins in Uniprot, in the context of Raf and Erk it can be narrowed to a single obvious choice.
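As a rough illustration of the pathway clue, one could score each candidate sense of a name by how many other members of its pathway are mentioned in the same document. This is only a sketch under stated assumptions: the pathway table, the sense identifiers, and the function names are hypothetical and not part of Reach.

```python
# Hypothetical prior knowledge: pathway name -> member names (lowercased).
PATHWAY_MEMBERS = {"Raf/Mek/Erk": {"raf", "mek", "erk"}}

# Hypothetical senses for an ambiguous name; the ids are placeholders.
SENSES = {
    "mek": [
        {"id": "PLACEHOLDER_UNRELATED_MEK", "pathways": set()},
        {"id": "PLACEHOLDER_MEK_FAMILY", "pathways": {"Raf/Mek/Erk"}},
    ]
}


def pathway_score(sense, co_mentions):
    """Count co-mentioned names that belong to any pathway of this sense."""
    seen = {m.lower() for m in co_mentions}
    return sum(len(PATHWAY_MEMBERS[p] & seen) for p in sense["pathways"])


def pick_by_pathway(text, co_mentions):
    """Return the sense best supported by co-mentioned pathway members."""
    senses = SENSES.get(text.lower(), [])
    if not senses:
        return None
    # On ties (for example, no co-mentions) the first listed sense wins.
    return max(senses, key=lambda s: pathway_score(s, co_mentions))


print(pick_by_pathway("Mek", ["Raf", "Erk"])["id"])  # PLACEHOLDER_MEK_FAMILY
```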

The next thing would be statements in the text about the role of a given entity. The first sentence says "The Raf/Mek/Erk signal transduction pathway is the best studied of the four mitogen-activated protein kinase (MAPK) cascades present in vertebrates", so from this we know that Raf, Mek and Erk are mitogen-activated protein kinases. Another example would be a phrase like "the receptor MET", which would indicate that one should look for a receptor.

A more global clue would be the overall theme of the paper. This paper, for instance, makes many statements about signaling, kinases, human cancer, cancer therapy, cancer driving mutations, etc. So Raf, Mek and Erk are likely entities that are associated with these terms.
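A theme clue like this could be approximated, very crudely, by keyword overlap between the paper's text and a short description of each candidate grounding. The sketch below assumes hand-written keyword sets (these are illustrative, not taken from UniProt or InterPro records) and hypothetical function names; the two identifiers are the ones discussed earlier in this thread.

```python
import re

# Candidate groundings for "Ras" discussed in this thread, with assumed keywords.
CANDIDATES = [
    {"id": "A0PDV5", "keywords": {"rosmarinate", "synthase", "coleus"}},
    {"id": "IPR020849", "keywords": {"signaling", "cancer", "gtpase"}},
]


def theme_score(candidate, document_text):
    """Number of candidate keywords that occur as words in the document."""
    tokens = set(re.findall(r"[a-z]+", document_text.lower()))
    return len(candidate["keywords"] & tokens)


def pick_by_theme(candidates, document_text):
    """Return the candidate with the largest keyword overlap (first wins on ties)."""
    return max(candidates, key=lambda c: theme_score(c, document_text))


text = "This paper makes many statements about signaling, kinases, human cancer and cancer therapy."
print(pick_by_theme(CANDIDATES, text)["id"])  # IPR020849
```

A real system would likely combine such a score with the pathway and role clues rather than rely on any one of them alone.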

MihaiSurdeanu commented 8 years ago

Perfect. Thank you! Mihai

hickst commented 8 years ago

We should be using PFAM IDs for family grounding.

hickst commented 8 years ago

Changed Reach to use PFAM for identifying protein families in pull request #158