clulab / reach

Reach Biomedical Information Extraction

Re-evaluate & refactor dash patterns in grounding. #376

Closed MihaiSurdeanu closed 7 years ago

MihaiSurdeanu commented 7 years ago

We currently tag "Oct-4", "Sep-4", and "May-4" (maybe others too) as entities. Grounding finds some id as well. At least for "May-4" it is incorrect (it uses the string "4").

johnbachman commented 7 years ago

Not sure if this applies here, but one infamous problem is that the gene Oct4 (https://en.wikipedia.org/wiki/Oct-4) gets turned into a date by Microsoft Excel (trying to be helpful). It's possible that the curated data used to train the NER suffered from this problem. Though I'm not sure where 'Sep-4' and 'May-4' would come from.

hickst commented 7 years ago

Oct-4 is a legitimate protein found in mice, pigs, monkeys, and humans (Q01860). The Oct- family includes at least a dozen other proteins whose names would also fit this date pattern.

Searches of our KBs show that the Oct- pattern is also a prefix string on 60 entries in the PubChem, ChEBI, HGNC, and HMDB KBs (so if it were parsed as a date, those strings would presumably fail to match correctly).

Searches of our KBs for analogous month-date patterns for real entities find Mar, Apr, Aug, Sep, Nov, and Dec (6/12) all have this issue.

Given the prevalence of these valid protein name patterns and the uncommon nature of this date format, I advise that we err on the side of treating these few dates as proteins.
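
For reference, the month-date shape discussed above can be sketched as a regular expression. This is only an illustration: `MonthDashSketch`, `MonthDayPat`, and `looksLikeMonthDay` are hypothetical names, not part of Reach, and the choice of three-letter English month abbreviations followed by one or two digits is an assumption about what counts as "this date pattern".

```scala
object MonthDashSketch {
  // Three-letter English month abbreviations (assumed spelling and casing).
  private val months =
    Seq("Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

  // <month>-<one or two digits>; a Regex used as an extractor must match
  // the whole string, so "Oct-4" matches but "Oct-4A" does not.
  val MonthDayPat = (months.mkString("(", "|", ")") + """-(\d{1,2})""").r

  // True when a name could be misread as a month-day date.
  def looksLikeMonthDay(name: String): Boolean = name match {
    case MonthDayPat(_, _) => true
    case _                 => false
  }
}
```

Under these assumptions, "Oct-4", "Sep-4", and "May-4" all have the date shape, while a name such as "Raf1-RBD" does not.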

MihaiSurdeanu commented 7 years ago

I tend to agree. But why is the string "4" used to ground "May-4"?

hickst commented 7 years ago

That's an odd one, but there is apparently a gene named '4' listed in Uniprot (also listed in the HGNC gene KB), and since we use gene names as synonyms for protein names, we have 32 entries in our Uniprot KB for proteins from this gene.

MihaiSurdeanu commented 7 years ago

Ok. But we shouldn't use "4" to ground "May-4"...

hickst commented 7 years ago

I see your point, since the parser is not identifying the '4' as a separate entity. It turns out that this is being grounded as a protein because of yet another of our heuristics: we have a pattern that applies to dashed conjunctions (May-4 matches). If the right-hand side is a protein domain, it grounds the LHS as the protein; otherwise it assumes the LHS is a mutant spec and grounds the RHS:

    return text match {
      // check for RHS protein domain or LHS mutant spec: return protein portion only
      case HyphenatedNamePat(lhs, rhs) => if (isProteinDomain(rhs)) lhs else rhs

I can think of 3 different ways to handle this. Let's discuss it at the weekly meeting.
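
As a rough illustration of why "May-4" ends up grounded via "4", here is a minimal, self-contained sketch of the branch quoted above. The regex behind `HyphenatedNamePat`, the contents of `knownDomains`, the wrapper name `proteinPortion`, and the fall-through case are all assumptions made for illustration; they are not the actual Reach definitions.

```scala
object DashHeuristicSketch {
  // Assumed pattern: two runs of word characters separated by a single dash.
  val HyphenatedNamePat = """(\w+)-(\w+)""".r

  // Hypothetical stand-in for the real protein-domain lookup.
  private val knownDomains = Set("rbd", "sh2", "sh3")
  def isProteinDomain(s: String): Boolean = knownDomains.contains(s.toLowerCase)

  // Same branch as the quoted line: keep the LHS when the RHS is a domain,
  // otherwise treat the LHS as a mutant spec and keep the RHS.
  def proteinPortion(text: String): String = text match {
    case HyphenatedNamePat(lhs, rhs) => if (isProteinDomain(rhs)) lhs else rhs
    case _                           => text // assumed fall-through: no dash
  }
}
```

With these stand-ins, `proteinPortion("May-4")` yields "4" because "4" is not a known domain, while `proteinPortion("Nck-SH3")` yields "Nck". Any dashed name whose RHS is not in the domain list takes the mutant-spec branch, whether or not its LHS actually is a mutant spec.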

MihaiSurdeanu commented 7 years ago

Eliminate 4 from Uniprot for now.

hickst commented 7 years ago

I did some more checking and, sadly, this won't work as there are about 96 gene/protein names which are just numbers (in Uniprot Proteins KB alone). I think our LHS-RHS heuristic has just been OBE (overcome by events). We need to revisit its purpose(s) and update the code.

MihaiSurdeanu commented 7 years ago

Can you please list those genes here? If they are from species we don't care about, can we remove them all?

hickst commented 7 years ago

Remove all 96 genes? They are responsible for 1671 entries in the Uniprot KB. While only 22 of those entries explicitly list Human, almost all of the others are proteins found in viruses, including some important human viruses (e.g., HHV-3, Varicella-zoster virus) and at least one virus related to cancer (Herpesvirus saimiri). The majority of the entries relate to bacteriophages, but those types of viruses have important medical applications in diagnostics and testing, wound treatment, anti-bacterial food treatments, proteomic studies, and as model organisms in evolutionary studies.

I don't think removing the genes is the correct way to handle this. I think we need to look at whether this heuristic is still needed and, if so, what we need it to accomplish (i.e. revisit its purpose and update or replace it).

MihaiSurdeanu commented 7 years ago

Ok. Then we should understand when this grounding heuristic is triggered. Can you please run Reach (or the NER + grounding) and flag the entities that are grounded after this heuristic is applied?

I suspect we introduced these rules to extract exactly the prefixes identified by Harvard. So it may no longer be needed.

hickst commented 7 years ago

I think that is part of what it did. Another part is that it claims to identify proteins tagged with their domains as suffixes (e.g. Nck-SH3). A third use-case is the identification of <mutant>-<protein> lexical conjunctions which, according to Dane, may no longer occur because of changed tokenization. Anyway, I will look more into what is actually being matched by this heuristic pattern.

hickst commented 7 years ago

A run on the DR3 papers shows that every one of the strings which matched the LHS-RHS pattern was an example of case (2): extracting the protein from a <protein>-<domain> string.

    HNP: text: 'Raf1-RBD', rhs: 'RBD'
    HNP: text: 'Raf1-RBD', rhs: 'RBD'
    HNP: text: 'Raf1-RBD', rhs: 'RBD'
    HNP: text: 'ASPP1-RAS', rhs: 'RAS'
    HNP: text: 'SAF-CAT', rhs: 'CAT'
    HNP: text: 'SAF-CAT', rhs: 'CAT'
    HNP: text: 'Dora-Ras', rhs: 'Ras'
    HNP: text: 'Gal-Elk', rhs: 'Elk'
    HNP: text: 'Gal-Elk', rhs: 'Elk'
    HNP: text: 'Gal-Sap', rhs: 'Sap'
    HNP: text: 'GST-RBD', rhs: 'RBD'
    HNP: text: 'GST-RBD', rhs: 'RBD'
    HNP: text: 'GST-RBD', rhs: 'RBD'
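
To review output like this in bulk, the lines can be tallied by their RHS. The sketch below assumes only the line format shown above; `HnpLogTally`, `HnpLine`, and `tallyByRhs` are hypothetical names, and the embedded lines are copied from the DR3 output.

```scala
object HnpLogTally {
  // Matches the log format shown above: HNP: text: '<full>', rhs: '<rhs>'
  val HnpLine = """HNP: text: '([^']*)', rhs: '([^']*)'""".r

  // The 13 lines from the DR3 run, copied from the output above.
  val dr3Log: Seq[String] = Seq(
    "HNP: text: 'Raf1-RBD', rhs: 'RBD'",
    "HNP: text: 'Raf1-RBD', rhs: 'RBD'",
    "HNP: text: 'Raf1-RBD', rhs: 'RBD'",
    "HNP: text: 'ASPP1-RAS', rhs: 'RAS'",
    "HNP: text: 'SAF-CAT', rhs: 'CAT'",
    "HNP: text: 'SAF-CAT', rhs: 'CAT'",
    "HNP: text: 'Dora-Ras', rhs: 'Ras'",
    "HNP: text: 'Gal-Elk', rhs: 'Elk'",
    "HNP: text: 'Gal-Elk', rhs: 'Elk'",
    "HNP: text: 'Gal-Sap', rhs: 'Sap'",
    "HNP: text: 'GST-RBD', rhs: 'RBD'",
    "HNP: text: 'GST-RBD', rhs: 'RBD'",
    "HNP: text: 'GST-RBD', rhs: 'RBD'"
  )

  // Count how often each RHS string occurs; lines in any other format are skipped.
  def tallyByRhs(lines: Seq[String]): Map[String, Int] =
    lines
      .collect { case HnpLine(_, rhs) => rhs }
      .groupBy(identity)
      .map { case (rhs, hits) => rhs -> hits.size }
}
```

For the 13 lines above, `tallyByRhs(dr3Log)` gives RBD 6, CAT 2, Elk 2, and one each for RAS, Ras, and Sap. The tally is case-sensitive, so 'RAS' and 'Ras' are counted separately.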

Another run, on the 100 summer evaluation papers, shows that 35 of 52 were protein domain suffixes. The remaining 17 instances should not have reached the heuristic, as they were either F-actin (a protein polymer?) or instances of alpha-, beta-, or gamma-tubulin proteins. These passed through the rule correctly but were incorrectly grounded; they need to be added to the Override KB.

MihaiSurdeanu commented 7 years ago

Maybe we should build a list of domains and mutation patterns, and apply the heuristic only in those cases?
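
One way to read this suggestion is as a gated version of the heuristic that fires only on a known domain suffix or a recognizable mutant prefix, and otherwise leaves the text alone. The sketch below is an assumption-laden illustration, not Reach code: the domain list, the `MutantPat` regex (a letter, digits, a letter, as in "G12V"), and all names are hypothetical placeholders.

```scala
object GatedDashHeuristic {
  // Assumed pattern: two runs of word characters separated by a single dash.
  val HyphenatedNamePat = """(\w+)-(\w+)""".r

  // Hypothetical placeholder lists; real ones would have to be curated.
  private val knownDomains = Set("rbd", "sh2", "sh3")
  private val MutantPat = """[A-Z]\d+[A-Z]""".r // e.g. G12V (assumed shape)

  def isProteinDomain(s: String): Boolean = knownDomains.contains(s.toLowerCase)
  def isMutantSpec(s: String): Boolean = MutantPat.pattern.matcher(s).matches()

  // Strip only when one side is positively identified; otherwise keep the
  // whole string so it can be looked up as-is.
  def proteinPortion(text: String): String = text match {
    case HyphenatedNamePat(lhs, rhs) if isProteinDomain(rhs) => lhs
    case HyphenatedNamePat(lhs, rhs) if isMutantSpec(lhs)    => rhs
    case _                                                   => text
  }
}
```

Under these placeholders, "Raf1-RBD" still reduces to "Raf1" and a mutant-prefixed name such as "G12V-KRAS" reduces to "KRAS", but "May-4" and "Oct-4" pass through unchanged instead of being reduced to "4".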

MihaiSurdeanu commented 7 years ago

Will be addressed once the Harvard prefixes are used.

hickst commented 7 years ago

I think the investigation and discussion above show that this will not be solved by the Harvard prefix enhancements, because those are a new pattern involving the dash. None of the cases found in the test above were Harvard prefixes. We still need to re-evaluate and refactor the dash handling in grounding (simultaneously with stripping Harvard prefixes). I am renaming this task to reflect this additional work.

hickst commented 7 years ago

Closed by pull request #491