clulab / reach

Reach Biomedical Information Extraction
Other
97 stars 39 forks source link

Entity boundary issue related to sites #722

Closed bgyori closed 3 years ago

bgyori commented 3 years ago

A very common entity boundary issue currently happens with cyclins, which can both be described as a family ("cyclin") or specific proteins like "cyclin A1". Currently, in many of these cases, the entity boundary is cropped to just "cyclin", even if a specific type like "A1" or "D1" follows.

Looking at bioresources, in hgnc.tsv.gz we have

cyclin A1»··P78396»·Human 

and cyclin A1 also appears in ner/Gene_or_gene_product.tsv.gz. Still, when putting cyclin A1 into the Reach shell, we get

>>> cyclin A1

sentence #0
TEXT:   cyclin A1
TOKENS: (0,cyclin,NN), (1,A1,NN)
ENTITY LABELS: (cyclin,B-Gene_or_gene_product), (A1,I-Gene_or_gene_product)

LEMMAS: cyclin a1
roots: 1
outgoing:
    0:
    1: (0,compound)
incoming:
    0: (1,compound)
    1:

ENTITIES: 2

MENTION TEXT:  cyclin
LABELS:        List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)
DISPLAY LABEL: Protein
    ------------------------------
    RULE => ner-gene_or_gene_product-entities
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: Cyclin, fplx, Cyclin, human, <IMKBMetaInfo: uaz, NER-Grounding-Override.tsv.gz, , , sp=false, f=false, p=false>>

CONTEXT: NONE
    ------------------------------

MENTION TEXT:  A1
LABELS:        List(Site)
DISPLAY LABEL: Site
    ------------------------------
    RULE => site_1letter_a
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: A1, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE
    ------------------------------

EVENTS:   0

==================================================

So it appears that NER works correctly, since if I understand correctly, (cyclin,B-Gene_or_gene_product), (A1,I-Gene_or_gene_product) means that a single entity is detected, but it then gets broken up into a Protein and a Site. (One additional note: I checked whether this behavior has anything to do with the fact that "cyclin" is also defined as an override in NER-Grounding-Override.tsv.gz, but that doesn't seem to be the case, the same behavior happens if I remove the override).

MihaiSurdeanu commented 3 years ago

Thanks @bgyori ! I think this behavior was on purpose based on what we saw when we developed it. We observed a while ago that the NER "eats" into descriptions of sites because of the same overlap you observed. So we allowed the Site detector (which is executed downstream of the NER) to take away tokens from the NER if the site grammar thinks it's a site. I think I can revert that if we decide that's the preferred behavior. What do you think?

bgyori commented 3 years ago

I see, is there any documentation / communication from around that time that has examples of issues before that change? That might help figure out how to engineer an overall better solution.

MihaiSurdeanu commented 3 years ago

No... This was an early decision, which may not even be justified right now. Not sure. I think we could employ your tactic here: revert the change, and compare outputs before and after to understand the impact.

bgyori commented 3 years ago

Yes, I think in principle a good prioritization would be to first check if the named entity can be grounded fully as is (in the above example as "cyclin A1"). Then if it cannot be grounded, check if it can be broken up into a protein + site combination. I'm actually not sure what some examples of this latter case might be since in most cases e.g., "ERK2 T185" the site isn't recognized as part of the entity in the first place. In any case, we can try to check empirically if there are any issues with changing this.

Also, I couldn't immediately find where this is implemented in the code but I could think more about the issue if you can point me to it.

MihaiSurdeanu commented 3 years ago

I'll take a look at the code later today.

MihaiSurdeanu commented 3 years ago

Just FYI, if you look at this file: https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml

you'll see that the rule that creates Site have priority 1, whereas these rules: https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml#L129-L233

which convert BIO notations into actual Reach entities have priority 3, which means they are executed later. I'll play with these priorities to see what happens.

MihaiSurdeanu commented 3 years ago

There's a little more here than just priorities. More soon, hopefully.

MihaiSurdeanu commented 3 years ago

It was not a priority issue. It was this line: https://github.com/clulab/reach/blob/master/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml#L186

which prevents entities contain upper-case letters followed by digits. As the comment says this was added to prevent the entity to "eat" mutations. Removing the !word=... constraint it makes Reach recognize "cyclin A1" as an entity. Also, all tests continue to pass.

@bgyori: should we remove this constraint? Or adjust it in a meaningful way? Thanks!

bgyori commented 3 years ago

I see, I'm still wondering if there was a good reason for adding that condition but I couldn't come up with any examples, even synthetically where it would make a difference. For instance, if we have a typical mutation listed as a separate word, like BRAF V600E, the V600E part is not recognized as part of the named entity anyway and so this condition isn't applied. In other cases, where there is a named entity match for the entire sequence like Cyclin D1, the correct behavior is to not break that up. Could there be cases where NER extends the entity boundary to include the second word even if it is not part of an entity string explicitly listed in the NER files?

bgyori commented 3 years ago

Just for the record, some synthetic examples I've tried (not all of them make sense biologically) are e.g., KRAS G13, KRAS G13C, EGFR A1, EGFR C15, EGFR A66C, AKT1 A66, etc. There are a few different cases for entity labels, one is

(EGFR,B-Gene_or_gene_product), (Y1,B-Gene_or_gene_product)
(AKT1,B-Gene_or_gene_product), (A66,B-Simple_chemical)

where we get a new B- (not I-) entity for the site. Another is

(KRAS,B-Gene_or_gene_product), (G13C,O)

where the site is an O (again not I-). I couldn't produce any examples with entities tagged as B- followed by I- where the I- was a legitimate site for B-.

MihaiSurdeanu commented 3 years ago

This came up because in addition of the bioresources NER, which you know, Reach merges in the outputs of a CRF NER, which (at least in past) we have observed to "eat" into mutations and sites. But it doesn't seem to be happening commonly at all. So it's maybe safe to just remove? Should I do it?

bgyori commented 3 years ago

I'm trying to do some analysis on papers with mutations to see if we lose anything, I'll report what I find here.

bgyori commented 3 years ago

A larger scale test actually highlighted some consequences to this proposed change. Several instances of figure names were extracted as named entities e.g.:

I found that before the change, from the same sentences we still extracted

as a named entity but we didn't get figure, presumably because that is a stop word. Is there a way to recognize Fig. and figure as stopwords assuming they could come with a suffix like Fig. 5a and figure 4a?

MihaiSurdeanu commented 3 years ago

Can you please give us a couple of example sentences?

On Thu, Jan 14, 2021 at 19:51 Benjamin M. Gyori notifications@github.com wrote:

A larger scale test actually highlighted some consequences to this proposed change. Several instances of figure names were extracted as named entities e.g.:

  • Fig. 5a
  • Fig. 4f
  • figure 4A I found that before the change, from the same sentences we still extracted
  • Fig. as a named entity but we didn't get figure, presumably because that is a stop word. Is there a way to recognize Fig. and figure as stopwords assuming they could come with a suffix like Fig. 5a and figure 4a?

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/clulab/reach/issues/722#issuecomment-760610463, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TU4S7RHRIOOF62H2FDSZ6UUXANCNFSM4V6TDIEA .

bgyori commented 3 years ago
MihaiSurdeanu commented 3 years ago

Thanks!

On Thu, Jan 14, 2021 at 20:06 Benjamin M. Gyori notifications@github.com wrote:

  • Similarly, we showed that wild-type p53 was polyubiquitinated by Pirh2 but not by Pirh2-DN and Pirh2-ΔRING (Fig. 5C, compare lane 3 with lanes 4 and 5).
  • In contrast, the levels of IRP2 and TfR1 were increased, whereas the level of FTH1 was decreased, by ectopic mutant p53 (Fig. 4f, compare lanes 3–4 with 1–2, respectively).
  • In addition, knockout of IRP2 led to decreased expression of TfR1 and increased expression of FTH1 (Fig. 5a), consistent with previous report [41].
  • MG132 treatment rescued the NSC59984-mediated down-regulation of mutant p53 (figure 4A).

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/clulab/reach/issues/722#issuecomment-760614989, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TRHFEJC2JJCJL7SAWTSZ6WKVANCNFSM4V6TDIEA .

MihaiSurdeanu commented 3 years ago

@enoriega, can you please:

Thanks!

enoriega commented 3 years ago

@MihaiSurdeanu adding figure and variants to the stop list file doesn't work for this problem:

If you consider figure 4A, the CRF assigns tags B- and I- to the tokens. The stop list is only checked for single word entities (https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L85-L89). At this point the grammar hasn't been called and 4A hasn't been detached yet.

I think the cleanest way to solve this is to discard the mention by encoding figure and variants into the rule itself.

What do you suggest?

MihaiSurdeanu commented 3 years ago

I fixed this a while ago in the NER post-processing code: https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L100-L107

But this obviously doesn't work anymore. Can you please try to fix that block of code? Thanks!

enoriega commented 3 years ago

Done. I noticed that after fixing it, the figure numbers (4a, 5C, etc) where recognized as sites. I updated the rule to fix this too.

https://github.com/clulab/reach/blob/4d7445bcd3cdd524abf58f571c9cc3a145d64785/main/src/main/resources/org/clulab/reach/biogrammar/entities/entities.yml#L33

Pull request coming soon.

enoriega commented 3 years ago

@MihaiSurdeanu I had to bump the bioresources dependency to version 1.1.37-SNAPSHOT. Should I leave this on for the pull request? Otherwise, what is the recommended course of action?

I see how this I inconvenient, and will propose a fix soon.

MihaiSurdeanu commented 3 years ago

Great, thanks!

reach master should not depend on any snapshot. So, we need to release bioresources first.

On Mon, Jan 18, 2021 at 10:17 AM Enrique Noriega notifications@github.com wrote:

@MihaiSurdeanu https://github.com/MihaiSurdeanu I had to bump the bioresources dependency to version 1.1.37-SNAPSHOT. Should I leave this on for the pull request? Otherwise, what is the recommended course of action?

I see how this I inconvenient, and will propose a fix soon.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/reach/issues/722#issuecomment-762378742, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TW4VUTU2J4JYLU36ITS2RULNANCNFSM4V6TDIEA .