clulab / reach

Reach Biomedical Information Extraction
Other
97 stars 39 forks source link

Update to processors 8.3.0 #742

Closed kwalcock closed 3 years ago

kwalcock commented 3 years ago
[error] Failed tests:
[error]     org.clulab.processors.TestBioNLPProcessor

[error] Failed tests:
[error]     org.clulab.reach.TestBindingEvents
[error]     org.clulab.reach.TestNERLabeling
[error]     org.clulab.reach.PolaritySuite
[error]     org.clulab.reach.TestDeModifications
[error]     org.clulab.reach.TestTemplaticSimpleDeEvents
[error]     org.clulab.reach.TestPolarity
[error]     org.clulab.reach.TestAmountEvents
[error]     org.clulab.reach.NegationTests
[error]     org.clulab.reach.TestActivationEvents
[error]     org.clulab.reach.TestCoreference
[error]     org.clulab.reach.TestModifications
[error]     org.clulab.reach.ExperimentalRegulationTests
[error]     org.clulab.reach.TestRegulationEvents
[error]     org.clulab.reach.TestOverrides
[error]     org.clulab.reach.TestTemplaticSimpleEvents
[error]     org.clulab.reach.TestConversionEvents

[error] Failed tests:
[error]     org.clulab.reach.export.TestApi
[error]     org.clulab.reach.export.TestOutputDegrader
kwalcock commented 3 years ago

I don't have a good idea why these tests are failing. They must be related to recent changes in the NER code or the embeddings code. Any suggestions out there?

MihaiSurdeanu commented 3 years ago

I’ll take a look soon.

On March 26, 2021 at 7:45:46 PM, Keith Alcock @.***) wrote:

I don't have a good idea why these tests are failing. They must be related to recent changes in the NER code or the embeddings code. Any suggestions out there?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/clulab/reach/pull/742#issuecomment-808630168, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TR5HKW3NZ7EUJIT6WDTFVBFTANCNFSM4ZVYREKQ .

MihaiSurdeanu commented 3 years ago

@kwalcock, @enoriega: it seems that the change affected the behavior of the LexiconNER. I propose we focus on understanding this first. Here are two examples:

From org.clulab.reach.TestNERLabeling: Before 8.3.0, the sentence below marks "E3" as Simple-chemical:

TEXT:   Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer .
TOKENS: (0,Estrone,NN), (1,E1,NN), (2,,,,), (3,estradiol,NN), (4,E2,NN), (5,,,,), (6,and,CC), (7,estriol,NN), (8,E3,NN), (9,do,VBP), (10,not,RB), (11,cause,VB), (12,cancer,NN), (13,.,.)
ENTITY LABELS: (Estrone,B-Simple_chemical), (E1,B-Simple_chemical), (,,O), (estradiol,B-Simple_chemical), (E2,B-Simple_chemical), (,,O), (and,O), (estriol,B-Simple_chemical), (E3,B-Simple_chemical), (do,O), (not,O), (cause,O), (cancer,O), (.,O)

After the change to 8.3.0, "E3" is marked as Family:

TEXT:   Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer .
TOKENS: (0,Estrone,NN), (1,E1,NN), (2,,,,), (3,estradiol,NN), (4,E2,NN), (5,,,,), (6,and,CC), (7,estriol,NN), (8,E3,NN), (9,do,VBP), (10,not,RB), (11,cause,VB), (12,cancer,NN), (13,.,.)
ENTITY LABELS: (Estrone,B-Simple_chemical), (E1,B-Simple_chemical), (,,O), (estradiol,B-Simple_chemical), (E2,B-Simple_chemical), (,,O), (and,O), (estriol,B-Simple_chemical), (E3,B-Family), (do,O), (not,O), (cause,O), (cancer,O), (.,O)

Another one: In org.clulab.processors.TestBioNLPProcessor, before 8.3.0 "neurofibromin" was marked as a GGP:

TEXT:   We tested the level of neurofibromin present in the sample
TOKENS: (0,We,PRP), (1,tested,VBD), (2,the,DT), (3,level,NN), (4,of,IN), (5,neurofibromin,JJ), (6,present,JJ), (7,in,IN), (8,the,DT), (9,sample,NN)
ENTITY LABELS: (We,O), (tested,O), (the,O), (level,O), (of,O), (neurofibromin,B-Gene_or_gene_product), (present,O), (in,O), (the,O), (sample,O)

After 8.3.0, this entity is no longer recognized:

TEXT:   We tested the level of neurofibromin present in the sample
TOKENS: (0,We,PRP), (1,tested,VBD), (2,the,DT), (3,level,NN), (4,of,IN), (5,neurofibromin,JJ), (6,present,JJ), (7,in,IN), (8,the,DT), (9,sample,NN)
ENTITY LABELS: (We,O), (tested,O), (the,O), (level,O), (of,O), (neurofibromin,O), (present,O), (in,O), (the,O), (sample,O)

Note that in bioresources, we do not have "neurofibromin" as a GGP. We have "neurofibromin 1" and "neurofibromin 2". So technically, the latter behavior is correct. But I would like to understand what is happening here. @enoriega: can you please help? Can you please try to debug these two examples and understand what behavior changed? I am pretty sure it's in LexiconNER, but not sure what. Once we do, I suspect @kwalcock can fix it.

Thank you both!

MihaiSurdeanu commented 3 years ago

@enoriega : can you please prioritize this?

enoriega commented 3 years ago

@kwalcock @MihaiSurdeanu

For Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer ., the change happened because there is a conflicting match in NER-Grounding-override.tsv with two matching entries, one for simple_chemical and another for family. This situation was already present, and before the change, the class CompactLexiconNER consistently chose simple chemical. After the change, it now choses family, and I was able to trace the culprit to the longest match function:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/CompactLexiconNER.scala#L69-L137

I'm not sure if the best way to fix it is to make it behave as before or to remove the conflicting entry in the overrides file.


For We tested the level of neurofibromin present in the sample, both NER components work correctly. The CRF labels neurofibromin as B-Gene-or-gene-product. The problem was introduced by a refactor of the following function, which overwrites the CRF's tag with the lexicon's (which is O, because it is no an entry in any KB) while attempting to merge both tag sets:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/LexiconNER.scala#L261-L278

processors' version used by the master branch of reach has a different implementation of mergeLabels

MihaiSurdeanu commented 3 years ago

Thanks @enoriega !

@kwalcock : has anything changed in these methods? why the different behavior?

kwalcock commented 3 years ago

I'll look at it very soon. @enoriega did excellent detective work. I suspect that I broke some things, especially in cases of ties or conflicts, although everything was supposed to have been taken into account.

kwalcock commented 3 years ago

The second one with We tested the level of neurofibromin present is an outright mistake, mine, and it's pretty bad. The line of code in LexiconNER.mergeLabels that was

src.copyToArray(dst, offset, notOutsideCount)

I thought to mean the same as

Array.copy(src, offset, dst, offset, notOutsideCount)

but instead it is

Array.copy(src, 0, dst, offset, notOutsideCount)

so that the wrong values are being copied into the destination. There will be an update and probably a warning not to use version 8.3.0. Master was affected for a time, but it looks like there was only the one official release.

MihaiSurdeanu commented 3 years ago

:)

kwalcock commented 3 years ago

This second issue with Estrone E1 , estradiol E2 , and estriol E3 do not cause cancer . does have to do with there being two entries in the overrideKB and I indeed changed the behavior there on purpose, but now that it looks like a problem, I'll say mistakenly. The explanation is quite long-winded. If this tie breaking is especially important it might be worth visiting more formally, though. I changed processors so that the tests pass, but that is just so that previous behavior is preserved, not because it's good behavior.

Long ago, before changes of 2/2/2019, code for the override KBs looked like

      overrideKBs.get.foreach(okb => {
        val reader = loadStreamFromClasspath(okb)
        val overrideMatchers = loadOverrideKB(reader, lexicalVariationEngine, caseInsensitiveMatching, knownCaseInsensitives)
        for(name <- overrideMatchers.keySet.toList.sorted) {
          val matcher = overrideMatchers(name)
          logger.info(s"Loaded OVERRIDE matcher for label $name. This matcher contains ${matcher.uniqueStrings.size} unique strings; the size of the first layer is ${matcher.entries.size}.")
          matchers += Tuple2(name, matcher)
        }
        reader.close()
      })

If I understand this correctly, the matchers were ordered by the overrideKBs (files) and then within each of those, alphabetically by label which is in the keySet. This may have just been a way to ensure consistency. Afterwards came the regular KBs in the order they were specified. Each NE could appear in numerous of the matchers, but matcher order determined the winner.

The version right after that which has been in production until recently used a single matcher which for each NE included a single number that was the index into an array of labels. There could only be one number. It would have been difficult to calculate the winning number to match previous behavior and I probably missed seeing the sorting anyway. The label first associated with an NE was used.

In the most recent version which is causing the problems, I relaxed that first label requirement and allowed overwriting if there was a duplicate in any override KB whose label would have come earlier in the list of regular KBs. That messes up the reach tests. The change might have been to better approximate original behavior, but I didn't realize that the modified behavior of the previous paragraph was already important.

There will be a processors PR momentarily.

MihaiSurdeanu commented 3 years ago

Thanks @kwalcock !

MihaiSurdeanu commented 3 years ago

In the second version, how were the labels assigned to the same NE sorted? That may be some arbitrary behavior not worth preserving...

Also, @enoriega: can you please remove the multiple labels for E3? Maybe keep chemical, since that is backwards compatible?

kwalcock commented 3 years ago

They were first come, first serve in the second, middle version. However, the queue was first all the overrideKBs in order, then the regular KBs. The overrideKBs refer to labels (names) of the regular KBs, so using an overrideKB one could more or less move an NE from one KB to a different one. Maybe that was meant by override. For the newer version I might have been thinking that the order of the regular KBs should determine the priority, not the line or file in an overrideKB that something appeared. That might make more sense for new entries. One overrides an otherwise closed KB by adding a new entry and it should have the same precedence as everything else in there.

IIRC correctly, I encountered many duplicates, although maybe not ones in the same files like E3. I could make a list.

MihaiSurdeanu commented 3 years ago

Ok, then it is important to preserve this functionality. That order has meaning, and it is what was meant by override.

kwalcock commented 3 years ago

If someone was to decide that the overrides were case sensitive, then things like Trp and TRP would not be redundant or conflicting. If anything is all lower case, then it gets involved in the known case insensitives, and removing them gets complicated. Nevertheless, here are the "duplicates":

eht-1864 akt axin camk2 cdc25 cdkn1 cox4 cox6b cox7a cox8 cyclin-a dgk dvl e3 erk fos g-alpha hsp70 hsp90 ifngr pp1c prkaa rps6ka raf rap1 ras shc smc1 thr trp vav wnt5 cadherin eif4a eif4g eotaxin porin thioredoxin (reductase) ubiquitin

kwalcock commented 3 years ago

Here is also a note to self. If instead of the CombinedLexiconNER, the SeparatedLexiconNER is used, a few of the tests will fail. I think these are tests that were written after the Combined version was put into use, so they depend on the per NE override. Previously the override priority was per file. So, for example, in NER-Grounding-Override.tsv there is now

Line 249 something Family Line 298 otherthing Site Line 333 Thr Site Line 756 THR Family

Because of line 249, the Separated one will give priority to Family because it saw that label first. The Combined version makes that decision on a per NE basis. Line 333 will cause Site to be used because the later Family will not overwrite it now. It had been allowing this overwrite, but that was at odds with the middle version from above, so that won't be done anymore.

kwalcock commented 3 years ago

This PR has been superseded by others, particularly one updating processors to 8.3.3. Some of the individual commits which might be useful will be transferred to a different PR.