clulab / reach

Reach Biomedical Information Extraction
Other
96 stars 39 forks source link

Improved ahead-of-time annotator `AnnotationsCLI` #769

Closed enoriega closed 2 years ago

enoriega commented 2 years ago

Improved the entry point to annotate and serialize input files with BioNLPProcessor such that the serialized file can be loaded and the grammar applied without repeating processors' annotations pipeline.

Two main changes:

  1. Replaced the vanilla serializer with DocumentSerializer
  2. Decoupled the hybrid NER steps: This is necessary because our custom KB ner is part of BioNLPProcessor.annotate. Saving the KB NER annotations will lock the second step with the state of the KB files at annotation time. Effectively defeating the purpose of caching the annotations to apply new rules later on. Instead, the CRF NER runs at annotation time and the KB NER at "grammar time". I achieved this with some refactorization of BioNLPProcessor and ReachSystem.
enoriega commented 2 years ago

With annotation time I mean: pre-processing time With grammar time I mean: apply the rules and extract the mentions using the pre-processed annotations

kwalcock commented 2 years ago

@enoriega, I hope to look at it tomorrow. Did you notice the test failing? Some output is below and it looks vaguely familiar. Some project will fail if two tests (like for two versions of Scala) run in parallel and try to access the same port at the same time. It could have been this one. I think the work around was to just try again. I'll refrain from rerunning the tests, though, until you take a look.

00:44:29.190 [apiserver-akka.actor.default-dispatcher-2] INFO  o.c.reach.export.server.ApiService - Server online at http://localhost:8080/
00:44:29.243 [apiserver-akka.actor.default-dispatcher-6] ERROR akka.io.TcpListener - Bind failed for TCP channel on endpoint [localhost/127.0.0.1:8080]
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at akka.io.TcpListener.liftedTree1$1(TcpListener.scala:56)
    at akka.io.TcpListener.<init>(TcpListener.scala:53)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
MihaiSurdeanu commented 2 years ago

Also, why did you separate the ruleNER from the others?

enoriega commented 2 years ago

@MihaiSurdeanu I defered the rule NER because I need it to run after serialization such that changes in the KB files will apply after the serialization.

Otherwise, the entities will be frozen in the state before serialization and any changes on the KB and associated rules won't reflect if using serialized files as inputs

enoriega commented 2 years ago

I am aware of some modification tests failing. I will fix them soon.

enoriega commented 2 years ago

@enoriega, I hope to look at it tomorrow. Did you notice the test failing? Some output is below and it looks vaguely familiar. Some project will fail if two tests (like for two versions of Scala) run in parallel and try to access the same port at the same time. It could have been this one. I think the work around was to just try again. I'll refrain from rerunning the tests, though, until you take a look.

00:44:29.190 [apiserver-akka.actor.default-dispatcher-2] INFO  o.c.reach.export.server.ApiService - Server online at http://localhost:8080/
00:44:29.243 [apiserver-akka.actor.default-dispatcher-6] ERROR akka.io.TcpListener - Bind failed for TCP channel on endpoint [localhost/127.0.0.1:8080]
java.net.BindException: Address already in use
  at sun.nio.ch.Net.bind0(Native Method)
  at sun.nio.ch.Net.bind(Net.java:433)
  at sun.nio.ch.Net.bind(Net.java:425)
  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
  at akka.io.TcpListener.liftedTree1$1(TcpListener.scala:56)
  at akka.io.TcpListener.<init>(TcpListener.scala:53)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

@kwalcock it looks like this is the only test still broken, but it passed when running locally. How can I trigger another run from jenkins?

kwalcock commented 2 years ago

I clicked the restart from stage button.

kwalcock commented 2 years ago

@enoriega,

[info] TestBioNLPProcessor:
[info] BioNLPProcessor
[info] - should recognize some tricky entity names *** FAILED *** (1 second, 479 milliseconds)
[info]   "B-[Gene_or_gene_product]" was not equal to "B-[Family]" (TestBioNLPProcessor.scala:29)
[info] BioNLPProcessor
[info] - should recognize correct NEs in text 1 *** FAILED *** (6 milliseconds)
[info]   "[O]" was not equal to "[B-Gene_or_gene_product]" (TestBioNLPProcessor.scala:45)
[info] - should recognize correct NEs in text 2 *** FAILED *** (62 milliseconds)
[info]   "B-[Gene_or_gene_product]" was not equal to "B-[Site]" (TestBioNLPProcessor.scala:84)
[info] - should recognize correct species in text 3 *** FAILED *** (6 milliseconds)
[info]   "B-[Gene_or_gene_product]" was not equal to "B-[Species]" (TestBioNLPProcessor.scala:107)
[info] - should recognize protein families (3 milliseconds)
[info] - should fixup POS tags correctly for bio-verbs (51 milliseconds)
[info] - should fixup POS tags correctly for reverse bio-verbs (115 milliseconds)
[info] - should Correct the tags of His and Pro to NN (2 milliseconds)
[info] - should not annotate NEs from the stop list (3 milliseconds)
[info] - should not annotate NEs from the stop list when upper initial (2 milliseconds)
[info] - should tokenize complexes correctly (13 milliseconds)
[info] - should override named entities *** FAILED *** (2 milliseconds)
[info]   "B-[Gene_or_gene_product]" was not equal to "B-[Simple_chemical]" (TestBioNLPProcessor.scala:289)
[info] - should recognize 'Smad 2' as a protein (2 milliseconds)
[info] - should recognize 'Smad' as a family *** FAILED *** (2 milliseconds)
[info]   "B-[Gene_or_gene_product]" was not equal to "B-[Family]" (TestBioNLPProcessor.scala:307)
[info] - should not label XREFs as entities (4 milliseconds)
[info] - should label E2F as a family *** FAILED *** (2 milliseconds)
[info]   "B-[Gene_or_gene_product]" was not equal to "B-[Family]" (TestBioNLPProcessor.scala:323)
kwalcock commented 2 years ago

P.S. It looks like that was the test that failed in the previous run rather than the test with the port. I hadn't checked before clicking.

kwalcock commented 2 years ago

@enoriega, I'll get the jenkins password to you at the next convenient time so that you can trigger easier.

kwalcock commented 2 years ago

@enoriega, it could be that the failing test just changed the timing a little so that the port conflict was much more likely to occur. Whatever the reason, I'll take the win. Thanks for your patience.

enoriega commented 2 years ago

Thanks @kwalcock . So, I see the server still fails, but the tests pass:

20:36:02.324 [apiserver-akka.actor.default-dispatcher-3] INFO  o.c.reach.export.server.ApiService - Server online at http://localhost:8080
20:36:02.370 [apiserver-akka.actor.default-dispatcher-5] ERROR akka.io.TcpListener - Bind failed for TCP channel on endpoint [localhost/127.0.0.1:8080]
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)

We will have to see how to fix this, but that will be for another time