clulab / reach

Reach Biomedical Information Extraction

Problem with resource files in fat jar #759

Closed bgyori closed 2 years ago

bgyori commented 2 years ago

I am trying to create a fat JAR of the latest Reach using sbt assembly and then use it to process text with the ApiRuler's annotate_text method (this has been our standard integration approach; nothing has changed here). If I run it from the Reach repo's main folder, it works: it loads all the resource files and returns a result:

[dynet] Loading DyNet from /tmp/libdynet_swig-2518909563452203147.so...
[dynet] random seed: 2843805941
[dynet] allocating memory: 512,512,512,512MB
[dynet] memory allocation done.
22:41:27.961 [main] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading saved model SavedLSTM_WideBound_u_tag ...
22:41:28.282 [main] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading model finished!
22:41:44.011 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
22:41:44.236 [main] INFO  e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.2 sec].
22:41:44.249 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
22:41:44.285 [main] INFO  org.clulab.sequences.LexiconNER - Beginning to load the KBs for the rule-based NER...
22:41:44.303 [main] INFO  org.clulab.sequences.LexiconNER - Loaded OVERRIDE matchers for all labels.  The number of entries added to the first layer was 750.
22:41:44.304 [main] INFO  o.c.p.bionlp.BioNLPProcessor - Loading BioProcess...
22:41:45.838 [main] INFO  o.c.p.bionlp.BioNLPProcessor - Done. Read org.clulab.processors.bionlp.ner.ReachSingleStandardKbSource$$anon$1@7e27b77a.lineCount lines from bio_process.tsv
...

However, if I run it from any other folder, it fails while loading the first resource file:

[dynet] Loading DyNet from /tmp/libdynet_swig-8493435714969305050.so...
[dynet] random seed: 2078540065
[dynet] allocating memory: 512,512,512,512MB
[dynet] memory allocation done.
22:41:00.169 [main] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading saved model SavedLSTM_WideBound_u_tag ...
22:41:00.492 [main] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading model finished!
22:41:16.736 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
22:41:16.973 [main] INFO  e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.2 sec].
22:41:16.988 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
22:41:17.024 [main] INFO  org.clulab.sequences.LexiconNER - Beginning to load the KBs for the rule-based NER...
22:41:17.041 [main] INFO  org.clulab.sequences.LexiconNER - Loaded OVERRIDE matchers for all labels.  The number of entries added to the first layer was 750.
22:41:17.042 [main] INFO  o.c.p.bionlp.BioNLPProcessor - Loading BioProcess...

-> process exits here.

So I suspect the issue is with the path by which the bioresources are referred to in the context of a JAR file. In particular, I am wondering if this line: https://github.com/clulab/reach/blob/master/bioresources/src/main/resources/application.conf#L1 could be responsible for the issue.

Any help would be appreciated!

bgyori commented 2 years ago

As a minor side issue, note that when I ran this from the Reach repo folder, the log message says:

22:41:45.838 [main] INFO  o.c.p.bionlp.BioNLPProcessor - Done. Read org.clulab.processors.bionlp.ner.ReachSingleStandardKbSource$$anon$1@7e27b77a.lineCount lines from bio_process.tsv

so the line count is not shown as intended.
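
Output of this shape (an object's default toString followed by a literal ".lineCount") is what Scala string interpolation produces when a member access is written without braces. The sketch below only illustrates that general pitfall; `KbSource` and `InterpolationDemo` are hypothetical names, not Reach's actual classes, and this is an assumption about the cause, not a confirmed diagnosis.

```scala
// Hypothetical stand-in for a KB source; NOT the actual Reach class.
class KbSource(val lineCount: Int)

object InterpolationDemo {
  // Without braces, only `src` is interpolated (via its toString);
  // ".lineCount" is kept as literal text in the message.
  def buggy(src: KbSource): String =
    s"Done. Read $src.lineCount lines from bio_process.tsv"

  // With braces, the whole expression `src.lineCount` is evaluated.
  def fixed(src: KbSource): String =
    s"Done. Read ${src.lineCount} lines from bio_process.tsv"
}
```

Under this assumption, wrapping the expression in `${...}` is enough to make the count appear in the message.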

kwalcock commented 2 years ago

At runtime (except maybe during testing, so testtime), nothing should refer to the directory structure of the source code. So, I think you've found a problem. I'm looking into what to do about it.
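
To illustrate the general point, here is a minimal sketch contrasting a lookup relative to the working directory with a classpath lookup, which also works from inside a fat JAR. The object and method names are hypothetical and this is not how Reach or #760 implements resource loading; it only shows the standard JVM mechanism.

```scala
import java.io.File
import scala.io.Source

// Hypothetical helper for illustration; not part of Reach.
object ResourceLoading {
  // Fragile: a relative path is resolved against the current working
  // directory, so it only succeeds when run from the expected folder.
  def existsOnDisk(relativePath: String): Boolean =
    new File(relativePath).exists()

  // Robust: the path is resolved against the classpath, so it does not
  // depend on where the process was started. Returns None if not found.
  def readFromClasspath(resourcePath: String): Option[String] = {
    val stream = Option(getClass.getClassLoader.getResourceAsStream(resourcePath))
    stream.map { in =>
      val source = Source.fromInputStream(in, "UTF-8")
      try source.mkString finally source.close()
    }
  }
}
```

Returning an `Option` makes a missing resource explicit at the call site, so the caller can log it or fail with a clear message.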

MihaiSurdeanu commented 2 years ago

Thank you Keith!

On Tue, Aug 24, 2021 at 8:14 AM Keith Alcock @.***> wrote:

At runtime (except maybe during testing, so testtime), nothing should refer to the directory structure of the source code. So, I think you've found a problem. I'm looking into what to do about it.


kwalcock commented 2 years ago

The logging statement does have a bug as well, but it is readily fixed.

kwalcock commented 2 years ago

This is being addressed with #760.

bgyori commented 2 years ago

Thank you! I have now been using this and everything seems to work.