clulab / tomcat-text

Natural language text processing code for the DARPA ASIST program

Error when running ExtractDirSearch #2

Closed by adarshp 3 years ago

adarshp commented 3 years ago

Hi @pelovett, I'm getting an error when running ExtractDirSearch. The invocation and the errors are below. The directory /Users/adarsh/git/clulab/tomcat-text/data/study-1_2020.08 contains all the HSR*.vtt, HSR*.tsv, and HSR*.metadata files from GCS.

~/git/clulab/tomcat-text (master) $ sbt "runMain org.clulab.asist.ExtractDirSearch /Users/adarsh/git/clulab/tomcat-text/data/study-1_2020.08"
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/adarsh/git/clulab/tomcat-text/project
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Set current project to asist (in build file:/Users/adarsh/git/clulab/tomcat-text/)
[warn] Multiple main classes detected.  Run 'show discoveredMainClasses' to see the list
[info] Running org.clulab.asist.ExtractDirSearch /Users/adarsh/git/clulab/tomcat-text/data/study-1_2020.08
[CoreNLP] Initializing the CoreNLP pipeline ...
13:08:40.111 [run-main-0] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
13:08:40.120 [run-main-0] INFO  e.s.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
13:08:40.128 [run-main-0] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
13:08:40.131 [run-main-0] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
13:08:41.076 [run-main-0] INFO  e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
13:08:41.079 [run-main-0] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
13:08:41.081 [run-main-0] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
13:08:42.640 [run-main-0] INFO  e.s.n.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.4 sec].
13:08:43.159 [run-main-0] INFO  e.s.n.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
13:08:43.865 [run-main-0] INFO  e.s.n.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.7 sec].
13:08:43.869 [run-main-0] INFO  e.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
13:08:44.739 [run-main-0] DEBUG e.s.n.l.t.CoreMapExpressionExtractor - Ignoring inactive rule: null
13:08:44.739 [run-main-0] DEBUG e.s.n.l.t.CoreMapExpressionExtractor - Ignoring inactive rule: temporal-composite-8:ranges
13:08:44.743 [run-main-0] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
13:08:47.577 [run-main-0] INFO  e.s.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.8 sec].
13:08:47.893 [run-main-0] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator dcoref
[CoreNLP] Completed Initialization
[error] (run-main-0) java.lang.NullPointerException
[error] java.lang.NullPointerException
[error]     at java.io.Reader.<init>(Reader.java:78)
[error]     at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
[error]     at scala.io.BufferedSource.reader(BufferedSource.scala:22)
[error]     at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:23)
[error]     at scala.io.BufferedSource.charReader$lzycompute(BufferedSource.scala:33)
[error]     at scala.io.BufferedSource.charReader(BufferedSource.scala:31)
[error]     at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:60)
[error]     at scala.io.BufferedSource.mkString(BufferedSource.scala:89)
[error]     at org.clulab.asist.ExtractDirSearch$.delayedEndpoint$org$clulab$asist$ExtractDirSearch$1(ExtractDirSearch.scala:430)
[error]     at org.clulab.asist.ExtractDirSearch$delayedInit$body.apply(ExtractDirSearch.scala:24)
[error]     at scala.Function0.apply$mcV$sp(Function0.scala:34)
[error]     at scala.Function0.apply$mcV$sp$(Function0.scala:34)
[error]     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error]     at scala.App.$anonfun$main$1$adapted(App.scala:76)
[error]     at scala.collection.immutable.List.foreach(List.scala:389)
[error]     at scala.App.main(App.scala:76)
[error]     at scala.App.main$(App.scala:74)
[error]     at org.clulab.asist.ExtractDirSearch$.main(ExtractDirSearch.scala:24)
[error]     at org.clulab.asist.ExtractDirSearch.main(ExtractDirSearch.scala)
[error]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]     at java.lang.reflect.Method.invoke(Method.java:498)
[error]     at sbt.Run.invokeMain(Run.scala:93)
[error]     at sbt.Run.run0(Run.scala:87)
[error]     at sbt.Run.execute$1(Run.scala:65)
[error]     at sbt.Run.$anonfun$run$4(Run.scala:77)
[error]     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error]     at sbt.util.InterfaceUtil$$anon$1.get(InterfaceUtil.scala:10)
[error]     at sbt.TrapExit$App.run(TrapExit.scala:252)
[error]     at java.lang.Thread.run(Thread.java:748)
[error] java.lang.RuntimeException: Nonzero exit code: 1
[error]     at sbt.Run$.executeTrapExit(Run.scala:124)
[error]     at sbt.Run.run(Run.scala:77)
[error]     at sbt.Defaults$.$anonfun$bgRunMainTask$6(Defaults.scala:1163)
[error]     at sbt.Defaults$.$anonfun$bgRunMainTask$6$adapted(Defaults.scala:1158)
[error]     at sbt.internal.BackgroundThreadPool.$anonfun$run$1(DefaultBackgroundJobService.scala:366)
[error]     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error]     at scala.util.Try$.apply(Try.scala:209)
[error]     at sbt.internal.BackgroundThreadPool$BackgroundRunnable.run(DefaultBackgroundJobService.scala:289)
[error]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error]     at java.lang.Thread.run(Thread.java:748)
[error] (Compile / runMain) Nonzero exit code: 1
[error] Total time: 52 s, completed Oct 22, 2020 1:08:49 PM
[INFO] [10/22/2020 13:08:49.201] [Thread-2] [CoordinatedShutdown(akka://sbt-web)] Starting coordinated shutdown from JVM shutdown hook
~/git/clulab/tomcat-text (master) $ vim src/main/scala/org/clulab/asist/ExtractDirSearch.scala

pelovett commented 3 years ago

This bug occurs because the script naively tries to extract events from every file in the directory it is given. A simple fix is to look only at files with a *.vtt extension, though some logic is still needed to pair transcripts with metadata files. I'll push the simple fix before adding the pairing logic.
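
For illustration, here is a rough sketch of what the extension filter could look like. It is not the code from the repository: the object name `VttFilterSketch` and both helper names are made up for this example, and the pairing rule (a transcript `X.vtt` goes with `X.metadata` in the same directory) is only an assumption about how the files might be matched.

```scala
import java.io.File

object VttFilterSketch {
  /** Regular files in `dir` whose names end in ".vtt", sorted by name. */
  def vttFiles(dir: File): Seq[File] = {
    // listFiles returns null when `dir` is not a readable directory,
    // so wrap it in Option instead of dereferencing it directly.
    val entries = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
    entries
      .filter(f => f.isFile && f.getName.toLowerCase.endsWith(".vtt"))
      .sortBy(_.getName)
  }

  /** Hypothetical pairing rule (an assumption, not the project's logic):
    * look for a sibling file with the same base name and a ".metadata" extension.
    */
  def metadataFor(vtt: File): Option[File] = {
    val base = vtt.getName.dropRight(".vtt".length)
    val candidate = new File(vtt.getParentFile, base + ".metadata")
    if (candidate.isFile) Some(candidate) else None
  }
}
```

With something like this, the extraction loop would iterate over `VttFilterSketch.vttFiles(new File(dirPath))` rather than over every entry in the directory, so the *.tsv and *.metadata files are never passed to the transcript parser.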

adarshp commented 3 years ago

@pelovett Sounds good, thanks!

pelovett commented 3 years ago

This should be solved by this commit: 3423acdc28cb9fbf604bf70ff352b29d05c14e30

The linked commit introduces a separate Scala app for parsing multiple transcripts. The question of how to get the relevant metadata has been split into a separate issue (#3).