Constannnnnt / Distributed-CoreNLP

This infrastructure, built on Stanford CoreNLP, MapReduce, and Spark with Java, aims to process document annotations at large scale.
https://github.com/Constannnnnt/Distributed-CoreNLP
MIT License

OOM Error when running NER and DCOREF on SimpleNLP (SUTime) #1

Closed ji-xin closed 5 years ago

ji-xin commented 5 years ago

An error that is very likely caused by SUTime:

2018-11-12 16:42:59 INFO  TimeExpressionExtractorImpl:88 - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
2018-11-12 16:43:46 ERROR Utils:91 - Aborting task
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.asList(Arrays.java:3800)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:808)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:615)
        at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.<init>(TokensRegexNERAnnotator.java:315)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.setUpFineGrainedNER(NERCombinerAnnotator.java:220)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:145)
        at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
        at edu.stanford.nlp.simple.Document$2.lambda$null$0(Document.java:111)
        at edu.stanford.nlp.simple.Document$2$$Lambda$61/336633101.get(Unknown Source)
        at edu.stanford.nlp.util.Lazy$2.compute(Lazy.java:106)
        at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
        at edu.stanford.nlp.simple.Document$2.get(Document.java:111)
        at edu.stanford.nlp.simple.Document$2.get(Document.java:106)
        at edu.stanford.nlp.simple.Document.runNER(Document.java:853)
        at edu.stanford.nlp.simple.Sentence.nerTags(Sentence.java:528)
        at edu.stanford.nlp.simple.Sentence.nerTags(Sentence.java:536)
        at ca.uwaterloo.cs651.project.SimpleNLP.lambda$main$fceadcfc$1(SimpleNLP.java:96)
        at ca.uwaterloo.cs651.project.SimpleNLP$$Lambda$27/363625212.call(Unknown Source)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:125)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
ji-xin commented 5 years ago

ner_log.txt

dcoref_log.txt

Constannnnnt commented 5 years ago

Rather than an error caused by SUTime, I guess it is more likely a case where the document is too large and can't be fed into a Document object in Simple, so all the memory gets exhausted. Maybe the Linux configuration limits the memory size?

Remember we can run a simple sentence and everything works, right?
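For what it's worth, a `GC overhead limit exceeded` error during NER model loading usually means the JVM heap on the executor is too small for the models, not that the OS is limiting memory. A possible mitigation, assuming the job is launched with spark-submit (the class name matches the stack trace; the memory values and jar path are placeholders to tune for your cluster), is to raise driver and executor memory:

```shell
# Sketch only: memory sizes and jar path are hypothetical examples.
# --driver-memory / --executor-memory and spark.executor.memoryOverhead
# are standard Spark options (the overhead key applies to Spark 2.3+).
spark-submit \
  --class ca.uwaterloo.cs651.project.SimpleNLP \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=2g \
  target/project.jar
```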

ji-xin commented 5 years ago

SimpleData is used as input, and it contains only 2 sentences.

Constannnnnt commented 5 years ago

Hmmm, weird. I will take a look and give it a try tonight.

ji-xin commented 5 years ago

I would say this is the end of SimpleNLP.java. These two issues can hardly be solved, since SimpleNLP itself is hardly customizable.

Now let's move on to CoreNLP.java and migrate things from https://stanfordnlp.github.io/CoreNLP/annotators.html. There are some important annotators to work on, such as NER and relation extraction; let's start with these two.
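One reason the full CoreNLP pipeline is more workable than the Simple API is that it is configured through a Properties object, so individual annotators can be tuned or disabled. A minimal sketch (class and sentence are illustrative; `ner.applyFineGrained` is a real CoreNLP property that skips loading the TokensRegexNER rule files visible in the stack trace above, which may also reduce memory pressure):

```java
import java.util.Properties;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Hypothetical example class, not part of the repository.
public class CoreNLPSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Explicitly choose annotators instead of the fixed Simple defaults.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Skip the fine-grained TokensRegexNER rules, whose loading
        // appears in the OOM stack trace (setUpFineGrainedNER).
        props.setProperty("ner.applyFineGrained", "false");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument doc = new CoreDocument("Stanford is in California.");
        pipeline.annotate(doc);
        doc.tokens().forEach(t ->
            System.out.println(t.word() + "\t" + t.ner()));
    }
}
```

In a Spark job, the pipeline would typically be built once per partition (e.g. inside mapPartitions) so each executor loads the models a single time rather than per record.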