Constannnnnt / Distributed-CoreNLP

This infrastructure, built with Java on Stanford CoreNLP, MapReduce, and Spark, processes document annotations at large scale.
https://github.com/Constannnnnt/Distributed-CoreNLP
MIT License

OOM error still exists #3

Closed ji-xin closed 5 years ago

ji-xin commented 5 years ago

This isn't even related to Spark; it happens when I construct the StanfordCoreNLP pipeline, which is a direct copy-and-paste of https://stanfordnlp.github.io/CoreNLP/ner.html#java-api-example.
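For context, the construction in question looks roughly like this; a minimal sketch of the linked NER example, with the class name NERPipeline chosen here for illustration:

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class NERPipeline {
        public static void main(String[] args) {
            Properties props = new Properties();
            // "ner" pulls in the CRF classifiers and the fine-grained
            // TokensRegexNER gazetteers that the log below shows loading.
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
            // The OutOfMemoryError is thrown inside this constructor,
            // while TokensRegexNERAnnotator compiles its patterns.
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        }
    }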

ji-xin commented 5 years ago
2018-11-12 17:58:12 INFO  AbstractSequenceClassifier:88 - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.2 sec].
2018-11-12 17:58:12 INFO  AbstractSequenceClassifier:88 - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
2018-11-12 17:58:13 INFO  AbstractSequenceClassifier:88 - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.5 sec].
2018-11-12 17:58:26 INFO  TokensRegexNERAnnotator:88 - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
2018-11-12 17:58:26 INFO  TokensRegexNERAnnotator:88 - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
2018-11-12 17:58:26 INFO  TokensRegexNERAnnotator:88 - ner.fine.regexner: Read 585573 unique entries from 2 files
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.HashMap.resize(HashMap.java:704)
    at java.util.HashMap.putVal(HashMap.java:629)
    at java.util.HashMap.put(HashMap.java:612)
    at java.util.HashSet.add(HashSet.java:220)
    at edu.stanford.nlp.ling.tokensregex.SequencePattern$State.add(SequencePattern.java:1277)
    at edu.stanford.nlp.ling.tokensregex.SequencePattern$Frag.connect(SequencePattern.java:2026)
    at edu.stanford.nlp.ling.tokensregex.SequencePattern$GroupPatternExpr.build(SequencePattern.java:676)
    at edu.stanford.nlp.ling.tokensregex.SequencePattern.<init>(SequencePattern.java:128)
    at edu.stanford.nlp.ling.tokensregex.SequencePattern.<init>(SequencePattern.java:116)
    at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.<init>(TokenSequencePattern.java:149)
    at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.compile(TokenSequencePattern.java:238)
    at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.createPatternMatcher(TokensRegexNERAnnotator.java:383)
    at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.<init>(TokensRegexNERAnnotator.java:317)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.setUpFineGrainedNER(NERCombinerAnnotator.java:220)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:145)
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$5(StanfordCoreNLP.java:523)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$23/1290272762.apply(Unknown Source)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$30(StanfordCoreNLP.java:602)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$45/1635378213.get(Unknown Source)
    at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
    at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:251)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:192)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:188)
    at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
2018-11-12 17:58:57 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-11-12 17:58:57 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-9c1ee18c-a8c0-4b6b-b30b-67717418e902
Constannnnnt commented 5 years ago

Solved: add --driver-memory 4G to the spark-submit command, which becomes spark-submit --class ca.uwaterloo.cs651.project.CoreNLP --driver-memory 4G target/project-1.0.jar -input simpledata -output output -functionality ner

The reason: initializing a full CoreNLP pipeline needs a heap larger than 3 GB, while the default configuration gives 1 GB to both the driver and the workers. I set 4 GB for convenience. We raise the driver memory in particular because the pipeline is constructed in the driver, i.e. the process that runs main() and creates the SparkConf; I am not sure what heap size the workers need. Judging from previous assignments, up to 24 GB is probably fine.

Reference: http://spark.apache.org/docs/latest/configuration.html
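If the workers ever construct the pipeline too (e.g. annotating inside map tasks), their heap can be raised the same way. A sketch, where --executor-memory 4G is my guess at a workable value, not something I've measured:

    spark-submit \
      --class ca.uwaterloo.cs651.project.CoreNLP \
      --driver-memory 4G \
      --executor-memory 4G \
      target/project-1.0.jar \
      -input simpledata -output output -functionality ner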

KaisongHuang commented 5 years ago

I tried 2g and it also worked.

Constannnnnt commented 5 years ago

Oh, okay. Then it doesn't matter much; let's keep 4G for now, since it only sets the maximum heap size for the driver.
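For what it's worth, --driver-memory is the spark-submit counterpart of the JVM's -Xmx flag, so running the same pipeline outside Spark would need the equivalent, roughly (jar names abbreviated and assumed):

    java -Xmx4g -cp "stanford-corenlp.jar:stanford-corenlp-models.jar:." NERPipeline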