2018-11-12 17:58:12 INFO AbstractSequenceClassifier:88 - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.2 sec].
2018-11-12 17:58:12 INFO AbstractSequenceClassifier:88 - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
2018-11-12 17:58:13 INFO AbstractSequenceClassifier:88 - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.5 sec].
2018-11-12 17:58:26 INFO TokensRegexNERAnnotator:88 - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
2018-11-12 17:58:26 INFO TokensRegexNERAnnotator:88 - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
2018-11-12 17:58:26 INFO TokensRegexNERAnnotator:88 - ner.fine.regexner: Read 585573 unique entries from 2 files
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap.resize(HashMap.java:704)
at java.util.HashMap.putVal(HashMap.java:629)
at java.util.HashMap.put(HashMap.java:612)
at java.util.HashSet.add(HashSet.java:220)
at edu.stanford.nlp.ling.tokensregex.SequencePattern$State.add(SequencePattern.java:1277)
at edu.stanford.nlp.ling.tokensregex.SequencePattern$Frag.connect(SequencePattern.java:2026)
at edu.stanford.nlp.ling.tokensregex.SequencePattern$GroupPatternExpr.build(SequencePattern.java:676)
at edu.stanford.nlp.ling.tokensregex.SequencePattern.<init>(SequencePattern.java:128)
at edu.stanford.nlp.ling.tokensregex.SequencePattern.<init>(SequencePattern.java:116)
at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.<init>(TokenSequencePattern.java:149)
at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.compile(TokenSequencePattern.java:238)
at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.createPatternMatcher(TokensRegexNERAnnotator.java:383)
at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.<init>(TokensRegexNERAnnotator.java:317)
at edu.stanford.nlp.pipeline.NERCombinerAnnotator.setUpFineGrainedNER(NERCombinerAnnotator.java:220)
at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:145)
at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$5(StanfordCoreNLP.java:523)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$23/1290272762.apply(Unknown Source)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$30(StanfordCoreNLP.java:602)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$45/1635378213.get(Unknown Source)
at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:251)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:192)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:188)
at ca.uwaterloo.cs651.project.CoreNLP.main(CoreNLP.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
2018-11-12 17:58:57 INFO ShutdownHookManager:54 - Shutdown hook called
2018-11-12 17:58:57 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-9c1ee18c-a8c0-4b6b-b30b-67717418e902
Solved: add --driver-memory 4G to the spark-submit command, which then becomes:
spark-submit --class ca.uwaterloo.cs651.project.CoreNLP --driver-memory 4G target/project-1.0.jar -input simpledata -output output -functionality ner
The reason: initializing a full CoreNLP pipeline needs a heap larger than 3 GB, while the default configuration gives 1 GB to both the driver and the workers. I set 4 GB for convenience. We raise the driver memory specifically because the driver is the process running the program where SparkConf is created, and that is where the pipeline is built; I am not sure what heap size the workers need. Judging from previous assignments, anything up to 24 GB is probably fine.
Reference: http://spark.apache.org/docs/latest/configuration.html
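For reference, the stack trace above shows the OOM being thrown while TokensRegexNERAnnotator compiles the ~585k fine-grained gazetteer entries. If raising driver memory were not an option, a hedged alternative would be to skip that step by turning off fine-grained NER; a minimal sketch, assuming the ner.applyFineGrained property of recent CoreNLP releases:

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SlimNER {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Skip the fine-grained TokensRegexNER pass that loads the large
        // regexner_caseless.tab / regexner_cased.tab gazetteers seen in the log.
        props.setProperty("ner.applyFineGrained", "false");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

This trades the fine-grained entity types (and their memory cost) for the plain output of the 3-, 4-, and 7-class CRF models loaded at the top of the log.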
I tried 2g and it also worked.
Oh, okay. Then it doesn't matter much; let's keep 4G for now, since it just specifies the maximum heap size for the driver.
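If you want to verify what heap the driver actually received, a quick check with the standard JDK API (nothing CoreNLP- or Spark-specific) can go at the top of main:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM was started with (-Xmx); with
        // --driver-memory 4G this should report roughly 4 GB.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.2f GB%n",
                maxHeapBytes / (1024.0 * 1024.0 * 1024.0));
    }
}
```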
It's not even related to Spark; it happens when I construct the StanfordCoreNLP pipeline, which is just copied and pasted from https://stanfordnlp.github.io/CoreNLP/ner.html#java-api-example.
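For context, the construction in question is essentially the snippet from that page; a minimal version of it (annotator list trimmed to what NER needs) looks like this:

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NERDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // NER depends on the upstream tokenize/ssplit/pos/lemma annotators.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // The constructor below is where the OOM above was thrown: it loads
        // the three CRF models and compiles the gazetteer patterns.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("Joe Smith lives in California.");
        pipeline.annotate(doc);
        doc.tokens().forEach(tok ->
                System.out.println(tok.word() + "\t" + tok.ner()));
    }
}
```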