Hi,
sorry, it seems I'm opening all the issues :)
I noticed that geolocator gets stuck and keeps the CPU at 100% on big texts (for one particular text I let it run the whole weekend and it never finished).
I had this problem with a text of 141977 characters. I know you developed it for tweets, so I guess that's too big.
Here is the jstack trace:
jstack 21373
2013-12-02 11:42:57
Full thread dump OpenJDK 64-Bit Server VM (23.7-b01 mixed mode):
"Attach Listener" daemon prio=10 tid=0x00007f508c001000 nid=0x5436 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Service Thread" daemon prio=10 tid=0x00007f50c81da000 nid=0x5389 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00007f50c81d7800 nid=0x5388 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00007f50c81d4800 nid=0x5387 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007f50c81d2800 nid=0x5386 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007f50c8179000 nid=0x5385 in Object.wait() [0x00007f50bb5f4000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
waiting on <0x0000000681e22ca0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
locked <0x0000000681e22ca0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)
"Reference Handler" daemon prio=10 tid=0x00007f50c8177000 nid=0x5384 in Object.wait() [0x00007f50bb6f5000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
waiting on <0x0000000681e22cd0> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
locked <0x0000000681e22cd0> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007f50c8009000 nid=0x537e runnable [0x00007f50cf76b000]
java.lang.Thread.State: RUNNABLE
at java.lang.Character$UnicodeBlock.of(Character.java:3012)
at java.util.regex.Pattern$Block.isSatisfiedBy(Pattern.java:3785)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
at java.util.regex.Pattern$Curly.match(Pattern.java:4125)
at java.util.regex.Pattern$Start.match(Pattern.java:3408)
at java.util.regex.Matcher.search(Matcher.java:1199)
at java.util.regex.Matcher.find(Matcher.java:592)
at java.util.regex.Matcher.replaceAll(Matcher.java:902)
at edu.cmu.geoparser.common.StringUtil.deAccent(StringUtil.java:126)
at edu.cmu.geoparser.resource.trie.Trie.search(Trie.java:282)
at edu.cmu.geoparser.nlp.ner.FeatureExtractor.FeatureGenerator.gazTag(FeatureGenerator.java:588)
at edu.cmu.geoparser.nlp.ner.FeatureExtractor.FeatureGenerator.extractFeature(FeatureGenerator.java:237)
at edu.cmu.geoparser.parser.english.EnglishMTNERParser.parse(EnglishMTNERParser.java:82)
at edu.cmu.geoparser.parser.english.EnglishParser.parse(EnglishParser.java:57)
at geoparsing.Geoparse.extractGeolocator(Geoparse.java:271)
at geoparsing.Geoparse.extractLocations(Geoparse.java:154)
at geoparsing.Geoparse.main(Geoparse.java:120)
at geoparsing.Main.main(Main.java:29)
"VM Thread" prio=10 tid=0x00007f50c816e800 nid=0x5383 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f50c8016800 nid=0x537f runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f50c8018800 nid=0x5380 runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f50c801a800 nid=0x5381 runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f50c801c000 nid=0x5382 runnable
"VM Periodic Task Thread" prio=10 tid=0x00007f50c81e4800 nid=0x538a waiting on condition
JNI global references: 162
Anyway, I found a workaround by splitting the text. For one document I tried chunk sizes of 500, 1000 and 10000, and it took ~3 min, ~6 min and 1.76 hours respectively to process the same text (each run includes loading the gazetteer).
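In case it helps anyone else, here is roughly what the splitting looks like. This is only a minimal sketch: `TextChunker` and `splitIntoChunks` are names I made up for illustration and are not part of geolocator, and the call into the parser is left as a comment. It cuts at the last whitespace inside each window so words are not split in half, and falls back to a hard cut if a window has no whitespace at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of geolocator.
public class TextChunker {

    /**
     * Splits text into pieces of at most maxChars characters, preferring to
     * cut right after the last whitespace inside the window.
     * Concatenating the returned pieces gives back the original text.
     */
    public static List<String> splitIntoChunks(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Walk back to the nearest whitespace; keep the hard cut
                // if the window contains none.
                int cut = end;
                while (cut > start && !Character.isWhitespace(text.charAt(cut - 1))) {
                    cut--;
                }
                if (cut > start) {
                    end = cut;
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Pittsburgh is in Pennsylvania. Paris is in France.";
        for (String chunk : splitIntoChunks(text, 20)) {
            // In the real program each chunk would be handed to the parser here.
            System.out.println("[" + chunk + "]");
        }
    }
}
```

One caveat: a multi-word place name that straddles a cut (e.g. "New York") ends up in two different chunks, so splitting on sentence or paragraph boundaries is probably safer when the input has them.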