geoparser / geolocator

Apache License 2.0
Stuck with big texts #4

Open damienpalacio opened 10 years ago

damienpalacio commented 10 years ago

Hi,

Sorry, it seems I'm opening all the issues :) I noticed that geolocator gets stuck and keeps using 100% CPU with big texts (for one particular text I let it run the whole weekend and nothing happened).

I had this problem with a text of 141977 characters. I know you developed it for tweets, so I guess that's too big.

Here is the jstack trace:

jstack 21373
2013-12-02 11:42:57
Full thread dump OpenJDK 64-Bit Server VM (23.7-b01 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f508c001000 nid=0x5436 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Service Thread" daemon prio=10 tid=0x00007f50c81da000 nid=0x5389 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f50c81d7800 nid=0x5388 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f50c81d4800 nid=0x5387 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f50c81d2800 nid=0x5386 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f50c8179000 nid=0x5385 in Object.wait() [0x00007f50bb5f4000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)

"Reference Handler" daemon prio=10 tid=0x00007f50c8177000 nid=0x5384 in Object.wait() [0x00007f50bb6f5000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)

"main" prio=10 tid=0x00007f50c8009000 nid=0x537e runnable [0x00007f50cf76b000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Character$UnicodeBlock.of(Character.java:3012)
        at java.util.regex.Pattern$Block.isSatisfiedBy(Pattern.java:3785)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
        at java.util.regex.Pattern$Curly.match(Pattern.java:4125)
        at java.util.regex.Pattern$Start.match(Pattern.java:3408)
        at java.util.regex.Matcher.search(Matcher.java:1199)
        at java.util.regex.Matcher.find(Matcher.java:592)
        at java.util.regex.Matcher.replaceAll(Matcher.java:902)
        at edu.cmu.geoparser.common.StringUtil.deAccent(StringUtil.java:126)
        at edu.cmu.geoparser.resource.trie.Trie.search(Trie.java:282)
        at edu.cmu.geoparser.nlp.ner.FeatureExtractor.FeatureGenerator.gazTag(FeatureGenerator.java:588)
        at edu.cmu.geoparser.nlp.ner.FeatureExtractor.FeatureGenerator.extractFeature(FeatureGenerator.java:237)
        at edu.cmu.geoparser.parser.english.EnglishMTNERParser.parse(EnglishMTNERParser.java:82)
        at edu.cmu.geoparser.parser.english.EnglishParser.parse(EnglishParser.java:57)
        at geoparsing.Geoparse.extractGeolocator(Geoparse.java:271)
        at geoparsing.Geoparse.extractLocations(Geoparse.java:154)
        at geoparsing.Geoparse.main(Geoparse.java:120)
        at geoparsing.Main.main(Main.java:29)

"VM Thread" prio=10 tid=0x00007f50c816e800 nid=0x5383 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f50c8016800 nid=0x537f runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f50c8018800 nid=0x5380 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f50c801a800 nid=0x5381 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f50c801c000 nid=0x5382 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007f50c81e4800 nid=0x538a waiting on condition

JNI global references: 162

Anyway, I found a workaround by splitting the text. For one document I tried different cut sizes (500, 1000 and 10000), and it took respectively ~3 min, ~6 min and 1.76 hours to process the same text (each time including loading the gazetteer).
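
For anyone hitting the same problem, here is a minimal sketch of this kind of splitting. It is not the exact code used for the timings above: `TextChunker` and `split` are hypothetical names, and cutting at whitespace is an assumption made so that place names are less likely to be broken across two chunks. Each chunk would then be passed to the parser separately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of geolocator.
final class TextChunker {
    private TextChunker() {}

    /** Splits text into pieces of at most maxChars chars, preferring to cut after whitespace. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Walk back to just after the last whitespace inside the window.
                int cut = end;
                while (cut > start && !Character.isWhitespace(text.charAt(cut - 1))) {
                    cut--;
                }
                // If the window contains no whitespace at all, fall back to a hard cut.
                if (cut > start) {
                    end = cut;
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Pittsburgh is in Pennsylvania. Paris is in France.";
        for (String chunk : split(text, 20)) {
            // In real use, each chunk would be handed to the parser here.
            System.out.println("[" + chunk + "]");
        }
    }
}
```

Concatenating the chunks gives back the original text, so character offsets can be mapped back by keeping a running sum of the chunk lengths. Since the gazetteer load is included in each of the timings above, loading it once and reusing it across chunks would presumably help further.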