geoparser / geolocator

Apache License 2.0

Update to Lucene4 #3

Open damienpalacio opened 10 years ago

damienpalacio commented 10 years ago

Good morning,

This isn't really an issue, but when I used geolocator in my system I ran into compatibility problems with Lucene, because I'm using Lucene 4. So I updated the geolocator code to make it work with Lucene 4. It seems it isn't possible to attach anything other than images here, so I'll just paste the changes in case you're interested in updating too. Otherwise I can send the modified files by email. (Line numbers are approximate.)

    //import org.apache.lucene.queryParser.ParseException;
    //import org.apache.lucene.queryParser.QueryParser;
    // replaced by:
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;

in:
    edu.cmu.geoparser.Disambiguation.utils.TimeWindow4Tweets (line 37)
    edu.cmu.geoparser.nlp.ner.FeatureExtractor.FeatureGenerator (line 35)
    edu.cmu.geoparser.nlp.spelling.ArabDictionaryMerging (line 18)
    edu.cmu.geoparser.nlp.spelling.DictionaryMerging (line 18)
    edu.cmu.geoparser.nlp.spelling.EuroLangMisspellParser (line 15)
    edu.cmu.geoparser.nlp.spelling.MisspellParser (line 5)
    edu.cmu.geoparser.parser.english.EnglishRuleToponymParser (line 35)
    edu.cmu.geoparser.parser.spanish.SpanishRuleToponymParser (line 34)
    edu.cmu.geoparser.parser.TPParser (line 29)
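Since the import change is the same in every file listed, it could be applied mechanically. Here is a minimal sketch of that idea in plain Java. The class name `LuceneImportRewriter` and its `rewrite` method are hypothetical helpers, not part of geolocator or Lucene. Only the two package renames quoted above are taken from this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of geolocator: rewrites the two Lucene 3
// query-parser imports to their Lucene 4 package names in a source string.
final class LuceneImportRewriter {

    // Old fully qualified name -> new fully qualified name (as listed in this issue).
    private static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("org.apache.lucene.queryParser.ParseException",
                    "org.apache.lucene.queryparser.classic.ParseException");
        RENAMES.put("org.apache.lucene.queryParser.QueryParser",
                    "org.apache.lucene.queryparser.classic.QueryParser");
    }

    static String rewrite(String source) {
        String result = source;
        for (Map.Entry<String, String> e : RENAMES.entrySet()) {
            // Match the full import statement text, so other mentions of the
            // class name elsewhere in the file stay unchanged.
            result = result.replace("import " + e.getKey() + ";",
                                    "import " + e.getValue() + ";");
        }
        return result;
    }

    public static void main(String[] args) {
        String before = "import org.apache.lucene.queryParser.QueryParser;\n"
                      + "import java.util.HashSet;\n";
        System.out.print(rewrite(before));
    }
}
```

This only covers the imports. The constructor and field changes described in the rest of this comment still need to be made by hand.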

Delete ", new HashSet()" from the constructor calls (SpanishAnalyzer, QueryParser, StandardAnalyzer), for example:

    //qp = new QueryParser(Version.LUCENE_36, indexpath, new StandardAnalyzer(Version.LUCENE_36, new HashSet()));
    // replaced by:
    qp = new QueryParser(Version.LUCENE_36, indexpath, new StandardAnalyzer(Version.LUCENE_36));

in:
    edu.cmu.geoparser.Disambiguation.utils.TimeWindow4Tweets (line 89)
    edu.cmu.geoparser.io.GetWriter (line 51)
    edu.cmu.geoparser.nlp.spelling.ArabDictionaryMerging (line 48)
    edu.cmu.geoparser.nlp.spelling.DictionaryMerging (line 48)
    edu.cmu.geoparser.nlp.spelling.EuroLangMisspellParser (line 47)

It seems IndexSearcher is no longer closed in Lucene 4, so this call needs to be removed:

    //indexSearcher.close();

in:
    edu.cmu.geoparser.nlp.spelling.ArabDictionaryMerging (line 55)
    edu.cmu.geoparser.nlp.spelling.DictionaryMerging (line 55)
    edu.cmu.geoparser.resource.trie.IndexSupportedTrie (line 83)

Last part: the field types changed (NumericField no longer exists, and Field is no longer constructed this way), and the method .setValue() is replaced by .setStringValue(). For example:

edu.cmu.geoparser.Disambiguation.utils.IndexTweets

Line 60:

    // FIX Lucene 4
    //NumericField nftime = new NumericField("CREATEDAT", Field.Store.YES, true);
    //Field sfjson = new Field("JSON", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    // replaced by:
    DoubleField nftime = new DoubleField("CREATEDAT", 0.0, Field.Store.YES);
    TextField sfjson = new TextField("JSON", "", Field.Store.YES);

Line 84:

    // FIX Lucene 4
    //sfjson.setValue(line);
    // replaced by:
    sfjson.setStringValue(line);

edu.cmu.geoparser.resource.gazindexing.GazIndexer

Line 95:

    // FIX Lucene 4
    //NumericField nfid = new NumericField("ID", Field.Store.YES, true);
    DoubleField nfid = new DoubleField("ID", 0.0, Field.Store.YES);
    //NumericField nflong = new NumericField("LONGTITUDE", Field.Store.YES, true);
    DoubleField nflong = new DoubleField("LONGTITUDE", 0.0, Field.Store.YES);
    //NumericField nfla = new NumericField("LATITUDE", Field.Store.YES, true);
    DoubleField nfla = new DoubleField("LATITUDE", 0.0, Field.Store.YES);
    //NumericField nfpop = new NumericField("POPULATION", Field.Store.YES, true);
    DoubleField nfpop = new DoubleField("POPULATION", 0.0, Field.Store.YES);
    //Field sforigin = new Field("ORIGIN", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sforigin = new TextField("ORIGIN", "", Field.Store.YES);
    //Field normws = new Field("NORM-WS", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField normws = new TextField("NORM-WS", "", Field.Store.YES);
    //Field normnws = new Field("NORM-NO-WS", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField normnws = new TextField("NORM-NO-WS", "", Field.Store.YES);
    //Field sfotherlang = new Field("OTHERLANG", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sfotherlang = new TextField("OTHERLANG", "", Field.Store.YES);
    //Field sfunigram = new Field("UNIGRAM", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sfunigram = new TextField("UNIGRAM", "", Field.Store.YES);
    //Field sfbigram = new Field("BIGRAM", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sfbigram = new TextField("BIGRAM", "", Field.Store.YES);
    //Field sftrigram = new Field("TRIGRAM", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sftrigram = new TextField("TRIGRAM", "", Field.Store.YES);
    //Field sfposition = new Field("POSITION", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sfposition = new TextField("POSITION", "", Field.Store.YES);
    //Field sfcountrystate = new Field("COUNTRYSTATE", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sfcountrystate = new TextField("COUNTRYSTATE", "", Field.Store.YES);
    //Field sffeature = new Field("FEATURE", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sffeature = new TextField("FEATURE", "", Field.Store.YES);
    //Field sftimezone = new Field("TIMEZONE", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sftimezone = new TextField("TIMEZONE", "", Field.Store.YES);

Line 181:

    // FIX Lucene 4
    // sforigin.setValue(phrase);// 5
    // normws.setValue(StringUtil.getDeAccentLoweredString(phrase));
    // normnws.setValue(StringUtil.getDeAccentLoweredString(phrase).replaceAll(" ", ""));
    // sfotherlang.setValue(otherlang);
    sforigin.setStringValue(phrase);// 5
    normws.setStringValue(StringUtil.getDeAccentLoweredString(phrase));
    normnws.setStringValue(StringUtil.getDeAccentLoweredString(phrase).replaceAll(" ", ""));
    sfotherlang.setStringValue(otherlang);

    getIndexFeatures(phrase);

    // sfunigram.setValue(getUnigram());
    // sfbigram.setValue(getBigram());
    // sftrigram.setValue(getTrigram());
    // sfposition.setValue(getPositionUnigram());// 10
    // sfcountrystate.setValue(country + "" + state);
    // sffeature.setValue(featureclass + "" + feature);
    // sftimezone.setValue(timezone);// 13
    sfunigram.setStringValue(getUnigram());
    sfbigram.setStringValue(getBigram());
    sftrigram.setStringValue(getTrigram());
    sfposition.setStringValue(getPositionUnigram());// 10
    sfcountrystate.setStringValue(country + "" + state);
    sffeature.setStringValue(featureclass + "" + feature);
    sftimezone.setStringValue(timezone);// 13

edu.cmu.geoparser.resource.gazindexing.GazIndexerForAlternativeNames

Line 94:

    // FIX Lucene 4
    //NumericField nfid = new NumericField("ID", Field.Store.YES, true);
    DoubleField nfid = new DoubleField("ID", 0.0, Field.Store.YES);
    //NumericField nflong = new NumericField("LONGTITUDE", Field.Store.YES, true);
    DoubleField nflong = new DoubleField("LONGTITUDE", 0.0, Field.Store.YES);
    //NumericField nfla = new NumericField("LATITUDE", Field.Store.YES, true);
    DoubleField nfla = new DoubleField("LATITUDE", 0.0, Field.Store.YES);
    //NumericField nfpop = new NumericField("POPULATION", Field.Store.YES, true);
    DoubleField nfpop = new DoubleField("POPULATION", 0.0, Field.Store.YES);
    //Field sforigin = new Field("ORIGIN", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sforigin = new TextField("ORIGIN", "", Field.Store.YES);
    //Field normws = new Field("NORM-WS", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField normws = new TextField("NORM-WS", "", Field.Store.YES);
    //Field normnws = new Field("NORM-NO-WS", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField normnws = new TextField("NORM-NO-WS", "", Field.Store.YES);
    //Field sfotherlang = new Field("OTHERLANG", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sfotherlang = new TextField("OTHERLANG", "", Field.Store.YES);
    //Field sfunigram = new Field("UNIGRAM", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sfunigram = new TextField("UNIGRAM", "", Field.Store.YES);
    //Field sfbigram = new Field("BIGRAM", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sfbigram = new TextField("BIGRAM", "", Field.Store.YES);
    //Field sftrigram = new Field("TRIGRAM", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sftrigram = new TextField("TRIGRAM", "", Field.Store.YES);
    //Field sfposition = new Field("POSITION", false, "", Field.Store.YES, Index.ANALYZED, TermVector.NO);
    TextField sfposition = new TextField("POSITION", "", Field.Store.YES);
    //Field sfcountrystate = new Field("COUNTRYSTATE", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sfcountrystate = new TextField("COUNTRYSTATE", "", Field.Store.YES);
    //Field sffeature = new Field("FEATURE", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sffeature = new TextField("FEATURE", "", Field.Store.YES);
    //Field sftimezone = new Field("TIMEZONE", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField sftimezone = new TextField("TIMEZONE", "", Field.Store.YES);

Line 180:

    // FIX Lucene 4
    // sforigin.setValue(phrase);// 5
    // normws.setValue(StringUtil.getDeAccentLoweredString(phrase));
    // normnws.setValue(StringUtil.getDeAccentLoweredString(phrase).replaceAll(" ", ""));
    // sfotherlang.setValue(otherlang);
    sforigin.setStringValue(phrase);// 5
    normws.setStringValue(StringUtil.getDeAccentLoweredString(phrase));
    normnws.setStringValue(StringUtil.getDeAccentLoweredString(phrase).replaceAll(" ", ""));
    sfotherlang.setStringValue(otherlang);

Line 193:

    // FIX Lucene 4
    // sfunigram.setValue(getUnigram());
    // sfbigram.setValue(getBigram());
    // sftrigram.setValue(getTrigram());
    // sfposition.setValue(getPositionUnigram());// 10
    // sfcountrystate.setValue(country + "" + state);
    // sffeature.setValue(featureclass + "" + feature);
    // sftimezone.setValue(timezone);// 13
    sfunigram.setStringValue(getUnigram());
    sfbigram.setStringValue(getBigram());
    sftrigram.setStringValue(getTrigram());
    sfposition.setStringValue(getPositionUnigram());// 10
    sfcountrystate.setStringValue(country + "" + state);
    sffeature.setStringValue(featureclass + "" + feature);
    sftimezone.setStringValue(timezone);// 13

Line 220:

    // FIX Lucene 4
    // sforigin.setValue(ph);// 5
    // normws.setValue(StringUtil.getDeAccentLoweredString(ph));
    // normnws.setValue(StringUtil.getDeAccentLoweredString(ph).replaceAll(" ", ""));
    // sfotherlang.setValue("");
    sforigin.setStringValue(ph);// 5
    normws.setStringValue(StringUtil.getDeAccentLoweredString(ph));
    normnws.setStringValue(StringUtil.getDeAccentLoweredString(ph).replaceAll(" ", ""));
    sfotherlang.setStringValue("");

Line 233:

    // FIX Lucene 4
    // sfunigram.setValue(getUnigram());
    // sfbigram.setValue(getBigram());
    // sftrigram.setValue(getTrigram());
    // sfposition.setValue(getPositionUnigram());// 10
    // sfcountrystate.setValue(country + "" + state);
    // sffeature.setValue(featureclass + "" + feature);
    // sftimezone.setValue(timezone);// 13
    sfunigram.setStringValue(getUnigram());
    sfbigram.setStringValue(getBigram());
    sftrigram.setStringValue(getTrigram());
    sfposition.setStringValue(getPositionUnigram());// 10
    sfcountrystate.setStringValue(country + "" + state);
    sffeature.setStringValue(featureclass + "_" + feature);
    sftimezone.setStringValue(timezone);// 13

Line 269:

    // FIX Lucene 4
    //iw.optimize();

edu.cmu.geoparser.resource.TweetIndexer

Line 105:

    // FIX Lucene 4
    //Field fcontent = new Field("CONTENT", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField fcontent = new TextField("CONTENT", "", Field.Store.YES);
    //Field fjson = new Field("JSON", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    //NumericField ftime = new NumericField("TIME", Field.Store.YES, true);
    DoubleField ftime = new DoubleField("TIME", 0.0, Field.Store.YES);
    //Field fuserdesc = new Field("DESC", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField fuserdesc = new TextField("DESC", "", Field.Store.YES);
    //Field fuserlocation = new Field("USERLOC", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField fuserlocation = new TextField("USERLOC", "", Field.Store.YES);
    //Field ftimezone = new Field("TIMEZONE", false, "", Field.Store.YES, Index.NOT_ANALYZED, TermVector.NO);
    TextField ftimezone = new TextField("TIMEZONE", "", Field.Store.YES);

Line 167:

    // FIX Lucene 4
    // fcontent.setValue(content);
    fcontent.setStringValue(content);

Tested with Lucene 4.5.1 (I added lucene-analyzers-common-4.5.1.jar, lucene-core-4.5.1.jar, lucene-queries-4.5.1.jar and lucene-queryparser-4.5.1.jar), and the geoparsing works the same (I didn't try the GeoNames indexing part). I hope I didn't forget anything.

geoparser commented 10 years ago

Hi, Lucene 4.x is not compatible with Lucene 3.6. This is because Lucene 3.6 uses the class "Field" for everything, and different data types use that same class with different parameters. In 4.x, "Field" is not used this way. Instead, a set of subclasses such as StringField, IntField, TextField and DoubleField is used to store the fields, and creating fields the old way is not supported. Please use Lucene 3.6 for this project. Thanks.

damienpalacio commented 10 years ago

Hi,

Yes, but my system uses several APIs, and one of them requires Lucene 4. If I load Lucene 4, I can't also load Lucene 3, because the two contain similar classes. You can just use TextField for everything instead of Field, as I did.

With the changes it currently seems to work well: I processed a text and got the same results.
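To see which Lucene generation a system actually has loaded, one option is to probe for the two query-parser class names quoted earlier in this thread. The sketch below is a hypothetical diagnostic in plain Java, and `LuceneClasspathCheck`, `isPresent` and `describe` are names made up for illustration. It assumes the Lucene 3 parser lives in `org.apache.lucene.queryParser` and the Lucene 4 one in `org.apache.lucene.queryparser.classic`, as in the import change above.

```java
// Hypothetical diagnostic, not part of geolocator: reports which Lucene
// query-parser package is visible on the classpath.
final class LuceneClasspathCheck {

    static boolean isPresent(String className) {
        try {
            // initialize=false: only resolve the class, do not run static initializers.
            Class.forName(className, false, LuceneClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    static String describe(boolean lucene3Parser, boolean lucene4Parser) {
        if (lucene3Parser && lucene4Parser) return "both (conflict likely)";
        if (lucene4Parser) return "Lucene 4 style";
        if (lucene3Parser) return "Lucene 3 style";
        return "none";
    }

    public static void main(String[] args) {
        boolean v3 = isPresent("org.apache.lucene.queryParser.QueryParser");
        boolean v4 = isPresent("org.apache.lucene.queryparser.classic.QueryParser");
        System.out.println("Query parser on classpath: " + describe(v3, v4));
    }
}
```

Run with the same classpath as the application. If no Lucene jar is present, it simply reports "none".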

geoparser commented 10 years ago

Glad to hear that. TextField is the analyzed (tokenized) field, and StringField is not analyzed. For efficiency, StringField will be better. On the other hand, using TextField enables partial matches of a phrase, i.e. a query can match just part of the phrase.
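The difference can be pictured with a toy model in plain Java. This is not Lucene code and not how Lucene implements these fields. `FieldMatchingSketch` and its methods are made up for illustration, and the "analyzer" here is assumed to just lowercase and split on whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model only: plain Java, no Lucene. It mimics the difference between an
// analyzed field (split into terms) and a non-analyzed field (one term).
final class FieldMatchingSketch {

    // Rough stand-in for an analyzer: lowercase and split on whitespace.
    static List<String> analyzedTerms(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // TextField-like behaviour: any single term of the value can match.
    static boolean matchesAnalyzed(String fieldValue, String term) {
        return analyzedTerms(fieldValue).contains(term.toLowerCase(Locale.ROOT));
    }

    // StringField-like behaviour: only the exact, unmodified value matches.
    static boolean matchesExact(String fieldValue, String term) {
        return fieldValue.equals(term);
    }

    public static void main(String[] args) {
        String value = "New York";
        System.out.println(matchesAnalyzed(value, "york"));   // true
        System.out.println(matchesExact(value, "york"));      // false
        System.out.println(matchesExact(value, "New York"));  // true
    }
}
```

In this model, switching a field from the analyzed to the exact variant means that lookups by part of a place name stop matching, which is the trade-off to keep in mind when replacing every Field with TextField.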

I am working on a new version of the index, using Lucene 4.5.1. The indexing is done, but I have added some new interfaces to the program. I will probably publish it later next month.

Thanks.

damienpalacio commented 10 years ago

Thanks for the information! Yes, I didn't really look for the most efficient update. I wanted to see if it was possible without too many changes.

Nice! I'm looking forward to the new version! I have a problem where the program gets stuck and keeps using CPU on big texts. I haven't found why or where yet, but I will post it when I find it.