Sorry I didn't see this for so long. If you've got it down to a single file, e-mail it to me and I'll try to take a look at some point.
Original comment by widd...@google.com
on 11 Jun 2010 at 8:58
It seems that this problem happens when tokens are removed from a Lucene index with a StopFilter. I had the same problem when I used this Analyzer with a rather large list of stopwords:
TokenStream result = new StandardTokenizer(matchVersion, reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion), result, stopSet);
result = new GermanStemFilter(result, exclusionSet);
return result;
This was the relevant line used to build the index:
doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS));
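For context, the surrounding indexing code might look roughly like this (a minimal sketch using the Lucene 2.9/3.0 API from this thread; "analyzer" is the custom stopword-filtering analyzer above, "text" is the document body, and both are placeholders):
-------begin code snippet-------
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Sketch only; assumes it runs inside a method that declares "throws IOException".
RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, analyzer, true,
    IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS));
writer.addDocument(doc);
writer.close();
-------end code snippet-------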
It seems that the term positions which Lucene saves still account for the stop words that are removed later. TermTermVectorsFromLucene then just counts the number of terms that were actually saved in the index:
for (int i = 0; i < numwords; ++i) {
    numpositions += freqs[i];
}
So in this case the "posns" arrays can contain position values higher than the last index of the "positions" array that gets sized from this count.
When I disable the StopFilter, the program runs just fine.
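One way to see the mismatch directly is to read a term vector back and compare the sum of the frequencies with the largest stored position (a rough sketch, assuming the same Lucene 3.x term vector API; "reader" is an open IndexReader, and the document number and field name are just examples):
-------begin code snippet-------
// Compare the frequency sum (used to size the positions array) with the
// largest position actually stored for a stopword-filtered document.
TermPositionVector vex =
    (TermPositionVector) reader.getTermFreqVector(0, "contents");
int[] freqs = vex.getTermFrequencies();

int sumOfFreqs = 0;    // what processTermPositionVector uses as the array size
int maxPosition = -1;  // what the stored positions actually require
for (int i = 0; i < freqs.length; ++i)
{
    sumOfFreqs += freqs[i];
    for (int p : vex.getTermPositions(i))
    {
        maxPosition = Math.max(maxPosition, p);
    }
}
// With stop words removed but position increments kept, maxPosition can reach
// or exceed sumOfFreqs, so an array of size sumOfFreqs gets indexed out of bounds.
System.out.println("sum of freqs = " + sumOfFreqs + ", max position = " + maxPosition);
-------end code snippet-------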
Original comment by sebastian%realpath.org@gtempaccount.com
on 14 Jun 2010 at 3:10
I've found the same problem and fixed it in TermTermVectorsFromLucene.processTermPositionVector(TermPositionVector).
As you've noted above, numpositions is incorrectly calculated by summing the frequencies of the words that survive the stop filter. The actual number of positions is the sum of the words that survive the stop filter and the words that were removed by the filter.
A dumb but easy approach is to instead determine the greatest position that will be used. I call it dumb because it examines every position to find the greatest. Here is what I mean:
-------begin code snippet-------
for (short tcn = 0; tcn < numwords; tcn++)
{
    int[] posns = vex.getTermPositions(tcn);
    for (int pc = 0; pc < posns.length; pc++)
    {
        numpositions = Math.max(numpositions, posns[pc]);
    }
}
numpositions++; // positions are zero-based, so the required size is the greatest position plus one
-------end code snippet-------
The next problem is that some entries of the positions array will be unused. My solution was to fill the array with -1 and then skip all the -1 entries when summing the vectors for a focus term.
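A rough sketch of that idea (illustrative only, not the committed code; "windowStart" and "windowEnd" stand in for whatever sliding-window bounds the caller uses):
-------begin code snippet-------
// Allocate one slot per position, including the gaps left by removed stop
// words, and mark every unused slot with -1.
int[] positions = new int[numpositions];
java.util.Arrays.fill(positions, -1);
for (short tcn = 0; tcn < numwords; tcn++)
{
    for (int pos : vex.getTermPositions(tcn))
    {
        positions[pos] = tcn;   // remember which term occupies this position
    }
}

// Later, when summing vectors around a focus term, skip the gaps.
for (int pos = windowStart; pos < windowEnd; pos++)
{
    if (positions[pos] == -1) continue;   // stop word slot, nothing stored here
    // ... add the vector for term index positions[pos] ...
}
-------end code snippet-------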
I'd be happy to provide the code if anyone wants it. I'm also open to suggestions for improving this method, perhaps by using the number of stop words omitted, or the original number of terms in the document if that is stored somewhere in the index.
Original comment by ThomasCR...@gmail.com
on 30 Oct 2010 at 2:23
I've committed a fix in revision 385 and tested it against an index of stopword-filtered documents. I believe this issue can be closed now.
I've changed the status to "Fixed"; waiting for another person to verify.
Original comment by ThomasCR...@gmail.com
on 17 Nov 2010 at 7:13
Looks good and tests all work out for me. Marking as verified, thanks so much
Thomas!
Original comment by widd...@google.com
on 17 Nov 2010 at 12:54
Original issue reported on code.google.com by
rmal...@gmail.com
on 20 May 2010 at 8:50