dileepajayakody / semanticvectors

Automatically exported from code.google.com/p/semanticvectors

Overflow in pitt.search.semanticvectors.TermTermVectorsFromLucene.processTermPositionVector #24

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Building a positional index from corpus with large files leads to this:

bash% java pitt.search.semanticvectors.BuildPositionalIndex -minfrequency 50 -dimension 100 -windowradius 10 positional_index
Lucene positional index being set to: positional_index
Lucene index = positional_index
Seedlength = 10
Vector length = 100
Minimum frequency = 50
Number non-alphabet characters = 0
Window radius = 10
Creating basic term vectors ...
There are 2166 terms (and 1 docs)
0 ... Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -32768
    at pitt.search.semanticvectors.TermTermVectorsFromLucene.processTermPositionVector(TermTermVectorsFromLucene.java:250)
    at pitt.search.semanticvectors.TermTermVectorsFromLucene.<init>(TermTermVectorsFromLucene.java:182)
    at pitt.search.semanticvectors.BuildPositionalIndex.main(BuildPositionalIndex.java:143)

Original issue reported on code.google.com by rmal...@gmail.com on 20 May 2010 at 8:50

GoogleCodeExporter commented 9 years ago
Sorry I didn't see this for so long. If you've got it down to a single file, e-mail it to me and I'll try to take a look at some point.

Original comment by widd...@google.com on 11 Jun 2010 at 8:58

GoogleCodeExporter commented 9 years ago
It seems that this problem happens when tokens are removed from a Lucene index 
with a StopFilter. I had the same problem when I used this Analyzer with a 
rather big list of stopwords:

TokenStream result = new StandardTokenizer(matchVersion, reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
    result, stopSet);
result = new GermanStemFilter(result, exclusionSet);
return result;

This was the relevant line to build the index:

doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS));
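
For context, here is a rough sketch of the surrounding indexing code (Lucene 2.9/3.0-era API; the directory path, writer settings, and variable names are my assumptions, and "analyzer" is the Analyzer shown above):

-------begin code snippet-------
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Sketch only: illustrative path and variable names, not my exact code.
IndexWriter writer = new IndexWriter(
    FSDirectory.open(new File("positional_index")),
    analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS));
writer.addDocument(doc);
writer.close();
-------end code snippet-------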

It seems that the TermPositions which Lucene saves still include the stop words that are removed later. TermTermVectorsFromLucene, however, only counts the number of terms that were actually saved in the index:

for (int i = 0; i < numwords; ++i) {
  numpositions += freqs[i];
}

So in this case the position values in "posns" can be higher than the number of slots that actually exist in the "positions" array.

When I disable the StopFilter, the program runs just fine.
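
To illustrate the mismatch, here is a rough sketch (Lucene 2.9/3.0-era API; the index path, field name, and document number are assumptions based on this thread) that compares the sum of term frequencies with the largest recorded position. With a StopFilter that preserves position increments, the maximum position can exceed the frequency sum, which is exactly what overflows the positions array:

-------begin code snippet-------
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.store.FSDirectory;

// Sketch only: open the positional index and inspect doc 0's term vector.
IndexReader reader = IndexReader.open(FSDirectory.open(new File("positional_index")));
TermPositionVector vex =
    (TermPositionVector) reader.getTermFreqVector(0, "contents");
int[] freqs = vex.getTermFrequencies();
int sumOfFreqs = 0;
int maxPosition = -1;
for (int i = 0; i < freqs.length; i++) {
  sumOfFreqs += freqs[i];
  for (int p : vex.getTermPositions(i)) {
    maxPosition = Math.max(maxPosition, p);
  }
}
// With stopwords removed but position increments preserved, maxPosition + 1
// can exceed sumOfFreqs, so an array sized by sumOfFreqs is indexed out of bounds.
System.out.println("sum of freqs = " + sumOfFreqs
    + ", max position + 1 = " + (maxPosition + 1));
reader.close();
-------end code snippet-------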

Original comment by sebastian%realpath.org@gtempaccount.com on 14 Jun 2010 at 3:10

GoogleCodeExporter commented 9 years ago
I've found the same problem and fixed it in the file 
TermTermVectorsFromLucene:processTermPositionVector(TermPositionVector)

As you've noted above, numpositions is incorrectly calculated by summing the frequencies of only the words that survive the StopFilter. The actual number of positions is the count of the words that survive the StopFilter plus the count of the words the filter removed.

A dumb but easy approach is to instead determine the greatest position that will be used. I call it dumb because it examines every position to find the greatest. Here is what I mean:

-------begin code snippet-------
    // Use an int loop counter to avoid any risk of short overflow in
    // documents with very many distinct terms.
    for (int tcn = 0; tcn < numwords; tcn++) {
      int[] posns = vex.getTermPositions(tcn);
      for (int pc = 0; pc < posns.length; pc++) {
        numpositions = Math.max(numpositions, posns[pc]);
      }
    }
    numpositions++; // positions are zero-based, so the array needs greatest position + 1 slots

-------end code snippet-------
The next problem is that some slots of the positions array will be unused. My solution was to fill the array with -1 and then skip the -1 entries when summing the vectors for a focus term (see the sketch below).
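
For illustration, a rough sketch of that padding approach (this is not the committed fix; vex, numwords, numpositions, and the window bounds are illustrative names taken from the discussion above):

-------begin code snippet-------
// Sketch only: mark unused slots with -1, then skip them when summing.
int[] positions = new int[numpositions];
java.util.Arrays.fill(positions, -1);          // -1 marks slots held by removed stopwords
for (int termIndex = 0; termIndex < numwords; termIndex++) {
  for (int pos : vex.getTermPositions(termIndex)) {
    positions[pos] = termIndex;                // remember which term occupies this position
  }
}
// When summing context vectors around a focus position, skip the -1 slots.
for (int w = windowStart; w <= windowEnd; w++) {
  if (positions[w] == -1) continue;
  // ... add the basic vector of term positions[w] into the focus term's vector ...
}
-------end code snippet-------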

I'd be happy to provide code if anyone wants it. And I'm open to suggestions on 
improving this method, perhaps by knowing the number of stop words omitted or 
maybe by knowing the natural number of terms in the original document that 
might be stored in the index file.

Original comment by ThomasCR...@gmail.com on 30 Oct 2010 at 2:23

GoogleCodeExporter commented 9 years ago
I've committed a fix in revision 385. I've tested it against an index of stopword-filtered documents. I believe this issue can be closed now.

I've changed the status to "Fixed", waiting for another person to verify.

Original comment by ThomasCR...@gmail.com on 17 Nov 2010 at 7:13

GoogleCodeExporter commented 9 years ago
Looks good and tests all work out for me. Marking as verified, thanks so much 
Thomas!

Original comment by widd...@google.com on 17 Nov 2010 at 12:54