fozziethebeat / S-Space

The S-Space repository, from the AIrhead-Research group
GNU General Public License v2.0

ArrayIndexOutOfBoundsException at SparseHashMatrix.getRowVector(SparseHashMatrix.java:105) after commit f6038a3501e69c4ce6011de53a0aa2bf877710c2 #69

Closed johann-petrak closed 7 years ago

johann-petrak commented 9 years ago

After commit f6038a3501e69c4ce6011de53a0aa2bf877710c2 (or potentially a later commit), code that used to work now throws an ArrayIndexOutOfBoundsException.

Here is the exception:

java.lang.ArrayIndexOutOfBoundsException: 4
    at edu.ucla.sspace.matrix.SparseHashMatrix.getRowVector(SparseHashMatrix.java:105)
    at edu.ucla.sspace.matrix.SparseHashMatrix.getRowVector(SparseHashMatrix.java:42)
    at edu.ucla.sspace.common.GenericTermDocumentVectorSpace.getVector(GenericTermDocumentVectorSpace.java:315)
    at edu.ucla.sspace.common.DocumentVectorBuilder.buildVector(DocumentVectorBuilder.java:129)

Here is a minimal Groovy script that illustrates the problem:

import edu.ucla.sspace.common.DocumentVectorBuilder;
import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.vector.DoubleVector;
import edu.ucla.sspace.vsm.VectorSpaceModel;
import edu.ucla.sspace.vector.DenseVector;

import java.io.BufferedReader;

SemanticSpace candidateListSemSpace = new VectorSpaceModel();
candidateListSemSpace.processDocument(new BufferedReader(new StringReader("This is some text")));

Properties tficfConfig = new Properties();
tficfConfig.put(VectorSpaceModel.MATRIX_TRANSFORM_PROPERTY,"edu.ucla.sspace.matrix.TfIdfTransform");
candidateListSemSpace.processSpace(tficfConfig);
DocumentVectorBuilder tficfVectorBuilder = new DocumentVectorBuilder(candidateListSemSpace);
DoubleVector tficfContextVector =
    tficfVectorBuilder.buildVector(
        new BufferedReader(new StringReader("This is also some text")),
        new DenseVector(candidateListSemSpace.getVectorLength()));

Run this with groovy -cp <neededJars> file.groovy. With the latest sspace jar on the classpath it throws the exception; with an sspace build from before commit f6038a35... it creates a file in /tmp instead.

johann-petrak commented 8 years ago

Has there been any progress on this? The latest head, which I just cloned, still has this problem, and it is a serious one: S-Space either creates thousands of files in the /tmp directory or fails with this exception.

davidjurgens commented 8 years ago

I'll see if I can get this fixed in the next few days and push a new version to GitHub. Thanks for the reminder!

johann-petrak commented 7 years ago

Just had another try with df71a3cf323380e30550e2482ecc53bfba3801d7 and this is still a problem.

The reason is that, after the semantic space has been built, DocumentVectorBuilder.buildVector looks up the index of each word in the internal BasisMapping when creating a document vector. For an OOV (out-of-vocabulary) word, the BasisMapping instance returns the next free index instead of -1, because it is not in readOnly mode. That index lies beyond the rows of the matrix, hence the exception. There seems to be no API method to make sure the mapping gets switched to readOnly mode: the BasisMapping interface exposes the setReadOnly method, but the semantic space interface does not, nor is there a way to access the internal BasisMapping object.

The easiest way to fix this is probably the following: in DocumentVectorBuilder.buildVector, before looping over the entries of the termCounts map, get the set returned by sspace.getWords(); then, inside the loop, only try to get the vector if the word is in that set.
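
To make the proposed guard concrete, here is a minimal, self-contained Java sketch. It does not use the S-Space classes: ToyBasisMapping, ToySpace, buildGuarded and buildUnguarded are hypothetical stand-ins that only imitate the behaviour described above (a mapping that hands out the next free index for unseen words, and a row lookup that trusts that index). The real DocumentVectorBuilder and BasisMapping code may differ in detail.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class OovGuardSketch {

    /** Hypothetical stand-in for a basis mapping that is never read-only:
     *  an unseen word is assigned the next free index instead of -1. */
    public static class ToyBasisMapping {
        private final Map<String, Integer> indices = new LinkedHashMap<>();

        public int getDimension(String word) {
            Integer index = indices.get(word);
            if (index != null) {
                return index;
            }
            int next = indices.size();
            indices.put(word, next);
            return next;
        }

        public int size() { return indices.size(); }

        public Set<String> keys() { return indices.keySet(); }
    }

    /** Hypothetical stand-in for a processed space: one row per training word. */
    public static class ToySpace {
        private final ToyBasisMapping basis = new ToyBasisMapping();
        private double[][] rows = new double[0][];
        private Set<String> words = Collections.emptySet();

        public void processDocument(String text) {
            for (String token : tokenize(text)) {
                basis.getDimension(token);
            }
        }

        /** Freezes the row count and takes a snapshot of the vocabulary. */
        public void processSpace() {
            rows = new double[basis.size()][];
            for (int i = 0; i < rows.length; i++) {
                rows[i] = new double[] {1.0};
            }
            words = Collections.unmodifiableSet(new HashSet<>(basis.keys()));
        }

        public Set<String> getWords() { return words; }

        /** Trusts the mapping, so an OOV word yields an out-of-range row index. */
        public double[] getVector(String word) {
            return rows[basis.getDimension(word)];
        }
    }

    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    /** Unguarded loop: looks up every token, including OOV ones. */
    public static double buildUnguarded(ToySpace space, String text) {
        double sum = 0.0;
        for (String token : tokenize(text)) {
            sum += space.getVector(token)[0];
        }
        return sum;
    }

    /** Guarded loop: skips any token that is not in the space's word set. */
    public static double buildGuarded(ToySpace space, String text) {
        Set<String> known = space.getWords();
        double sum = 0.0;
        for (String token : tokenize(text)) {
            if (!known.contains(token)) {
                continue; // OOV word: never ask the mapping for an index
            }
            sum += space.getVector(token)[0];
        }
        return sum;
    }

    public static void main(String[] args) {
        ToySpace space = new ToySpace();
        space.processDocument("This is some text");
        space.processSpace();

        System.out.println("guarded sum: "
                + buildGuarded(space, "This is also some text"));
        try {
            buildUnguarded(space, "This is also some text");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getMessage());
        }
    }
}
```

In this toy setup the OOV word "also" is assigned index 4 in a space with only four rows, which mirrors the index in the reported exception. The guarded loop never asks the mapping about "also", so it neither fails nor adds a new entry to the mapping.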