fozziethebeat / S-Space

The S-Space repository, from the AIrhead-Research group
GNU General Public License v2.0

java.lang.ArrayIndexOutOfBoundsException #70

Open DeepthiKarnam opened 8 years ago

DeepthiKarnam commented 8 years ago

Feb 18, 2016 11:28:35 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at edu.ucla.sspace.matrix.DiagonalMatrix.checkIndices(DiagonalMatrix.java:78)
    at edu.ucla.sspace.matrix.DiagonalMatrix.get(DiagonalMatrix.java:94)
    at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibJ.factorize(SingularValueDecompositionLibJ.java:89)

The number of words in my corpus turns out to be 6000+. Is the code unable to reduce the size of the vector from 6000+ to 300? What is the solution?

davidjurgens commented 8 years ago

Hi Deepthi,

Which version of the code are you using? It looks like the stack trace you have is using SvdlibJ, which we haven't supported for some time (their implementation is known to have errors in its SVD results). The latest code should definitely support reducing from 6000 dimensions to 300. How many documents are in your corpus?

Thanks, David

DeepthiKarnam commented 8 years ago

Hi David, thanks for your prompt reply. I have a few follow-up questions.

I am using the jar "sspace-wordsi-2.0-jar-with-dependencies.jar". Is this not supported?

Currently, I am running on a sample of 200 documents. However, the entire corpus is around 9000 documents. Will it scale to that size?

Each document is a PDF of roughly 500 words (before preprocessing). I am doing some simple preprocessing to remove stopwords and special characters from the text. Do you think any additional preprocessing, such as lemmatization, would help?
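
For reference, the simple preprocessing described above (lowercasing, stripping special characters, dropping stopwords) could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name `SimplePreprocessor`, the method `preprocess`, and the tiny stopword list are hypothetical and are not part of S-Space or of the code actually used in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SimplePreprocessor {
    // Tiny illustrative stopword list (an assumption; a real list would be much longer).
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    /** Lowercases the text, replaces non-alphanumeric characters with spaces, and drops stopwords. */
    static List<String> preprocess(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ");
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            // Splitting an empty string yields one empty token, so skip empties explicitly.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [size, vector, 300]
        System.out.println(preprocess("The size of the vector, is 300!"));
    }
}
```

Note that the character class keeps only ASCII letters and digits, so accented or non-Latin text would need a different pattern.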

DeepthiKarnam commented 8 years ago

I tried using sspace-2.0.1.jar, but the problem persists :'(

Feb 18, 2016 12:35:44 PM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Feb 18, 2016 12:35:44 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Feb 18, 2016 12:35:44 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Feb 18, 2016 12:35:44 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Feb 18, 2016 12:35:44 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at edu.ucla.sspace.matrix.DiagonalMatrix.checkIndices(DiagonalMatrix.java:78)
    at edu.ucla.sspace.matrix.DiagonalMatrix.get(DiagonalMatrix.java:85)
    at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibJ.factorize(SingularValueDecompositionLibJ.java:89)
    at edu.ucla.sspace.lsa.LatentSemanticAnalysis.processSpace(LatentSemanticAnalysis.java:360)
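
One possible reading of this trace, which nothing in the thread confirms: a matrix with m rows and n columns has at most min(m, n) singular values. A term-document matrix built from 6000+ terms and only 200 documents therefore cannot supply 300 dimensions, and reading entry 200 or higher from the diagonal matrix of singular values would plausibly raise this kind of exception. The sketch below shows a pre-check that clamps the requested dimensionality. The class and method names are hypothetical and are not part of the S-Space API.

```java
public class SvdRankCheck {
    /** Largest number of singular values a rows x cols matrix can have. */
    static int maxDimensions(int rows, int cols) {
        return Math.min(rows, cols);
    }

    /** Clamps the requested dimensionality to what the matrix can supply. */
    static int safeDimensions(int requested, int rows, int cols) {
        if (requested <= 0) {
            throw new IllegalArgumentException("requested dimensions must be positive");
        }
        return Math.min(requested, maxDimensions(rows, cols));
    }

    public static void main(String[] args) {
        int terms = 6000;      // approximate vocabulary size reported in the thread
        int documents = 200;   // sample size reported in the thread
        int requested = 300;   // dimensionality shown in the log output
        // Prints 200
        System.out.println(safeDimensions(requested, terms, documents));
    }
}
```

If this reading is right, either requesting fewer than 200 dimensions or running on the full corpus of around 9000 documents would avoid the out-of-range access.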