Open DeepthiKarnam opened 8 years ago
Hi Deepthi,
Which version of the code are you using? It looks like the stack trace you have is using SvdlibJ, which we haven't supported for some time (their implementation is known to have errors in its SVD results). The latest code should definitely support reducing from 6000 dimensions to 300. How many documents are in your corpus?
Thanks, David
On Wed, Feb 17, 2016 at 10:26 PM, DeepthiKarnam notifications@github.com wrote:
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at edu.ucla.sspace.matrix.DiagonalMatrix.checkIndices(DiagonalMatrix.java:78)
at edu.ucla.sspace.matrix.DiagonalMatrix.get(DiagonalMatrix.java:94)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibJ.factorize(SingularValueDecompositionLibJ.java:89)
The number of words in my corpus turns out to be 6000+. Is the code unable to reduce the vector size from 6000+ to 300? What is the solution?
— Reply to this email directly or view it on GitHub https://github.com/fozziethebeat/S-Space/issues/70.
Hi David, thanks for your prompt reply. I have a few more follow-up questions.
I am using the jar "sspace-wordsi-2.0-jar-with-dependencies.jar". Is this not supported?
Currently, I am running on a sample of 200 documents. However, the full corpus is around 9000 documents. Will this scale?
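One possible cause worth checking: the SVD of a terms-by-documents matrix can yield at most min(#terms, #documents) singular values, so a 6000-term by 200-document sample supports at most 200 reduced dimensions. Requesting 300 could be exactly what trips the out-of-bounds access in DiagonalMatrix. A minimal sketch of that bound (plain Java, not the S-Space API):

```java
// Upper bound on the number of LSA dimensions an SVD can return:
// an m x n matrix has at most min(m, n) nonzero singular values.
public class RankBound {
    /** Maximum number of reduced dimensions for a terms x documents matrix. */
    static int maxDimensions(int terms, int documents) {
        return Math.min(terms, documents);
    }

    public static void main(String[] args) {
        int terms = 6000;
        int sampleDocs = 200;  // current sample
        int fullDocs = 9000;   // full corpus
        int requested = 300;

        System.out.println("sample bound: " + maxDimensions(terms, sampleDocs));
        System.out.println("full bound:   " + maxDimensions(terms, fullDocs));
        if (requested > maxDimensions(terms, sampleDocs)) {
            System.out.println("requested dimensions exceed what the sample can support");
        }
    }
}
```

On the full 9000-document corpus the bound would be 6000, so 300 dimensions would be well within range.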
Each document is a PDF with roughly 500 words (before preprocessing). I am doing simple preprocessing to remove stopwords and special characters from the text. Do you think any additional preprocessing, such as lemmatization, would help?
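For reference, the cleanup step is roughly the following sketch in plain Java (the stopword list here is a tiny illustrative sample, not a complete one):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Minimal preprocessing: lowercase, strip non-letter characters, drop stopwords.
public class Preprocess {
    static String clean(String text, Set<String> stopwords) {
        return Arrays.stream(text.toLowerCase()
                                 .replaceAll("[^a-z\\s]", " ") // special chars -> space
                                 .trim()
                                 .split("\\s+"))
                     .filter(tok -> !tok.isEmpty() && !stopwords.contains(tok))
                     .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Illustrative stopword sample only.
        Set<String> stopwords = Set.of("the", "a", "an", "of", "and");
        System.out.println(clean("The quick, brown fox & the lazy dog!", stopwords));
        // -> quick brown fox lazy dog
    }
}
```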
Tried using sspace-2.0.1.jar. The problem persists :'(
Feb 18, 2016 12:35:44 PM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Feb 18, 2016 12:35:44 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at edu.ucla.sspace.matrix.DiagonalMatrix.checkIndices(DiagonalMatrix.java:78)
at edu.ucla.sspace.matrix.DiagonalMatrix.get(DiagonalMatrix.java:94)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibJ.factorize(SingularValueDecompositionLibJ.java:89)