fozziethebeat / S-Space

The S-Space repository, from the AIrhead-Research group
GNU General Public License v2.0

SVDLIBC generated the incorrect number of dimensions: 3 versus 300 #58

Closed: kelvinAI closed this issue 9 years ago

kelvinAI commented 9 years ago

Hi, I'm getting the above error when running LSAMain with the following arguments: `-d data/input2.txt data/output/my_lsa_output.sspace`
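For context, that corresponds to a full invocation roughly like the following (the jar name and classpath here are placeholders, not my exact setup; the main class is the one from the stack trace below):

```
java -cp sspace.jar edu.ucla.sspace.mains.LSAMain -d data/input2.txt data/output/my_lsa_output.sspace
```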

input2.txt is just a very simple text file (for testing) and it contains: The man walked the dog. The man took the dog to the park. The dog went to the park.

System output:

```
Saving matrix using edu.ucla.sspace.matrix.SvdlibcSparseBinaryMatrixBuilder@5e2de80c
Saw 8 terms, 7 unique
Saw 5 terms, 5 unique
Saw 6 terms, 6 unique
edu.ucla.sspace.lsa.LatentSemanticAnalysis@406a31db processing doc edu.ucla.sspace.util.SparseIntHashArray@2fae8f9
edu.ucla.sspace.lsa.LatentSemanticAnalysis@406a31db processing doc edu.ucla.sspace.util.SparseIntHashArray@3553305b
edu.ucla.sspace.lsa.LatentSemanticAnalysis@406a31db processing doc edu.ucla.sspace.util.SparseIntHashArray@390b4f54
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.RuntimeException: SVDLIBC generated the incorrect number of dimensions: 3 versus 300
    at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibC.readSVDLIBCsingularVector(SingularValueDecompositionLibC.java:198)
    at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibC.factorize(SingularValueDecompositionLibC.java:161)
    at edu.ucla.sspace.lsa.LatentSemanticAnalysis.processSpace(LatentSemanticAnalysis.java:463)
    at edu.ucla.sspace.mains.GenericMain.processDocumentsAndSpace(GenericMain.java:514)
    at edu.ucla.sspace.mains.GenericMain.run(GenericMain.java:443)
    at edu.ucla.sspace.mains.LSAMain.main(LSAMain.java:167)
```

FYI, the environment is 64-bit Windows 7, with svdlibc compiled under Cygwin. Is this issue caused by the input file? I've tried using a wiki dump corpus, but the issue still exists. Any help is greatly appreciated.

Thank You

davidjurgens commented 9 years ago

I think the issue is that the input is only three documents but the command is trying to reduce the dimensionality to 300, which isn't possible: there's not enough data, since the SVD of a term-document matrix can produce at most as many dimensions as there are documents. If you either reduce to two dimensions or increase the number of terms/documents in the input corpus, the command should work.
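To make that concrete: an m × n term-document matrix has rank at most min(m, n), so with n = 3 documents SVDLIBC can return at most 3 singular vectors, which is exactly what the error reports ("3 versus 300"). A minimal sketch of that bound (a hypothetical standalone check, not S-Space code):

```java
// Minimal sketch (hypothetical helper, not S-Space code): why three
// documents cap the SVD output at three dimensions.
public class DimensionCheck {

    /**
     * An m x n term-document matrix has rank at most min(m, n), so an
     * SVD of it can yield at most that many singular vectors/dimensions.
     */
    static int maxUsefulDimensions(int numTerms, int numDocs) {
        return Math.min(numTerms, numDocs);
    }

    public static void main(String[] args) {
        int terms = 8;       // roughly the unique terms in the toy corpus
        int docs = 3;        // the three example sentences
        int requested = 300; // the dimensionality LSAMain tried to use
        int max = maxUsefulDimensions(terms, docs);
        if (requested > max) {
            System.out.printf(
                "Requested %d dimensions, but a %dx%d term-document "
                + "matrix supports at most %d%n",
                requested, terms, docs, max);
        }
    }
}
```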

Thanks, David

kelvinAI commented 9 years ago

Reducing the number of dimensions to 2 solved the issue for this small input corpus. Thank you.
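For anyone hitting the same error: the working run passed the smaller dimension count via LSAMain's dimensions option. The flag spelling below is from memory of the standard LSAMain options, so verify it against `--help` on your build:

```
java -cp sspace.jar edu.ucla.sspace.mains.LSAMain --dimensions=2 -d data/input2.txt data/output/my_lsa_output.sspace
```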