fozziethebeat / S-Space

The S-Space repository, from the AIrhead-Research group
GNU General Public License v2.0

Exception using Matlab for LSA #26

Open N2D2 opened 11 years ago

N2D2 commented 11 years ago

Hi, if I pass the -S MATLAB option, Java throws:

IllegalArgumentException: dimensions must be positive
    at edu.ucla.sspace.matrix.OnDiskMatrix.<init>(OnDiskMatrix.java:98)
    at edu.ucla.sspace.matrix.Matrices.create(Matrices.java:216)
    at edu.ucla.sspace.matrix.MatrixIO.readDenseTextMatrix(MatrixIO.java:924)

In Matrices.create(Matrices.java:216) I find the lines:

case SPARSE_ON_DISK:
    //return new SparseOnDiskMatrix(rows, cols);
    // REMDINER: implement me
    return new OnDiskMatrix(rows, cols);

Is the MATLAB matrix format not implemented yet or is it a bug?

The output above the Exception suggests a general problem with reading the Matlab-Output:

Nov 09, 2012 11:21:52 AM edu.ucla.sspace.matrix.MatrixIO readDenseTextMatrix
FINE: reading in text matrix with 15262 rows and 0 cols
Nov 09, 2012 11:21:52 AM edu.ucla.sspace.matrix.MatrixIO readDenseTextMatrix
FINE: reading in text matrix with 100 rows and 0 cols

And Matlab gives the warning:

Warning: Imaginary part of complex variable 'U' not saved to ASCII file.
Warning: Imaginary part of complex variable 'V' not saved to ASCII file.

But the three Matlab output matrices seem normal.

I use Mac OS X 10.7, Matlab 2012a and the S-Space 2.0 code (but this happened with earlier code, too).

N2D2 commented 11 years ago

I found the solution: the java.util.Scanner class is used to import the Matlab files, and the way it parses numbers depends on the default locale of the Java environment. With an English (US) locale, the JRE interprets the numbers generated by Matlab (e.g. 7.4566000e+07) correctly. Under some non-US locales these strings aren't recognized as numbers, because the Scanner expects a comma instead of the decimal point: 7,4566000e+07. In future versions you might call useLocale(new Locale("en", "US")) on the Scanner object to overcome this issue.
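The locale dependence described above can be reproduced in isolation. The following is a minimal sketch and makes no claim about S-Space's own code: the class `LocaleScannerDemo` and its two helper methods are made up for illustration. It feeds the same Matlab-style token to a Scanner pinned to the US locale and to one pinned to a German locale.

```java
import java.util.Locale;
import java.util.Scanner;

// Hypothetical demo class; not part of S-Space.
class LocaleScannerDemo {

    // Returns true if a Scanner using the given locale recognizes the token as a double.
    static boolean canParse(String token, Locale locale) {
        try (Scanner s = new Scanner(token).useLocale(locale)) {
            return s.hasNextDouble();
        }
    }

    // Parses the token with a Scanner pinned to the US locale,
    // so the result does not depend on the JVM's default locale.
    static double parseUs(String token) {
        try (Scanner s = new Scanner(token).useLocale(Locale.US)) {
            return s.nextDouble();
        }
    }

    public static void main(String[] args) {
        String matlabNumber = "7.4566000e+07";
        System.out.println("US locale recognizes it:     " + canParse(matlabNumber, Locale.US));
        System.out.println("German locale recognizes it: " + canParse(matlabNumber, Locale.GERMANY));
        System.out.println("Value under the US locale:   " + parseUs(matlabNumber));
    }
}
```

Pinning the locale on each Scanner keeps the fix local to the parsing code, whereas changing the JVM-wide default locale also affects any other formatting the program does.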

fozziethebeat commented 11 years ago

Awesome find! If you want to send us a pull request, I'll be more than happy to merge this little fix :)

ganonp commented 11 years ago

I seem to be having a similar problem:

Mar 23, 2013 2:32:56 PM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Mar 23, 2013 2:32:56 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Mar 23, 2013 2:32:56 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Mar 23, 2013 2:32:56 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Mar 23, 2013 2:32:56 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.IllegalArgumentException: dimensions must be positive
    at edu.ucla.sspace.matrix.OnDiskMatrix.<init>(OnDiskMatrix.java:98)
    at edu.ucla.sspace.matrix.Matrices.create(Matrices.java:216)
    at edu.ucla.sspace.matrix.MatrixIO.readDenseTextMatrix(MatrixIO.java:924)
    at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:794)
    at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:761)
    at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionMatlab.factorize(SingularValueDecompositionMatlab.java:137)
    at edu.ucla.sspace.lsa.LatentSemanticAnalysis.processSpace(LatentSemanticAnalysis.java:439)
    at edu.ucla.sspace.mains.GenericMain.processDocumentsAndSpace(GenericMain.java:514)
    at edu.ucla.sspace.mains.GenericMain.run(GenericMain.java:443)
    at edu.ucla.sspace.mains.LSAMain.main(LSAMain.java:167)

I use Win7, MatLab2010a, and the current version of sspace (2.03).

ganonp commented 11 years ago

All three of the matlab output matrices remain empty. Does anyone know why this might be the case?

davidjurgens commented 11 years ago

I think we've tested with 2010a, but I don't think we saw this behavior. Are you using Windows as well? If possible, is there any way to give us a test corpus that reproduces the issue? Keith and I are finally getting around to pushing out some new changes, so we could get this fixed soon once we're able to reproduce the behavior.

Thanks, David


N2D2 commented 11 years ago

It might well be that the corpus is too small. If you do not have more documents than requested dimensions, Matlab cannot compute the SVD.

ganonp commented 11 years ago

Thank you for your help!

So the problem originally occurred with a rather large corpus (several thousand lines in a ~150 MB .txt file). Since then I've been using a smaller version and messing with the code in an attempt to remedy the problem.

I have actually taken the matrix file (via matrix.getAbsolutePath()) and run it through Matlab manually, using the same code found in the Matlab script in SingularValueDecompositionMatlab.java, and it has no problem producing any of the three uOutput, sOutput, or vOutput files correctly (at least they are not blank). I've never run a Java program that accessed Matlab, so it's possible there is some issue in my environment variables? Every time I run the program, Matlab opens up to a command window, no script is visibly executed, and the uOutput, sOutput, and vOutput files remain blank, so I believe it's an issue of actually getting Matlab to run the script...

davidjurgens commented 11 years ago

Wow, that really helps track down the issue! My guess is that there is an issue with how our code is trying to invoke Matlab from the command line on Win7. We never had a Win7 system to test the code on, so there's a good chance we might not be handling the command line call correctly. I'll see if I can track down a machine here at work and do some testing.

We've tried hard to make sure all the code is platform independent, so it's important to fix this kind of bug. :)

Thanks, David


ganonp commented 11 years ago

Hey, no problem! I'm thankful someone is putting this type of software out open source, it's going to be really helpful for some projects I'm working on. Let me know if there's anything I can do to help!

Ganon

N2D2 commented 11 years ago

Hi, I have looked at my code and it may be the same problem I described. The exception is thrown while reading the term-document matrix. At this point Matlab has not yet been used. If you use a non-US version of Java, you have to insert a line like this: java.util.Locale.setDefault(Locale.US); before the term-doc matrix is read in (before MatrixIO.readMatrix(File matrix, Format format, Type matrixType, boolean transposeOnRead)) or use my pull request https://github.com/fozziethebeat/S-Space/pull/31

davidjurgens commented 11 years ago

Whoops, I thought we had integrated that pull request! I'll make sure that gets integrated in the next release for certain. There are probably a few other Locale-based bugs, so I'll scan for others as well.


ganonp commented 11 years ago

See that was my initial thought too, which is why I posted here. I can't imagine why I wouldn't have a US-Version of Java, but I went ahead and changed the scanner objects in the readDenseTextMatrix method in the MatrixIO class to

"Scanner s = new Scanner(line).useLocale(new Locale("en", "US"));" around line 897

and

"Scanner scanner = new Scanner(matrix).useLocale(new Locale("en", "US"));" around line 928

Upon doing this I added:

"System.out.println("reading in text matrix with " + rows + " rows and " + cols + " cols");"

so I could evaluate whether it was actually counting anything. It was not, i.e. it was returning 0 rows and -1 cols.

From here I looked at what the actual file "matrix" was referring to, and it appears to be the sOutput and the uOutput from the SingularValueDecompositionMatlab class in the factorize method.

Now, these are the two files that are empty; however, they are not empty when I run the Matlab code manually using the exact same matrix file used in the factorize method. This is what leads me to believe that it's a problem with the interface between Matlab and Java/the command line. It is also these two files that the scanner is using to count rows and columns, unless I missed something.
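The symptom described here (an empty output file that yields 0 rows and -1 cols, followed by "dimensions must be positive") could be caught earlier with an explicit check. The following is only a sketch: `DenseTextMatrixCheck`, `countDimensions`, and `countDimensionsInFile` are hypothetical names, not S-Space APIs, and the counting logic is a simplified stand-in for whatever readDenseTextMatrix actually does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper; not part of S-Space.
class DenseTextMatrixCheck {

    // Counts rows and columns of a whitespace-separated dense text matrix.
    // Returns {rows, cols}; rejects input that holds no numeric data.
    static int[] countDimensions(List<String> lines) {
        int rows = 0;
        int cols = -1;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            int valuesInRow = 0;
            // Pin the locale so "1.5e+00" is read the same way on every JVM.
            try (Scanner s = new Scanner(line).useLocale(Locale.US)) {
                while (s.hasNextDouble()) {
                    s.nextDouble();
                    valuesInRow++;
                }
            }
            if (cols == -1) {
                cols = valuesInRow;
            } else if (valuesInRow != cols) {
                throw new IllegalArgumentException("row " + (rows + 1) + " has "
                        + valuesInRow + " values, expected " + cols);
            }
            rows++;
        }
        if (rows == 0 || cols <= 0) {
            throw new IllegalArgumentException(
                    "no numeric data found; did the external SVD program write its output?");
        }
        return new int[] {rows, cols};
    }

    // Convenience wrapper that reads the lines from a file first.
    static int[] countDimensionsInFile(Path file) {
        try {
            return countDimensions(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] dims = countDimensions(Arrays.asList("1.0e+00 2.5e-01", "3.0e+00 4.0e+00"));
        System.out.println(dims[0] + " rows, " + dims[1] + " cols");
    }
}
```

A check like this turns the generic "dimensions must be positive" into a message that points at the empty file, which is the actual cause in this report.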

I've been trying to evaluate if there is another step where the termdocumentvectorspace variable (from the generictermdocumentvectorspace class) is used to determine columns and rows but I can't find one, but I'm also not that great at java :p

N2D2 commented 11 years ago

OK, that means the term-document matrix is read in correctly? Then that is not the issue. I remember it was a struggle to set the environment variables for Matlab. In the end I put an alias file for matlab in /usr/bin under Mac OS. Without this, I had the same problem. Have you added Matlab to the PATH environment variable? As I recall, Matlab started normally in the terminal, but a Java program in the same terminal did not find it until I added the Matlab alias. I may have done something more, e.g. set a path variable, but I do not remember my steps exactly.

ganonp commented 11 years ago

It appears to me that it is read in correctly, though as I said, I could be mistaken. I have set my path environment variable to C:\Program Files (x86)\MATLAB\R2010a Student\bin. I've done some googling on this and can't seem to find anything else.
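One way to check what the JVM itself sees is to walk its PATH value and look for the Matlab launcher. This is a diagnostic sketch under stated assumptions: the launcher is assumed to be named `matlab`, the `.exe`, `.bat`, and `.cmd` suffixes are a guess at how it might be named on Windows, and `PathProbe` and `findOnPath` are illustrative names.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical diagnostic; not part of S-Space.
class PathProbe {

    // Suffixes to try; the Windows ones are an assumption about the launcher's file name.
    static final List<String> SUFFIXES = Arrays.asList("", ".exe", ".bat", ".cmd");

    // Returns the first matching regular file in the given PATH-style string,
    // or null if no directory on the path contains one.
    static File findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return null;
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String suffix : SUFFIXES) {
                File candidate = new File(dir, name + suffix);
                if (candidate.isFile()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // This is the PATH the JVM inherited, which may differ from an interactive shell's.
        File hit = findOnPath("matlab", System.getenv("PATH"));
        System.out.println(hit == null
                ? "matlab not found on the JVM's PATH"
                : "found: " + hit.getAbsolutePath());
    }
}
```

Running this from the same terminal as the failing program shows whether the launcher is visible to Java at all, and if so which file it resolves to.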

N2D2 commented 11 years ago

1) "Every time I run the program, matlab opens up to a command window, no script is visibly implemented": does that mean Matlab opens a new window, or that the program drops into a Matlab environment? Either would be wrong. Matlab is started with -nodisplay, so you should see only the Matlab output. You can see the normal output at the bottom of this comment.

2) Did you check that the term-document matrix (in my output, the file matlab-sparse-matrix6106521009675059906.dat.matrix-transform2454813125364315471.dat) is created successfully? The file content looks like:

1 1 0.637382
2 1 0.526688
3 1 0.507444
4 1 0.491153
5 1 0.812426
6 1 0.373287
7 1 0.651494

3) I remember my main problem was with the PATH alias. Under Mac OS I had to point the alias at the command-line program. Under Mac OS all files are encapsulated in the .app bundle, so I had to look inside it and set the reference to /Applications/MATLAB_R2012a.app/bin/matlab (dipping into an app bundle is an unusual procedure). I don't know the Windows configuration of Matlab, but maybe you are pointing to the wrong part of Matlab, not the terminal program.

The following are things I changed that probably have nothing to do with your concrete problem, but may help in the future.

4) In the SingularValueDecompositionMatlab.java file, I added some options for Matlab, because the default values are not sufficient. I changed this line:

"[U, S, V] = svds(A, " + dimensions + " );\n" +

to

"opts.maxit = 2000;\n" +
"opts.tol = 1e-55;\n" +
"[U, S, V] = svds(A, " + dimensions + ",'L',opts);\n" +

5) Look at https://github.com/fozziethebeat/S-Space/issues/28 : If the matrix is too small, Matlab computes fewer singular values than the requested number of dimensions and the program runs into an exception. So I fill up the singular values with very small numbers. That is not the elegant way, but it does not change the results and makes the code stable (this is also the reason for the changes in 4). This line in SingularValueDecompositionMatlab.java is changed:

for (int s = 0; s < dimensions; ++s) singularValues[s] = S.get(s, s);

with:

double lastNotNull=0;
for (int s = 0; s < dimensions; ++s) {
    if(s < S.rows()) {
        singularValues[s] = S.get(s, s);
        if(singularValues[s] != 0.0) {
            lastNotNull = singularValues[s];
        }
    } else {
        lastNotNull = lastNotNull-(lastNotNull/61.0d);
        singularValues[s] = lastNotNull;
    }
}

61 was the average decrease for my matrices. Remember that these are the smallest and least relevant singular values. If this code is reached, you are asking for too many dimensions from too small a corpus!
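To make the effect of the padding concrete, here is the same loop lifted into a standalone method. This is a sketch only: `SingularValuePadding.pad` is a made-up name, and a plain array of computed values stands in for the diagonal of the matrix S.

```java
import java.util.Arrays;

// Hypothetical standalone version of the padding loop above; not part of S-Space.
class SingularValuePadding {

    // computed:   singular values actually returned by the SVD, largest first.
    // dimensions: number of singular values the caller asked for.
    static double[] pad(double[] computed, int dimensions) {
        double[] singularValues = new double[dimensions];
        double lastNotNull = 0;
        for (int s = 0; s < dimensions; ++s) {
            if (s < computed.length) {
                singularValues[s] = computed[s];
                if (singularValues[s] != 0.0) {
                    lastNotNull = singularValues[s];
                }
            } else {
                // Shrink the last non-zero value by 1/61 for each missing entry;
                // the factor 61 is the empirical value quoted in the comment above.
                lastNotNull = lastNotNull - (lastNotNull / 61.0d);
                singularValues[s] = lastNotNull;
            }
        }
        return singularValues;
    }

    public static void main(String[] args) {
        // Two computed values, four requested: the last two entries are padding.
        System.out.println(Arrays.toString(pad(new double[] {122.0, 61.0}, 4)));
    }
}
```

The padded entries keep the sequence strictly decreasing and non-zero, which avoids the exception but does not add information; they are placeholders rather than real singular values.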

6) See https://github.com/fozziethebeat/S-Space/issues/30 . In AbstractSVD I replaced line 103

dataClasses.set(r, c, U.get(r, c) * singularValues[c]);

with

dataClasses.set(r, c, U.get(r, c) * (1.0d/singularValues[c]));

and line 124

classFeatures.set(r, c, V.get(r, c) * singularValues[r]);

with

classFeatures.set(r, c, V.get(r, c) * (1.0d/singularValues[r]));

but at this point I am not sure that I am right. Maybe I do not understand the code completely, but only with these changes can I reproduce the LSI example from Landauer et al.

7) See https://github.com/fozziethebeat/S-Space/issues/27

Those are all my changes. With these adjustments you have a stable LSA implementation.

########### Here is the output of a successful run:

Mrz 26, 2013 11:04:27 AM edu.ucla.sspace.mains.LSAMain verbose
FINE: parsed document #74509 in 0,000 seconds
...
Mrz 26, 2013 11:04:27 AM edu.ucla.sspace.mains.LSAMain verbose
FINE: Processed all 74514 documents in 13,716 total seconds
Mrz 26, 2013 11:04:27 AM edu.ucla.sspace.matrix.MatlabSparseMatrixBuilder finish
FINE: Finished writing matrix in MATLAB_SPARSE format with 74449 columns
Mrz 26, 2013 11:04:27 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Mrz 26, 2013 11:04:27 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
FINE: stored term-document matrix in format MATLAB_SPARSE at /var/folders/zg/4q1zhr211175mbhhfbcn45lc0000gn/T/matlab-sparse-matrix6106521009675059906.dat
Mar 26, 2013 11:04:28 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Mar 26, 2013 11:04:36 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Mar 26, 2013 11:04:38 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Mar 26, 2013 11:04:47 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
FINE: transformed matrix to /var/folders/zg/4q1zhr211175mbhhfbcn45lc0000gn/T/matlab-sparse-matrix6106521009675059906.dat.matrix-transform2454813125364315471.dat
Mar 26, 2013 11:04:47 AM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 5 dimensions
Mar 26, 2013 11:04:47 AM edu.ucla.sspace.matrix.factorization.SingularValueDecompositionMatlab factorize
FINE: writing Matlab output to files:
/var/folders/zg/4q1zhr211175mbhhfbcn45lc0000gn/T/matlab-svds-U2486854453852387586.dat
/var/folders/zg/4q1zhr211175mbhhfbcn45lc0000gn/T/matlab-svds-S2720750310155449077.dat
/var/folders/zg/4q1zhr211175mbhhfbcn45lc0000gn/T/matlab-svds-V7780446078228956827.dat

Mar 26, 2013 11:04:47 AM edu.ucla.sspace.matrix.factorization.SingularValueDecompositionMatlab factorize
FINE: matlab -nodisplay -nosplash -nojvm
Mar 26, 2013 11:05:10 AM edu.ucla.sspace.matrix.factorization.SingularValueDecompositionMatlab factorize
FINE: Matlab svds output:
< M A T L A B (R) >
Copyright 1984-2012 The MathWorks, Inc.
R2012a (7.14.0.739) 64-bit (maci64)
February 9, 2012

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>> >> >> >> >> >> >> >> >> >> >> >> Matlab Finished >>

Mar 26, 2013 11:05:10 AM edu.ucla.sspace.matrix.factorization.SingularValueDecompositionMatlab factorize
FINE: Matlab svds exit status: 0
Mar 26, 2013 11:05:26 AM edu.ucla.sspace.matrix.MatrixIO readDenseTextMatrix
FINE: reading in text matrix with 68243 rows and 5 cols
Mar 26, 2013 11:05:29 AM edu.ucla.sspace.matrix.MatrixIO readDenseTextMatrix
FINE: reading in text matrix with 5 rows and 5 cols
Mar 26, 2013 11:05:43 AM edu.ucla.sspace.matrix.MatrixIO readDenseTextMatrix
FINE: reading in text matrix with 74449 rows and 5 cols
Mar 26, 2013 11:05:46 AM edu.ucla.sspace.mains.LSAMain verbose
FINE: processed space in 78.274 seconds
output File: m.out
Mar 26, 2013 11:05:46 AM edu.ucla.sspace.common.SemanticSpaceIO writeText
FINE: saving text S-Space with 68243 words with 5-dimensional vectors
Mar 26, 2013 11:05:48 AM edu.ucla.sspace.mains.LSAMain verbose
FINE: printed space in 1.921 seconds

jerrygaoLondon commented 9 years ago

I've encountered the same problem (IllegalArgumentException: dimensions must be positive) when running dimension reduction by using SingularValueDecompositionMatlab or SingularValueDecompositionOctave in Windows 7.

I have matlab installed. The problem is exactly the same as the one described by @ganonp . I can see "Every time I run the program, matlab opens up to a command window,". However, no data can be written into matlab-svds-VXXX.dat, matlab-svds-SXXX.dat and matlab-svds-UXXX.dat.

When I run the script (shown below) in Matlab, it works well. I suspect the issue is in getting the Matlab script to run correctly when it is written to Matlab's output stream (as the code at line 102 in SingularValueDecompositionMatlab.java does). I have tried versions 2.0.4 and 2.0.3 and neither of them works. Any ideas?

Z=load('C:\Users\jerry\AppData\Local\Temp\matlab-input3613993774554135994.dat','-ascii');
A = spconvert(Z);
clear Z;
[U, S, V] = svds(A, 3);
save C:\Users\jerry\AppData\Local\Temp\matlab-svds-U8365859529350808097.dat U -ASCII
save C:\Users\jerry\AppData\Local\Temp\matlab-svds-S1129545791491312334.dat S -ASCII
save C:\Users\jerry\AppData\Local\Temp\matlab-svds-V4284815455262646654.dat V -ASCII
fprintf('Matlab Finished\n');
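For reference, a script of this shape can be assembled in Java with plain string concatenation, in the same style as the svds line quoted earlier in this thread. This is a sketch, not the actual S-Space code: `MatlabScriptSketch.build` and its parameter names are hypothetical, and the file names in `main` are placeholders.

```java
// Hypothetical sketch of assembling the script text; not the S-Space implementation.
class MatlabScriptSketch {

    // input:            path of the sparse matrix file to load
    // uOut, sOut, vOut: paths the three result matrices are saved to
    // dimensions:       number of singular values requested from svds
    static String build(String input, String uOut, String sOut, String vOut, int dimensions) {
        return "Z=load('" + input + "','-ascii');\n"
             + "A = spconvert(Z);\n"
             + "clear Z;\n"
             + "[U, S, V] = svds(A, " + dimensions + ");\n"
             + "save " + uOut + " U -ASCII\n"
             + "save " + sOut + " S -ASCII\n"
             + "save " + vOut + " V -ASCII\n"
             // "\\n" keeps the backslash-n literal so Matlab, not Java, interprets it.
             + "fprintf('Matlab Finished\\n');\n";
    }

    public static void main(String[] args) {
        // Print the script so it can be pasted into Matlab and compared with a manual run.
        System.out.print(build("in.dat", "U.dat", "S.dat", "V.dat", 3));
    }
}
```

Printing the generated text makes it easy to confirm that the script handed to Matlab is identical to the one that works when run by hand, which narrows the problem down to how the script is delivered to the process.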