dileepajayakody / semanticvectors

Automatically exported from code.google.com/p/semanticvectors

pitt.search.semanticvectors.BuildIndex encounters a null pointer exception in main() #19

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1.  After running the Lucene demo command to build an index (in directory ./index), I run pitt.search.semanticvectors.BuildIndex, but I get a null pointer exception:

$>java -cp "e:\text\lucene-2.9.0\lucene-core-2.9.0.jar;e:\text\lucene-2.9.0\lucene-demos-2.9.0.jar;e:\text\semanticvectors\semanticvectors-1.24.jar" pitt.search.semanticvectors.BuildIndex index
Seedlength = 10
Dimension = 200
Minimum frequency = 0
Number non-alphabet characters = 0
Contents fields are: [contents]
Creating semantic term vectors ...
Populating basic sparse doc vector store, number of vectors: 6
Creating store of sparse vectors  ...
Created 6 sparse random vectors.
Creating term vectors ...
There are 1522 terms (and 6 docs)
0 ... Exception in thread "main" java.lang.NullPointerException
        at org.apache.lucene.index.DirectoryReader$MultiTermDocs.freq(DirectoryReader.java:1068)
        at pitt.search.semanticvectors.LuceneUtils.getGlobalTermFreq(LuceneUtils.java:70)
        at pitt.search.semanticvectors.LuceneUtils.termFilter(LuceneUtils.java:187)
        at pitt.search.semanticvectors.TermVectorsFromLucene.<init>(TermVectorsFromLucene.java:163)
        at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:138)

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?

=> product version  = 1.24
=> operating system = Windows 7
=> java version     = 1.6.0_14

Please provide any additional information below.

=> These results are from running under cygwin, but I get the identical responses if I bring up a DOS, er, Windows command line and run the commands.

=> I have tried all variations on passing the directory path to BuildIndex.main (index, ./index, ".\\index", the full path) and I get the same result.

=> The steps leading up to this are:

(A) I have 6 small text files in a subdirectory named "chunked", and I run the Lucene demo commands to index them; the command and its output look like this:
$>java -cp "e:\text\lucene-2.9.0\lucene-core-2.9.0.jar;e:\text\lucene-2.9.0\lucene-demos-2.9.0.jar" org.apache.lucene.demo.IndexFiles chunked
Indexing to directory 'index'...
adding chunked\GTG-Wailea.1.txt
adding chunked\GTG-Wailea.2.txt
adding chunked\GTG-Wailea.3.txt
adding chunked\GTG-Wailea.4.txt
adding chunked\HomeCookingStar.txt
adding chunked\Shaka.txt
Optimizing...
606 total milliseconds

(B) And if I look in the "index" directory, I see this, which looks normal:

$>ls index
_0.cfs*  _0.cfx*  segments.gen*  segments_2*

(C) For verification that the Lucene processing went OK, I can search for the word "golf", which I already know appears in exactly 5 of the documents:

$>java -cp "e:\text\lucene-2.9.0\lucene-core-2.9.0.jar;e:\text\lucene-2.9.0\lucene-demos-2.9.0.jar" org.apache.lucene.demo.SearchFiles
Enter query:
golf
Searching for: golf
5 total matching documents
1. chunked\GTG-Wailea.2.txt
2. chunked\GTG-Wailea.1.txt
3. chunked\GTG-Wailea.4.txt
4. chunked\GTG-Wailea.3.txt
5. chunked\Shaka.txt
Press (q)uit or enter number to jump to a page.

Original issue reported on code.google.com by seanlan...@gmail.com on 9 Nov 2009 at 12:09

GoogleCodeExporter commented 9 years ago
That's strange. Looks like getGlobalTermFreq is getting a null result for TermDocs tDocs = this.indexReader.termDocs(term). I haven't succeeded in reproducing this error, leaving the unsatisfactory conclusion that "there's something weird about your data", which isn't a very helpful thing to say.

I've submitted a small workaround that checks for a null pointer here and returns "1" if we can't get a better term weight. Could you please check if this works for you? If compiling from source is a hassle, let me know and I'll send you a jar file.

The downside of this is that it won't fix the problem of there being no available term weights for document vector creation. It'll just set all such weights to 1.
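The workaround described above amounts to a null check with a neutral fallback. A minimal sketch of that idea (paraphrased, not the project's actual code; `TermDocsLike` is a hypothetical stand-in for Lucene 2.9's `TermDocs` so the snippet compiles without an index behind it):

```java
// Hypothetical stand-in for Lucene 2.9's TermDocs iterator,
// so this sketch is self-contained.
interface TermDocsLike {
    boolean next(); // advance to the next matching document; false when done
    int freq();     // term frequency within the current document
}

public class TermFreqWorkaround {
    // Rough sketch of the patched getGlobalTermFreq: if the reader
    // gives us nothing back, fall back to a weight of 1 rather than crash.
    public static int getGlobalTermFreq(TermDocsLike tDocs) {
        if (tDocs == null) {
            return 1; // no usable term weight available; neutral fallback
        }
        int tf = 0;
        while (tDocs.next()) { // advance before reading freq()
            tf += tDocs.freq();
        }
        return tf;
    }
}
```

As the comment notes, this trades a crash for flat weights of 1; it does not recover the missing frequency information.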

If we want to track this down further, we should find out which terms are causing problems. I could help with this if you want to send me copies of the documents in question.

Thanks for the detailed report.
-Dominic

Original comment by dwidd...@gmail.com on 9 Nov 2009 at 7:06

GoogleCodeExporter commented 9 years ago
That was fast -- thanks for the quick feedback!  I will be able to get to look at this sometime tomorrow -- there have been too many brush fires already today on my day job :(

If there is something unusual about the data, that will be very good for me to know, because these are typical of the text streams I expect to gather, and I need to know how to clean them up.

And I will go ahead and set up my build environment and compile the sources -- I wanted to get to that point anyway.

So I will try it out as soon as I get a chance, and I will let you know what I see!

Thanks again --

-- Sean

Original comment by seanlan...@gmail.com on 9 Nov 2009 at 9:11

GoogleCodeExporter commented 9 years ago
Got it!

Thanks for the pointer, Dominic, on where to go looking -- I was able to track down the null pointer very quickly (printf debugging rulz!).

Actually, it is a somewhat misleading Exception out of Lucene.  The problem, it turns out, isn't line 69 in LuceneUtils.java: "TermDocs tDocs = this.indexReader.termDocs(term);", but rather line 70: "tf = tDocs.freq();".  And yes, the symptom is the null pointer exception that gets bubbled up, but the real cause appears to be the call to TermDocs.freq(), which isn't valid until TermDocs.next() is called for the first time.

So I have commented out that line of code (line 70), and now LuceneUtils compiles and runs just fine.
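The iterator contract Sean describes can be illustrated with a toy model (`FakeTermDocs` is a made-up stand-in, not the Lucene class, since the real `TermDocs` needs an index behind it): `freq()` is only meaningful once `next()` has positioned the iterator on a document.

```java
// Toy model of the TermDocs contract described above: freq() is only
// defined once next() has positioned the iterator on a document.
class FakeTermDocs {
    private final int[] freqs;
    private int pos = -1;

    FakeTermDocs(int... freqs) { this.freqs = freqs; }

    boolean next() { return ++pos < freqs.length; }

    int freq() {
        if (pos < 0) {
            // Mirrors the misleading failure mode in the report: reading
            // freq() before the first next() blows up inside the reader.
            throw new NullPointerException("freq() called before next()");
        }
        return freqs[pos];
    }
}

public class TermDocsContract {
    // Correct usage: advance first, then read the frequency.
    public static int sumFreqs(FakeTermDocs tDocs) {
        int tf = 0;
        while (tDocs.next()) {
            tf += tDocs.freq();
        }
        return tf;
    }
}
```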

Thanks for your help!

Original comment by seanlan...@gmail.com on 10 Nov 2009 at 12:47

GoogleCodeExporter commented 9 years ago
Golly, that's strange.

I'm not surprised by the NPE behaviour; one of the strange things with Java is that if you call a method that's supposed to return an object (or really a pointer to that object on the heap), you won't get an exception if that method returns null -- instead you'll get an NPE the first time you try to use the object for something.
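Dominic's point in miniature (`lookup` and `blowsUpOnUse` are invented names for illustration): the call that returns null succeeds silently, and the NPE only surfaces at the later dereference, which is why a stack trace can point at the use site rather than at the method that produced the null.

```java
public class DeferredNpe {
    // A method that "successfully" returns null: the call itself raises nothing.
    static String lookup(boolean found) {
        return found ? "hit" : null;
    }

    // Returns true if dereferencing the result threw a NullPointerException.
    static boolean blowsUpOnUse(boolean found) {
        String s = lookup(found); // no exception here, even when s is null
        try {
            s.length();           // the NPE surfaces only at first use
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```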

What I am surprised at is the behaviour you're seeing with needing to call TermDocs.next() before TermDocs.freq(). This is something I'd like to track down if possible at some point; I don't know why we haven't seen this problem before.

I guess there's no excuse by now for not having unit tests for the LuceneUtils class. It's great that you got to the root of your problem with printf debugging, I agree that it rules and rocks, but the fact that you have to do it at all with SemanticVectors is irksome!

OK, for now I'm marking this issue as "Fixed" (developer has submitted fixing code) but not "Verified" (QA / testing should still verify further). Does that sound reasonable to you?

Best wishes,
Dominic

Original comment by dwidd...@gmail.com on 10 Nov 2009 at 1:17

GoogleCodeExporter commented 9 years ago
Perfect!  Fixed but not Verified sounds just right.

Yeah, it seems to me, too, that since one of the main features of the TermDocs instance is that it can iterate terms, the iteration code ought to be ready to, well, iterate.  But I guess it is what it is...

Anyway, thanks again for all the help!

Original comment by seanlan...@gmail.com on 10 Nov 2009 at 9:58