That's strange. It looks like getGlobalTermFreq is getting a null result for "TermDocs tDocs = this.indexReader.termDocs(term)". I haven't succeeded in reproducing this error, which leaves the unsatisfactory conclusion that "there's something weird about your data" -- not a very helpful thing to say.

I've submitted a small workaround that checks for the null pointer here and returns 1 if we can't get a better term weight. Could you please check whether this works for you?
If compiling from source is a hassle, let me know and I'll send you a jar file.
The downside of this is that it won't fix the problem of there being no available term weights for document vector creation; it'll just set all such weights to 1.
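For the record, the guard amounts to something like this (a sketch of the idea only; the actual committed code may differ in detail):

    TermDocs tDocs = this.indexReader.termDocs(term);  // line 69 in LuceneUtils.java
    if (tDocs == null) {
      return 1;  // can't get a better term weight, so default to 1
    }
    tf = tDocs.freq();  // line 70, unchanged by this workaround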
If we want to track this down further, we should find out which terms are causing problems. I could help with this if you want to send me copies of the documents in question.
Thanks for the detailed report.
-Dominic
Original comment by dwidd...@gmail.com on 9 Nov 2009 at 7:06
That was fast -- thanks for the quick feedback! I will be able to look at this sometime tomorrow -- there have been too many brush fires already today at my day job :(

If there is something unusual about the data, that will be very good for me to know, because these are typical of the text streams I expect to gather, and I need to know how to clean them up.
And I will go ahead and set up my build environment and compile the sources -- I wanted to get to that point anyway.

So I will try it out as soon as I get a chance, and I will let you know what I see!
Thanks again --
-- Sean
Original comment by seanlan...@gmail.com on 9 Nov 2009 at 9:11
Got it!
Thanks for the pointer, Dominic, on where to go looking -- I was able to track down the null pointer very quickly (printf debugging rulz!).

Actually, it is a somewhat misleading exception out of Lucene. The problem, it turns out, isn't line 69 in LuceneUtils.java, "TermDocs tDocs = this.indexReader.termDocs(term);", but rather line 70, "tf = tDocs.freq();". And yes, the symptom is the null pointer exception that gets bubbled up, but the real cause appears to be the call to TermDocs.freq(), which isn't valid until TermDocs.next() has been called for the first time.

So I have commented out that line of code (line 70), and now LuceneUtils compiles and runs just fine.
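For completeness, the usual pattern with this generation of the Lucene API looks something like the following (my sketch, not code from the repository; it assumes an open IndexReader):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // freq() is only defined once next() has positioned the cursor on a
    // posting, so advance the enumeration before reading the frequency.
    TermDocs tDocs = indexReader.termDocs(term);
    int tf = 0;
    while (tDocs.next()) {
      tf += tDocs.freq();  // safe here: next() returned true
    }
    tDocs.close();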
Thanks for your help!
Original comment by seanlan...@gmail.com on 10 Nov 2009 at 12:47
Golly, that's strange.
I'm not surprised by the NPE behaviour: one of the strange things with Java is that if you call a method that's supposed to return an object (or really a pointer to that object on the heap), you won't get an exception if that method returns null; instead, you'll get an NPE the first time you try to use the object for something.
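In other words (a contrived two-liner, not the actual LuceneUtils code):

    TermDocs tDocs = indexReader.termDocs(term);  // no exception here, even if this returns null
    int tf = tDocs.freq();                        // the NPE only surfaces here, on first use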
What I am surprised at is the behaviour you're seeing with needing to call TermDocs.next() before TermDocs.freq(). This is something I'd like to track down at some point; I don't know why we haven't seen this problem before.
I guess there's no excuse by now for not having unit tests for the LuceneUtils class.
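(As a starting point, something like the following would pin down the TermDocs behaviour directly -- a sketch against the Lucene API of that era, not existing test code, and the class and method names here are mine:)

    import junit.framework.TestCase;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class TermDocsFreqTest extends TestCase {
      public void testFreqRequiresNext() throws Exception {
        // Build a one-document index in memory.
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents", "hello hello world",
                          Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // freq() is only meaningful after next() has advanced the enumeration.
        IndexReader reader = IndexReader.open(dir);
        TermDocs tDocs = reader.termDocs(new Term("contents", "hello"));
        assertTrue(tDocs.next());
        assertEquals(2, tDocs.freq());  // "hello" occurs twice in the document
        tDocs.close();
        reader.close();
      }
    }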
It's great that you got to the root of your problem with printf debugging -- I agree that it rules and rocks -- but the fact that you have to do it at all with SemanticVectors is irksome!
OK, for now I'm marking this issue as "Fixed" (developer has submitted fixing code) but not "Verified" (QA / testing should still verify further). Does that sound reasonable to you?
Best wishes,
Dominic
Original comment by dwidd...@gmail.com on 10 Nov 2009 at 1:17
Perfect! Fixed but not Verified sounds just right.
Yeah, it seems to me, too, that since one of the main features of a TermDocs instance is iteration (over the documents containing a term), the iteration code ought to be ready to, well, iterate. But I guess it is what it is...
Anyway, thanks again for all the help!
Original comment by seanlan...@gmail.com on 10 Nov 2009 at 9:58
Original issue reported on code.google.com by seanlan...@gmail.com on 9 Nov 2009 at 12:09