The query above shouldn't work, since there are no queryterms after the command-line flags. It might be that you're mistaking the -q argument for queryterms rather than queryfile.
Try java pitt.search.semanticvectors.Search -q termvectors.bin -l indexed_docs test
I do note that the conditional you mention might break if the last argument is
-textindex, since this doesn't have a value after it. This is a bug.
Original comment by dwidd...@gmail.com
on 5 Jun 2008 at 6:01
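The -textindex problem mentioned above comes down to a parser that assumes every flag is followed by a value. Here is a minimal sketch of one way to handle both kinds of flag. The class FlagParser and its parse method are hypothetical names used for illustration and are not taken from the SemanticVectors source. Treating every flag other than -textindex as taking a value is also an assumption.

```java
import java.util.Map;

// Hypothetical sketch; not the actual SemanticVectors argument parser.
final class FlagParser {
    /**
     * Reads leading flags from args into flags and returns the index of the
     * first non-flag argument (the first query term, if any).
     */
    static int parse(String[] args, Map<String, String> flags) {
        int i = 0;
        while (i < args.length && args[i].startsWith("-")) {
            if (args[i].equals("-textindex")) {
                // Valueless flag: consumes one argument only.
                flags.put(args[i], "true");
                i += 1;
            } else {
                // Assumed to be a valued flag: consumes two arguments.
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Missing value for " + args[i]);
                }
                flags.put(args[i], args[i + 1]);
                i += 2;
            }
        }
        return i;
    }
}
```

With the arguments from the command suggested above, parse would stop at "test" and leave it as the single query term. A trailing -textindex would be consumed without reading past the end of the array.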
Thanks so much for your reply. I apologize for not reading the documentation properly!
The query does work, but it doesn't return any results. When I search for the same term using Lucene it does return a document. Maybe I'm still running the wrong query?
Thanks for your help!
Original comment by saleandr...@gmail.com
on 6 Jun 2008 at 9:28
Sorry I didn't get back to you sooner on this. If you're using the test Bible corpus and default settings, I think (unfortunately) the term "test" doesn't make the frequency cutoff. Try the term "abraham" instead, and let me know if this works!
Original comment by dwidd...@gmail.com
on 9 Jun 2008 at 2:34
Thanks for your reply!
I'm actually using my own corpus. When searching with Lucene I do find the term, but not with SemanticVectors. Is that expected?
Also, I exported the term and document vectors as text, and for two documents in my corpus the values for all dimensions are NaN. I'll look into how these documents differ from the others and how the vectors are created, but maybe this has happened to someone else before? Thanks again.
Original comment by saleandr...@gmail.com
on 9 Jun 2008 at 2:38
Just to get back on the NaN problem: I'm only running some tests, so I'm using a very small corpus, and there were actually two documents that didn't have any terms matching the termvector terms. So the division in the VectorUtils.getNormalizedVector method was returning NaN. Hope that helps, and thanks for your attention!
Original comment by saleandr...@gmail.com
on 10 Jun 2008 at 4:49
It's common for a term to be indexed by Lucene but fall below the frequency cutoff used by SemanticVectors. If you add the flag "-m 0" to the BuildIndex command, all terms will be indexed.
I don't yet understand how your NaN problem arises (I'm afraid this highlights the lack of unit tests for this project so far). Could you send me the corpus so I can try to replicate the NaN problem?
Original comment by dwidd...@gmail.com
on 10 Jun 2008 at 5:17
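The frequency cutoff described above can be pictured as a simple filter over term counts. This is only a rough sketch of the idea, not the SemanticVectors or Lucene code: the class FrequencyCutoff, its filter method, and the inclusive comparison (a term is kept when its count is at least the cutoff) are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; not the actual SemanticVectors implementation.
final class FrequencyCutoff {
    /**
     * Returns the terms whose frequency is at least minFrequency.
     * The inclusive comparison is an assumption; the real cutoff may differ.
     */
    static Map<String, Integer> filter(Map<String, Integer> termFrequencies, int minFrequency) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : termFrequencies.entrySet()) {
            if (entry.getValue() >= minFrequency) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

Under this picture, a term that Lucene has indexed only a few times is dropped when the cutoff is 10, and every term is kept when the cutoff is 0. That matches the behaviour described for "-m 0".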
No problem, it is attached. They are just sample files, very few and very simple. I indexed using the default options (-m 10). Then I exported docvectors.bin as text, and I could see that for the document "gigs.txt" all values are NaN. I think this happened because, in the VectorUtils.getNormalizedVector method, the "norm" value is zero, so the new normalized vector is created by dividing by zero (tmpVec[i] = tmpVec[i]/norm;). Maybe this only happened because the corpus was so small?
Also, I'm not sure how docvectors.bin can be used to compare documents with each other. Is that possible? Thanks again!
Original comment by saleandr...@gmail.com
on 10 Jun 2008 at 7:41
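The failure described above can be reproduced in isolation. In Java floating-point arithmetic 0f / 0f evaluates to NaN rather than throwing, so normalizing an all-zero vector fills it with NaN. Below is a minimal sketch of a guarded normalization. The class SafeNormalize and its normalize method are hypothetical and are not the VectorUtils.getNormalizedVector source. Returning an all-zero vector for a zero norm is one possible policy, chosen here as an assumption.

```java
// Illustrative sketch only; not the actual VectorUtils code.
final class SafeNormalize {
    /** Returns a unit-length copy of vec, or an all-zero copy if vec has zero norm. */
    static float[] normalize(float[] vec) {
        double sumSquares = 0.0;
        for (float v : vec) {
            sumSquares += (double) v * v;
        }
        float norm = (float) Math.sqrt(sumSquares);
        float[] result = new float[vec.length]; // initialized to zeros
        if (norm == 0.0f) {
            // A document with no indexed terms has an all-zero vector.
            // Dividing by norm here would compute 0f / 0f, which is NaN.
            return result;
        }
        for (int i = 0; i < vec.length; i++) {
            result[i] = vec[i] / norm;
        }
        return result;
    }
}
```

With a guard like this, a document whose terms all fall below the cutoff would come out as a zero vector instead of a vector of NaN values. Whether such documents should instead be skipped or reported is a separate design choice.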
I think I should mark this as a "WontFix", since the thread dried up months ago (my fault) and we haven't seen recurrences of this problem.
Please contact me if we should still take action on this.
Original comment by dwidd...@gmail.com
on 26 Mar 2009 at 11:09
Original issue reported on code.google.com by
saleandr...@gmail.com
on 5 Jun 2008 at 5:27