crack521 / semanticvectors

Automatically exported from code.google.com/p/semanticvectors

Possible bug on command line Search class #5

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run a search using the command line Search class:
"java pitt.search.semanticvectors.Search -q test -s termvectors.bin -l indexed_docs"

What is the expected output? What do you see instead?
On a previous version, I could do the search correctly. On the latest
version (1.10 and latest svn version), the usage() information text is
always displayed.

What version of the product are you using? On what operating system?
Ubuntu Hardy. Semanticvectors 1.10 and latest svn version.

Please provide any additional information below.
I'm a very recent user of both lucene and the semanticvectors package, so
forgive me if I'm too naive :). 
Using a previous version (before 1.10), I managed to successfully search
the term vectors. However, when I downloaded the latest version, I always
get the usage() information every time I run the
pitt.search.semanticvectors.Search class.
I notice that on line 135 of this file there is an "if" expression that was
recently added. Shouldn't this be outside the while loop and the while loop
look something like this: "while (argc < args.length &&
args[argc].substring(0, 1).equals("-"))" ?

Original issue reported on code.google.com by saleandr...@gmail.com on 5 Jun 2008 at 5:27

GoogleCodeExporter commented 9 years ago
The query above shouldn't work, since there are no queryterms after the
command line flags. It might be that you're mistaking the -q argument for
queryterms rather than queryfile.

Try:
java pitt.search.semanticvectors.Search -q termvectors.bin -l indexed_docs test

I do note that the conditional you mention might break if the last argument is
-textindex, since this doesn't have a value after it. This is a bug.
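The parsing behaviour discussed above can be sketched in isolation. This is a minimal illustration, not the project's actual Search code: the class name FlagParseSketch and its Parsed holder are hypothetical, and it assumes that every flag except -textindex takes exactly one value. It shows why the reported command leaves no queryterms, and how a trailing valueless flag can be handled without reading past the end of the array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the real pitt.search.semanticvectors.Search.
class FlagParseSketch {
    /** Holds the parsed flags and the query terms that follow them. */
    static final class Parsed {
        final Map<String, String> flags = new HashMap<>();
        final List<String> queryTerms = new ArrayList<>();
    }

    static Parsed parse(String[] args) {
        Parsed parsed = new Parsed();
        int argc = 0;
        // The bounds check comes first, so args[argc] is never read past the end.
        while (argc < args.length && args[argc].startsWith("-")) {
            String flag = args[argc];
            if (flag.equals("-textindex")) {
                // Valueless flag: consume only the flag itself.
                parsed.flags.put(flag, "true");
                argc += 1;
            } else {
                // Assumption: every other flag is followed by exactly one value.
                if (argc + 1 >= args.length) {
                    throw new IllegalArgumentException("Flag " + flag + " needs a value");
                }
                parsed.flags.put(flag, args[argc + 1]);
                argc += 2;
            }
        }
        // Everything after the flags is treated as query terms.
        parsed.queryTerms.addAll(Arrays.asList(args).subList(argc, args.length));
        return parsed;
    }

    public static void main(String[] args) {
        String[] reported = {"-q", "test", "-s", "termvectors.bin", "-l", "indexed_docs"};
        String[] suggested = {"-q", "termvectors.bin", "-l", "indexed_docs", "test"};
        System.out.println(parse(reported).queryTerms);  // prints []
        System.out.println(parse(suggested).queryTerms); // prints [test]
    }
}
```

With the reported command, every argument is consumed as a flag or a flag value, so the list of query terms is empty, which is the situation in which a tool would fall back to its usage text.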

Original comment by dwidd...@gmail.com on 5 Jun 2008 at 6:01

GoogleCodeExporter commented 9 years ago
Thanks so much for your reply. I apologize for not reading the documentation
properly!
The query does work, but it doesn't return any results. When I search for the
same term using lucene, it does return a document. Maybe I'm still doing the
wrong query?
Thanks for your help!

Original comment by saleandr...@gmail.com on 6 Jun 2008 at 9:28

GoogleCodeExporter commented 9 years ago
Sorry I didn't get back to you sooner on this. If you're using the test Bible
corpus and default settings, I think (unfortunately) the term "test" doesn't
make the frequency cutoff. Try the term "abraham" instead, and let me know if
this works!

Original comment by dwidd...@gmail.com on 9 Jun 2008 at 2:34

GoogleCodeExporter commented 9 years ago
Thanks for your reply!
I'm actually using my own corpus. When searching using lucene I do find the
term, but not with semanticvectors. Is that expected?

Also, I exported the term and document vectors as text, and for two documents
in my corpus the values for all dimensions are NaN. I'll look into how these
documents differ from the others and how the vectors are created, but maybe
this has happened to someone else before? Thanks again.

Original comment by saleandr...@gmail.com on 9 Jun 2008 at 2:38

GoogleCodeExporter commented 9 years ago
Just to get back on the NaN problem: I'm just doing some tests, so I'm using a
very small corpus, and there were actually two documents which didn't have any
terms that matched the termvector terms. So the division in the
VectorUtils.getNormalizedVector method was returning NaN. Hope that helps, and
thanks for your attention!

Original comment by saleandr...@gmail.com on 10 Jun 2008 at 4:49

GoogleCodeExporter commented 9 years ago
It's common for terms to be indexed by Lucene yet fall below the frequency
cutoff used by SemanticVectors. If you add the flag "-m 0" to the BuildIndex
command, this will index all terms.
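As a rough illustration of how a frequency cutoff can hide a term that Lucene still finds, here is a minimal sketch. It is not the BuildIndex implementation: the class FrequencyCutoffSketch, the term counts, and the choice of an inclusive threshold are all assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a minimum-frequency filter; not SemanticVectors code.
class FrequencyCutoffSketch {
    /** Keeps only terms whose count is at least minFrequency (inclusive cutoff is an assumption). */
    static Map<String, Integer> applyCutoff(Map<String, Integer> termCounts, int minFrequency) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : termCounts.entrySet()) {
            if (entry.getValue() >= minFrequency) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("abraham", 120); // made-up counts for illustration only
        counts.put("test", 3);
        System.out.println(applyCutoff(counts, 10).keySet()); // prints [abraham]
        System.out.println(applyCutoff(counts, 0).keySet());  // prints [abraham, test]
    }
}
```

A term dropped by the cutoff never receives a term vector, so a search for it returns nothing even though the underlying Lucene index contains it.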

I don't yet understand how your NaN problem would arise (this highlights the
lack of unit tests for this project so far, I'm afraid). Could you send me the
corpus so I can try to replicate the NaN problem?

Original comment by dwidd...@gmail.com on 10 Jun 2008 at 5:17

GoogleCodeExporter commented 9 years ago
No problem, it is attached. They are just sample files, very few and very
simple. I indexed using the default options (-m 10). Then I exported
docvectors.bin as text, and I could see that for the document "gigs.txt" all
values are NaN. I think this happened because in the
VectorUtils.getNormalizedVector method the "norm" value is zero, and the new
normalized vector is created by dividing by zero (tmpVec[i] = tmpVec[i]/norm;).
Maybe this just happened because the corpus was so small?
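The effect described above can be reproduced on its own. The following is a minimal sketch, not the actual VectorUtils.getNormalizedVector code: the class NormalizeSketch and both method names are hypothetical. It assumes a plain Euclidean norm and shows that an all-zero vector produces NaN under naive division, along with one possible guard (returning the zero vector unchanged).

```java
// Hypothetical sketch; the real VectorUtils method may differ.
class NormalizeSketch {
    /** Euclidean norm of the vector, accumulated in double precision. */
    static float norm(float[] vec) {
        double sumSquares = 0.0;
        for (float v : vec) {
            sumSquares += (double) v * v;
        }
        return (float) Math.sqrt(sumSquares);
    }

    /** Naive version: divides by the norm even when the norm is zero. */
    static float[] naiveNormalize(float[] vec) {
        float norm = norm(vec);
        float[] out = new float[vec.length];
        for (int i = 0; i < vec.length; i++) {
            out[i] = vec[i] / norm; // 0f / 0f evaluates to NaN for float values
        }
        return out;
    }

    /** Guarded version: a zero vector is returned as zeros instead of NaN. */
    static float[] safeNormalize(float[] vec) {
        float norm = norm(vec);
        float[] out = new float[vec.length];
        if (norm == 0.0f) {
            return out; // nothing to scale; keep all components at zero
        }
        for (int i = 0; i < vec.length; i++) {
            out[i] = vec[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] empty = new float[3]; // a document with no matching terms
        System.out.println(Float.isNaN(naiveNormalize(empty)[0])); // prints true
        System.out.println(safeNormalize(empty)[0]);               // prints 0.0
    }
}
```

Whether zero vectors should be kept, skipped, or reported is a design choice; the guard here only shows that the NaN values come from the division and not from the corpus size as such.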

Also, I'm not sure how docvectors.bin can be used to compare documents with
each other. Is that possible? Thanks again!

Original comment by saleandr...@gmail.com on 10 Jun 2008 at 7:41

GoogleCodeExporter commented 9 years ago
I think I should mark this as a "WontFix", since the thread dried up months ago
(my fault) and we haven't seen recurrences of this problem.

Please contact me if we should still take action on this.

Original comment by dwidd...@gmail.com on 26 Mar 2009 at 11:09