crack521 / semanticvectors

Automatically exported from code.google.com/p/semanticvectors

CompareTerms and Search have side effects with VectorStores in RAM #13

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
After doing several comparisons or searches with a VectorStoreRAMCache,
results start to be NaN that weren't previously. This is a problem for
running large numbers of batch queries in experiments.

Would like to fix, but has proved difficult to track down, so won't fix in
version 1.18 but should mark as noted.

Original issue reported on code.google.com by dwidd...@gmail.com on 26 Mar 2009 at 11:12

GoogleCodeExporter commented 9 years ago
OK,

As mentioned in the e-mail today, I've had a look and I think I have a fix for
the problems I've been seeing with CompareTerms. It's not, however, some
intricacy of VectorStoreRAM. I found that the problem was due to having the
term 'NOT' in the list of terms, and eventually tracked it down to the
normalizing of the vectors in the orthogonalization process that happens when
there is a negated section in the terms list.

Basically I think sometimes you get floating point rounding errors (most likely
only when you throw pathologically long documents into CompareTerms as I am
wont to do). Of course, when you then try and normalize a zero-filled vector,
NaNs occur and propagate everywhere. I'm not 100% sure on why this then puts
the VectorStoreRAM instance into a state where it persistently produces these
errors, but maybe you'll have a better idea of that.
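
To make the failure mode concrete, here is a minimal sketch of how a NaN can
arise. It is not code from the project: the class and method names are
hypothetical, and it assumes negation removes the projection of one vector onto
another (a single Gram-Schmidt step) and then normalizes without any guard.

```java
final class NegationNaNSketch {
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Removes from v its projection onto the unit vector u. */
    static float[] removeProjection(float[] v, float[] u) {
        float scale = dot(v, u);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] - scale * u[i];
        }
        return out;
    }

    /** Unguarded normalization: divides by the norm even when it is zero. */
    static float[] naiveNormalize(float[] v) {
        float norm = (float) Math.sqrt(dot(v, v));
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm; // 0f / 0f evaluates to NaN
        }
        return out;
    }

    public static void main(String[] args) {
        float[] term = {1f, 0f};
        float[] negated = {1f, 0f}; // same direction as term
        float[] residual = removeProjection(term, negated); // {0, 0}
        float[] unit = naiveNormalize(residual);            // {NaN, NaN}
        // Any later similarity score involving this vector is NaN as well.
        System.out.println(dot(unit, new float[] {0.6f, 0.8f})); // prints NaN
    }
}
```

This only illustrates where a NaN comes from and how it spreads through later
dot products; it does not explain why the store keeps producing them.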

Anyway, I've got a small patch attached that stops NaNs occurring by setting a
normalized vector of entirely zeroes to also be zero. I'm not sure whether this
is sensible or not (it's just occurred to me maybe you want NaNs there), but it
does stop any NaNs occurring in the output. Hopefully if this isn't the right
fix you'll be able to use this info to work out what is (I presume that using
doubles rather than floats would make the problem much less likely to occur).
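
For readers without the attachment, the idea can be sketched roughly as
follows. This is not the patch itself: the class name is hypothetical, the
method name simply echoes getNormalizedVector from this thread, and the body is
an assumption about what such a guard might look like.

```java
final class ZeroSafeNormalizeSketch {
    /**
     * Returns a unit-length copy of v, or an all-zero vector when v has
     * zero norm (the convention proposed in this comment).
     */
    static float[] getNormalizedVector(float[] v) {
        double sumOfSquares = 0.0;
        for (float x : v) {
            sumOfSquares += (double) x * x;
        }
        float norm = (float) Math.sqrt(sumOfSquares);
        float[] out = new float[v.length]; // Java zero-fills new arrays
        if (norm == 0f) {
            return out; // skip the division that would yield NaN
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }
}
```

With this guard, a zero vector passes through normalization unchanged, so its
dot product with any other vector is 0 rather than NaN.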

Original comment by admac...@gmail.com on 6 Jul 2009 at 6:14

GoogleCodeExporter commented 9 years ago
Thanks very much for this. It never occurred to me how brittle the NOT syntax
would be when using large documents as queries; how silly of me. I'd suggest
that we fix this by changing it to something that can't be done accidentally,
e.g., by making the user type in "-NOT" rather than just "NOT" to explicitly
ask for vector negation.

I'm not sure about declaring the normalized version of zero to be zero; I can
see it being a handy convention, but I'm wary of saying that all callers of the
function should expect this. An alternative would be to throw a
ZeroVectorException: it's easy to refactor the other VectorUtils to catch
these, but in doing so I've created regression testing errors, so it's not
something I want to commit to. Given this, I'm tempted to push the assumption
into getNormalizedVector as you suggest.
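
For comparison, the exception-based alternative might look roughly like the
sketch below. Everything here is hypothetical: only the name
ZeroVectorException comes from this thread, and the methods are stand-ins
rather than the real VectorUtils code.

```java
final class ZeroVectorSketch {
    /** Hypothetical checked exception, named after the one mentioned above. */
    static class ZeroVectorException extends Exception {
        ZeroVectorException(String message) {
            super(message);
        }
    }

    /** Normalizes v, refusing to handle a vector of zero norm. */
    static float[] normalizeOrThrow(float[] v) throws ZeroVectorException {
        double sumOfSquares = 0.0;
        for (float x : v) {
            sumOfSquares += (double) x * x;
        }
        float norm = (float) Math.sqrt(sumOfSquares);
        if (norm == 0f) {
            throw new ZeroVectorException("cannot normalize a zero vector");
        }
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    /** Every caller now has to decide what a zero vector should mean. */
    static float cosine(float[] a, float[] b) {
        try {
            float[] na = normalizeOrThrow(a);
            float[] nb = normalizeOrThrow(b);
            float sum = 0f;
            for (int i = 0; i < na.length; i++) {
                sum += na[i] * nb[i];
            }
            return sum;
        } catch (ZeroVectorException e) {
            return 0f; // fallback chosen by this particular caller
        }
    }
}
```

The try/catch in cosine would have to be repeated in each utility that
normalizes, which is the refactoring cost weighed above.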

I'll try to see where the tests are breaking later and see if any alternative
looks suitably reliable.

Original comment by dwidd...@gmail.com on 6 Jul 2009 at 7:12

GoogleCodeExporter commented 9 years ago
Won't changing to '-NOT' break compatibility for existing users (sometimes
necessary of course)? What about a flag to the method to suppress negation?
(For my purposes I'll probably just filter out 'NOT' for the moment - which I'm
sure won't cost me anything - but it might be nice if there were a cleaner way
to do it)
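
The interim workaround could be as simple as the sketch below. The class and
method names are hypothetical, and it assumes the negation keyword is matched
as the exact uppercase token "NOT".

```java
import java.util.Arrays;

final class QueryFilterSketch {
    /** Returns the query terms with every exact "NOT" token removed. */
    static String[] dropNegationTokens(String[] queryTerms) {
        return Arrays.stream(queryTerms)
            .filter(term -> !"NOT".equals(term))
            .toArray(String[]::new);
    }
}
```

Lowercase "not" and words that merely contain the letters, such as "NOTE", are
left in place.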

Original comment by admac...@gmail.com on 7 Jul 2009 at 4:06

GoogleCodeExporter commented 9 years ago
I'd send a message to the group before changing NOT to -NOT. But I don't think
it will bother anyone. Just need to check that it doesn't look like another
flag to the parser.

Independently, I do like your idea of setting a flag "-negatedqueries false" or
something to that effect.

And independently again, I've come over to agreeing with your suggestion that
we define the normalized version of zero to be zero. Catching the
ZeroVectorException every time is just pointlessly fiddly, I think, and I
haven't found any examples in the codebase where zero is a "wrong" answer.

Where are we at this point? Ready to declare the issue fixed?

Original comment by dwidd...@gmail.com on 7 Jul 2009 at 8:32

GoogleCodeExporter commented 9 years ago
Seems fixed from my end - none of the problematic documents from before are
causing problems.

Original comment by admac...@gmail.com on 8 Jul 2009 at 1:19

GoogleCodeExporter commented 9 years ago
Great, marking this as Verified.

Original comment by dwidd...@gmail.com on 8 Jul 2009 at 4:07