OK,
As mentioned in the e-mail today, I've had a look and I think I have a fix for
the problems I've been seeing with CompareTerms. It's not, however, some
intricacy of VectorStoreRAM. I found that the problem was due to having the term
'NOT' in the list of terms, and eventually tracked it down to the normalizing of
the vectors in the orthogonalization process that happens when there is a
negated section in the terms list.
Basically, I think you sometimes get floating-point rounding errors that leave a
vector filled with zeroes (most likely only when you throw pathologically long
documents into CompareTerms, as I am wont to do). Of course, when you then try
to normalize a zero-filled vector, NaNs occur and propagate everywhere. I'm not
100% sure why this then puts the VectorStoreRAM instance into a state where it
persistently produces these errors, but maybe you'll have a better idea of that.
Anyway, I've got a small patch attached that stops NaNs occurring by defining
the normalized form of an all-zero vector to be the zero vector itself. I'm not
sure whether this is sensible or not (it's just occurred to me that maybe you
want NaNs there), but it does stop any NaNs occurring in the output. Hopefully,
if this isn't the right fix, you'll be able to use this info to work out what is
(I presume that using doubles rather than floats would make the problem much
less likely to occur).
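For readers following along, here is a minimal sketch of the failure and of the kind of guard described above. It is not the attached patch: the class and method names (ZeroSafeNormalizeSketch, naiveNormalize, zeroSafeNormalize) are made up for illustration, and the code only assumes that vectors are plain float arrays.

```java
/** Illustrative only: hypothetical names, not the project's VectorUtils. */
final class ZeroSafeNormalizeSketch {

    /** Euclidean norm, accumulated in double to avoid float underflow in the squares. */
    static double norm(float[] v) {
        double sumSq = 0.0;
        for (float x : v) {
            sumSq += (double) x * x;
        }
        return Math.sqrt(sumSq);
    }

    /** Divides every component by the norm; an all-zero input gives 0/0 = NaN. */
    static float[] naiveNormalize(float[] v) {
        float n = (float) norm(v);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / n;
        }
        return out;
    }

    /** Same as above, but returns an all-zero vector when the norm is zero. */
    static float[] zeroSafeNormalize(float[] v) {
        float n = (float) norm(v);
        float[] out = new float[v.length]; // Java initializes this to zeroes
        if (n == 0.0f) {
            return out; // convention under discussion: normalized zero is zero
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / n;
        }
        return out;
    }
}
```

Calling naiveNormalize(new float[3]) yields NaN components, and any dot product that touches them is NaN as well, which matches the "propagate everywhere" behaviour; zeroSafeNormalize(new float[3]) returns zeroes instead.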
Original comment by admac...@gmail.com
on 6 Jul 2009 at 6:14
Thanks very much for this. It never occurred to me how brittle the NOT syntax
would be when using large documents as queries, how silly of me. I'd suggest
that we fix this by changing it to something that can't be done accidentally,
e.g., by making the user type in "-NOT" rather than just "NOT" to explicitly ask
for vector negation.
I'm not sure about declaring the normalized version of zero to be zero. I can
see it being a handy convention, but I'm wary of saying that all callers of the
function should expect this. An alternative would be to throw a
ZeroVectorException: it's easy to refactor the other VectorUtils to catch these,
but in doing so I've created regression testing errors, so it's not something I
want to commit to. Given this, I'm tempted to push the assumption into
getNormalizedVector as you suggest.
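To make the trade-off concrete, here is a rough sketch of the exception-based alternative. Only the name ZeroVectorException comes from this thread; the class ZeroVectorSketch, the methods normalizeOrThrow and cosine, and the choice to return 0 as the fallback similarity are all assumptions made for illustration, not the project's code.

```java
/** Illustrative only: a hypothetical shape for the exception-based approach. */
final class ZeroVectorSketch {

    /** Checked exception signalling that a zero vector cannot be normalized. */
    static final class ZeroVectorException extends Exception {
        ZeroVectorException(String message) {
            super(message);
        }
    }

    /** Normalizes v, or throws if every component is zero. */
    static float[] normalizeOrThrow(float[] v) throws ZeroVectorException {
        double sumSq = 0.0;
        for (float x : v) {
            sumSq += (double) x * x;
        }
        if (sumSq == 0.0) {
            throw new ZeroVectorException("cannot normalize a zero vector");
        }
        float n = (float) Math.sqrt(sumSq);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / n;
        }
        return out;
    }

    /** Every caller now needs a try/catch and its own fallback value. */
    static float cosine(float[] a, float[] b) {
        try {
            float[] na = normalizeOrThrow(a);
            float[] nb = normalizeOrThrow(b);
            float dot = 0.0f;
            for (int i = 0; i < na.length; i++) {
                dot += na[i] * nb[i];
            }
            return dot;
        } catch (ZeroVectorException e) {
            return 0.0f; // assumed fallback: treat a zero vector as dissimilar
        }
    }
}
```

The catch block in cosine is the part that would have to be repeated in each utility that normalizes, which is the cost being weighed against simply returning zero from the normalizer.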
I'll try to see where the tests are breaking later and see if any alternative
looks suitably reliable.
Original comment by dwidd...@gmail.com
on 6 Jul 2009 at 7:12
Won't changing to '-NOT' break compatibility for existing users (sometimes
necessary, of course)? What about a flag to the method to suppress negation?
(For my purposes I'll probably just filter out 'NOT' for the moment, which I'm
sure won't cost me anything, but it might be nice if there were a cleaner way to
do it.)
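As a stopgap, the filtering I have in mind is roughly the following. This is only a sketch: the class and method names are invented here, and it assumes the negation token is the exact, case-sensitive string "NOT" and that the terms are already split into a String array before being passed on.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: drops the negation token before terms reach the query code. */
final class NegationFilterSketch {

    /** Assumed spelling of the negation token; adjust if the parser differs. */
    static final String NEGATION_TOKEN = "NOT";

    /** Returns a copy of terms with every exact occurrence of the token removed. */
    static String[] stripNegationToken(String[] terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!NEGATION_TOKEN.equals(term)) {
                kept.add(term);
            }
        }
        return kept.toArray(new String[0]);
    }
}
```

Terms such as "not" or "NOTE" are left alone under this assumption, so only the literal token is dropped.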
Original comment by admac...@gmail.com
on 7 Jul 2009 at 4:06
I'd send a message to the group before changing NOT to -NOT. But I don't think
it will bother anyone. Just need to check that it doesn't look like another flag
to the parser.
Independently, I do like your idea of setting a flag "-negatedqueries false" or
something to that effect.
And independently again, I've come over to agreeing with your suggestion that we
define the normalized form of the zero vector to be zero. Catching the
ZeroVectorException every time is just pointlessly fiddly, I think, and I
haven't found any examples in the codebase where zero is a "wrong" answer.
Where are we at this point? Ready to declare the issue fixed?
Original comment by dwidd...@gmail.com
on 7 Jul 2009 at 8:32
Seems fixed from my end - none of the problematic documents from before are
causing problems.
Original comment by admac...@gmail.com
on 8 Jul 2009 at 1:19
Great, marking this as Verified.
Original comment by dwidd...@gmail.com
on 8 Jul 2009 at 4:07
Original issue reported on code.google.com by
dwidd...@gmail.com
on 26 Mar 2009 at 11:12