quick fix to sort issue

mschoch commented 2 years ago

It has been observed that in some cases the computed sort value for a DocumentMatch becomes corrupted. The problem has been traced back to the doc values uncompressed slices, but for now a quick fix is proposed to copy the bytes associate with the sort key, ensuring that no other doc values operations can corrupt them.

mschoch commented 2 years ago

This is a quick fix that addresses the problem I found (though I do not yet have simple testcase to add).

It's not clear if this is the best fix, but it should be safe, as we just create an additional copy of the data we care about. There should will also be some slight perf hit caused by the additional copy of sort keys.

tmm1 commented 2 years ago

LGTM as a workaround

mschoch commented 2 years ago

Yesterday I spoke with @sreekanth-cb from the Bleve project to better understand why this issue seemed to affect Bluge and not Bleve. The key change has to do with the data type used for the sort key inside a match. In Bleve, sort keys are []string and during the development of Bluge I switched this to [][]byte. The thinking was that since Bluge is more focused on Go application developers, and not forcing decisions which can be made by the application, we should avoid the conversion to string. Also, there was an obvious performance gain as well (avoid unnecessary copies). However, what I didn't realize at the time was that this conversion to a string was not just an API convenience for Bleve, it was also ensuring correct behavior by copying a value that required copying.

So, with this deeper understanding of the problem, it now seems that this is the correct fix given the current design (doc values are made available inside a callback, and are only valid for the duration of the callback)

@voldyman fortunately this is somewhat easy to do back-of-the-napkin math to try and predict the behavior. With this change, one would expect the following additional memory allocations:

number of allocations: number of search hits number of elements in the sort key amount allocated: number of search hits avg sort key size in bytes

In the strict sense, it will of course perform slower than before, because additional copies are now made. However, without the fix, the sort order is easily corrupted (by design we reuse a buffer). However, these copies were already being made in the Bleve code-base, so if you're comparing against Bleve it should still be comparable.

The only other way to fix the problem would be to change the design of doc values, such that they don't require a copy. However, this would either require giving up compression of the doc values, or allocating new buffers for each decompression, both of which seem likely to cause an even larger performance regression.

mschoch commented 2 years ago

I'm going to merge this as we now have high confidence this is a real bug, and this fix restores correct behavior.

However, this fix copies all sort values, but really only doc values that are used for sorting need to be copied. This matters because some sort values are not derived from doc values, or they may originate form doc values, but produce a new safe-to-use []byte. In those cases we do not need to make an additional copy of the bytes. I will continue to investigate this, and see if there a way to offer a tighter fix.

blugelabs / bluge

quick fix to sort issue #87