Open romanchyla opened 3 years ago
bit of debugging info:
to verify (or find missing documents):
=body:"nordic optical" NOT body:"nordic optical"
if everything works as expected, the query must result in 0 docs (collection2 returns 224 docs right now; collection1 0)
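A minimal sketch of how this check could be scripted for both collections. It assumes the standard Solr `/select` handler with a JSON response (`response.numFound`). The base URL is a placeholder, and it is an assumption that the handler's query parser accepts the `=` prefix used in the query above.

```python
import json
from urllib.parse import urlencode

# The sanity check from this issue: it must return 0 docs if everything works.
CHECK_QUERY = '=body:"nordic optical" NOT body:"nordic optical"'


def count_url(base_url, collection, query):
    """Build a /select URL that asks only for the hit count (rows=0)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{collection}/select?{params}"


def num_found(response_text):
    """Extract the hit count from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


def is_consistent(response_text):
    """True when the check query matched no documents."""
    return num_found(response_text) == 0
```

The response body could be fetched with `urllib.request.urlopen(count_url(...)).read()` for `collection1` and `collection2` in turn; any non-zero count points at documents that need a closer look.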
Because we don't store term vectors (due to size) this is terribly difficult to debug, but here is a review of what is known so far
Searching for
"nordic optical"
in body or abstract finds fewer documents than expected. Now, hold your breath .... tada, but only for the index built 1 week ago! An index which was built from scratch this Saturday is unaffected.
What is different? The index from last week has been compacted. The Solr release that built that index also had a bug (which should, however, only affect documents that had a synonym at the very first position of the indexed stream; it resulted in docs being rejected, i.e. not indexed).
Everything else is the same, including synonyms that are used for index time tokenization.
The problem is with the search query. The following
abstract:"nordic optical"
becomes abstract:"nordic syn::optical"
collection1: 1203 results collection2: 1170
when searching with
=abstract:"nordic optical"
collection1: 1203 results collection2: 1203
when searching
abstract:"nordic syn::optical"
(this one can only be done from inside Luke with the whitespace analyzer): collection2: 1170 results
So for 33 documents, the position of the token
syn::optical
-- looks like -- moved by 1. But I have no way to tell because we can't reconstruct the document due to missing term vectors. This query:
abstract:nordic NEAR1 abstract:optical
collection1: 1205 collection2: 1205
Which is totally confusing! -- PROXIMITY search only considers tokens that are next to each other, so it is (almost) the same thing as a phrase search. And I tried
abstract:"syn::optical nordic"
-- to verify the tokens were not swapped; that produces 0 results.
At this point, the suspicion falls on core optimization. To verify this theory, we'll have to repeat the same action. But we need to wait until a new core is built, since we don't want to break production (which works and is producing correct results).
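To make the "moved by 1" hypothesis concrete, here is a toy positional index. It is only a sketch, not Lucene's actual phrase matching: the documents and positions are made up, and the only assumption is that the synonym token `syn::optical` is normally indexed at the same position as `optical`.

```python
from collections import defaultdict


def build_index(docs):
    """Map token -> doc id -> set of positions; each doc is a list of (token, position)."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, tokens in docs.items():
        for token, pos in tokens:
            index[token][doc_id].add(pos)
    return index


def phrase_match(index, first, second):
    """Return doc ids where `second` sits exactly one position after `first`."""
    hits = set()
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(doc_id, set())
        if any(p + 1 in second_positions for p in first_positions):
            hits.add(doc_id)
    return sorted(hits)


# Hypothetical documents: in "shifted" the synonym token is off by one position.
docs = {
    "healthy": [("nordic", 0), ("optical", 1), ("syn::optical", 1), ("telescope", 2)],
    "shifted": [("nordic", 0), ("optical", 1), ("syn::optical", 2), ("telescope", 2)],
}
index = build_index(docs)

print(phrase_match(index, "nordic", "optical"))       # both documents
print(phrase_match(index, "nordic", "syn::optical"))  # only "healthy"
print(phrase_match(index, "syn::optical", "nordic"))  # nothing
```

Under this assumption a shifted document still matches the phrase on the plain token but drops out of the phrase on the synonym token, which is the same pattern as the 1203 vs. 1170 counts above. It does not prove that this is what happened in the compacted index.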