adsabs / montysolr

Solr for Astrophysics Data System
https://ui.adsabs.harvard.edu

Search misses "Nordic Optical" telescope results - possibly as a result of core optimization #179

Open romanchyla opened 3 years ago

romanchyla commented 3 years ago

Because we don't store term vectors (due to size), this is terribly difficult to debug, but here is a review of what is known so far.

Searching for "nordic optical" in body or abstract finds fewer documents than expected.

Now, hold your breath ... tada: this happens only for the index built 1 week ago! An index which was built from scratch this Saturday is unaffected.

What is different? The index from last week has been compacted (i.e. the core was optimized). The Solr release that built that index also had a bug (which should, however, only impact documents that had a synonym on the very first position of the indexed stream; it resulted in docs being rejected -- i.e. not indexed).

Everything else is the same, including synonyms that are used for index time tokenization.

The problem is with the search query: abstract:"nordic optical" becomes abstract:"nordic syn::optical"

When searching with abstract:"nordic optical":

collection1: 1203 results
collection2: 1170 results

When searching with =abstract:"nordic optical":

collection1: 1203 results
collection2: 1203 results

When searching with abstract:"nordic syn::optical" (this one can only be done from inside Luke with the whitespace analyzer):

collection2: 1170 results

So for 33 documents (1203 - 1170), it looks like the position of the token syn::optical moved by 1. But I have no way to tell, because we can't reconstruct the documents without term vectors.
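To make the "moved by 1" hypothesis concrete, here is a toy sketch in plain Python (this is not Lucene, and not how the index is actually stored). It assumes that in a healthy document the synonym token is stacked on the same position as the original token, and that in a damaged document it has slipped one position to the right. The documents, positions, and helper names (`build_positions`, `phrase_match`) are all made up for illustration.

```python
def build_positions(postings):
    """Map each token to the set of positions it occupies in one toy document."""
    index = {}
    for token, position in postings:
        index.setdefault(token, set()).add(position)
    return index


def phrase_match(index, first, second):
    """True if `second` sits exactly one position after some occurrence of `first`."""
    return any(p + 1 in index.get(second, set()) for p in index.get(first, set()))


# Hypothetical healthy document: the synonym shares a position with "optical".
healthy = build_positions(
    [("nordic", 0), ("optical", 1), ("syn::optical", 1), ("telescope", 2)]
)

# Hypothetical damaged document: the synonym token slipped one position right.
shifted = build_positions(
    [("nordic", 0), ("optical", 1), ("syn::optical", 2), ("telescope", 2)]
)

for name, doc in (("healthy", healthy), ("shifted", shifted)):
    print(
        name,
        "plain phrase:", phrase_match(doc, "nordic", "optical"),
        "synonym phrase:", phrase_match(doc, "nordic", "syn::optical"),
    )
```

Under these assumptions the plain phrase matches both toy documents, while the synonym phrase matches only the healthy one. That is the same pattern as the counts above: the exact search agrees across collections, the synonym-expanded one does not. It says nothing about what actually happened inside the compacted index.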

This query: abstract:nordic NEAR1 abstract:optical

collection1: 1205 collection2: 1205

Which is totally confusing! A proximity search (NEAR1) only considers tokens that are next to each other, so it is (almost) the same thing as a phrase search. I also tried abstract:"syn::optical nordic" to verify that the tokens were not swapped; that produces 0 results.

At this point, the suspicion falls on core optimization. To verify this theory, we'll have to repeat the same action, but we need to wait until a new core is built; we don't want to screw up production (which works and is producing correct results).

romanchyla commented 3 years ago

bit of debugging info:

  1. ssh -Y adsqb
  2. download luke and extract
  3. cd /proj.adsqb/var/lib/docker/volumes/backoffice_prod_montysolr_engine_data/_data/luke....
  4. ./luke.sh and open the index

romanchyla commented 3 years ago

to verify (or find missing documents):

=body:"nordic optical" NOT body:"nordic optical"

if everything works as expected, the query must return 0 docs (right now collection2 returns 224 docs; collection1 returns 0)
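A minimal sketch of running that check against both collections from a script, using only the Python standard library. The base URL is a placeholder, and it is an assumption that each core exposes a `/select` handler which understands the `=` exact-match syntax; the helper names (`select_url`, `num_found`, `count`) are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# The verification query from above: exact phrase minus synonym-expanded phrase.
CHECK_QUERY = '=body:"nordic optical" NOT body:"nordic optical"'


def select_url(base_url, collection, query):
    """Build a select URL that asks only for the hit count (rows=0)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


def num_found(response_text):
    """Extract the total hit count from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


def count(base_url, collection, query):
    """Run the query against one collection and return the number of hits."""
    with urlopen(select_url(base_url, collection, query)) as resp:
        return num_found(resp.read().decode("utf-8"))
```

Calling `count("http://localhost:8983/solr", "collection2", CHECK_QUERY)` (host and port are assumed) should return 0 for a healthy index; any non-zero number is the count of documents that the synonym-expanded phrase misses.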