clarin-eric / VLO

Virtual Language Observatory
GNU General Public License v3.0

Solr schema: consider making some 'string' fields 'text' #162

Closed twagoo closed 6 years ago

twagoo commented 6 years ago

Most facet fields currently have type string. For search (ranking) purposes, it is probably preferable to change these to text_general, which implies tokenisation and prepares the values for case-insensitive matching. Proposed fields to change the type for:

_languageName already is type text_general

This may increase the index size and indexing time somewhat.

twagoo commented 6 years ago

Note: this was triggered by a report by @hannahedeland. See CLARIN-D support system ticket #2018032610000028.

teckart commented 6 years ago

I would argue that values for search facets should be treated as "atomic" (apart from support for case-insensitivity) and not processed further (e.g. by tokenization). The purpose of the facets is to store some kind of "closed" vocabulary that we shouldn't treat as full text. The main search index ("text"), which contains almost the complete file content, already supports this functionality.
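To make the distinction concrete, here is a rough Python sketch of the three matching behaviours under discussion. It is only an illustration: the functions are hypothetical and merely approximate what a string field, a lowercased but untokenized field, and a tokenized field such as text_general do; they are not Solr's actual analysis chain, and the facet value is made up.

```python
import re


def match_as_string(value: str, query: str) -> bool:
    """Atomic, case-sensitive: the whole value must equal the query."""
    return value == query


def match_as_lowercased_keyword(value: str, query: str) -> bool:
    """Atomic but case-insensitive: compare whole values after lowercasing."""
    return value.lower() == query.lower()


def match_as_tokenized_text(value: str, query: str) -> bool:
    """Tokenized and lowercased: the query must equal one of the tokens."""
    tokens = re.findall(r"\w+", value.lower())
    return query.lower() in tokens


# Made-up facet value, for illustration only.
facet_value = "Example Speech Corpus"

for matcher in (match_as_string, match_as_lowercased_keyword, match_as_tokenized_text):
    print(matcher.__name__, matcher(facet_value, "speech"))
```

Only the tokenized variant matches a single word inside the value, which is exactly the behaviour that would turn a closed facet vocabulary into something closer to full text.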

It also would not solve Hanna's problem of missing HZSK resources for her search query ("hzsk") as this is already supported by the main index ("text"). It rather seems to be a problem of a broken index in the public VLO which only finds 40 resources, whereas the beta VLO (and also my local test instance) finds more than 2000.

twagoo commented 6 years ago

Good point that this should be covered by the full text index of the document's contents. So let's try to find out what's wrong with the production index. Depending on our findings we can probably close this ticket.

teckart commented 6 years ago

The problem can be solved by reverting (again) commit 4fd0ec8020a7024e3f21ff159fcfe789ea7f4cbb. The majority of resources are not indexed in field text and are only found for full text queries via other (stored) fields. The consequences can be seen by checking the (small) number of resources with a field _suggester (query: "_suggester:*"), which is stored and only filled via copyField from text (not stored). It is unclear, though, why a subset of records is indexed properly.
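For anyone who wants to repeat this check, the following Python sketch builds the count queries and reads the hit count from a Solr JSON response. The base URL and core name are placeholders (assumptions, not the actual VLO deployment), and both helper functions are hypothetical; only the standard select parameters q, rows and wt and the response.numFound field of the JSON response are relied upon.

```python
import json
from urllib.parse import urlencode

# Placeholder base URL and core name; adjust to the instance being checked.
SOLR_CORE_URL = "http://localhost:8983/solr/example-core"


def count_query_url(base_url: str, query: str) -> str:
    """Build a select URL that returns only the hit count (rows=0)."""
    params = {"q": query, "rows": 0, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


def num_found(response_text: str) -> int:
    """Extract the number of matching documents from a Solr JSON response."""
    return json.loads(response_text)["response"]["numFound"]


# Compare the number of records that have _suggester with the total.
print(count_query_url(SOLR_CORE_URL, "_suggester:*"))
print(count_query_url(SOLR_CORE_URL, "*:*"))
```

Fetching both URLs (for example with urllib.request.urlopen) and passing the response bodies to num_found gives the two counts; a large gap between them would indicate that _suggester, and therefore text, is populated for only a subset of records.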

twagoo commented 6 years ago

Ok, let's revert 4fd0ec8 for the 4.4 release. The expected difference in index size was not achieved anyway. Indeed it would be nice to know what the reason for the partial population of that field is. Perhaps these are remnants of fairly old imports (I don't think the index has been flushed since that change was applied).

teckart commented 6 years ago

That should not be the case. In my test environment I shut down Solr and delete the complete data directory every time before starting a new import to eliminate those effects. An unstored field text always has the same consequences, applying to exactly the same ~set~ number of files every time (tested for queries like "hzsk", "hamatac" or "creator" for multiple imports).

twagoo commented 6 years ago

Reverted in 6af18d45366677fe6b2c7e7a7a3b44a8bad68ade. Will try a new import on alpha shortly.

twagoo commented 6 years ago

After import (with alpha at e6f2ff2) the query hzsk yields 2214 results.