Closed by twagoo 6 years ago
Note: this was triggered by a report by @hannahedeland. See CLARIN-D support system ticket #2018032610000028.
I would argue that values for search facets should be treated as "atomic" (apart from case-insensitive matching) and not processed further (e.g. by tokenization). The purpose of the facets is to store some kind of "closed" vocabulary that we shouldn't treat as full text. The main search index ("text"), which contains almost the complete file content, already supports tokenized search.
It would also not solve Hanna's problem of missing HZSK resources for her search query ("hzsk"), since such queries are already covered by the main index ("text"). Rather, it seems to be a problem of a broken index in the public VLO, which finds only 40 resources, whereas the beta VLO (and also my local test instance) finds more than 2000.
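To make the distinction concrete, here is a minimal Python sketch of the two matching behaviours. It is only an illustration: the regex tokenizer is a rough stand-in for what a tokenized Solr field type does, and the facet value "HZSK Repository" is a made-up example, not a value taken from the VLO.

```python
import re


def atomic_match(facet_value: str, query: str) -> bool:
    """Treat the facet value as one unit; only case is ignored."""
    return facet_value.casefold() == query.casefold()


def tokenized_match(facet_value: str, query: str) -> bool:
    """Split the value into lower-cased word tokens and match any one of them.

    Simplified stand-in for a tokenized field type; real analyzers do more.
    """
    tokens = re.findall(r"\w+", facet_value.casefold())
    return query.casefold() in tokens


# Hypothetical facet value, for illustration only.
value = "HZSK Repository"

print(atomic_match(value, "hzsk"))             # whole value differs from query
print(atomic_match(value, "hzsk repository"))  # same value, different case
print(tokenized_match(value, "hzsk"))          # one token equals the query
```

With atomic treatment a partial query such as "hzsk" only finds the record through the full text index, which is the behaviour argued for above.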
Good point that this should be covered by the full text index of the document's contents. So let's try to find out what's wrong with the production index. Depending on our findings we can probably close this ticket.
The problem can be solved by reverting (again) commit 4fd0ec8020a7024e3f21ff159fcfe789ea7f4cbb. The majority of resources are not indexed in the field text and are only found by full text queries via other (stored) fields. The consequences can be seen by checking the (small) number of resources that have a field _suggester (query: "_suggester:*"), which is stored and only filled via copyField from text (not stored). It is unclear, though, why a subset of records is indexed properly.
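The check described above can be scripted roughly as follows. This is a sketch under assumptions: the base URL and core name (http://localhost:8983/solr/vlo-index) are placeholders, and the helper names are hypothetical. It relies only on Solr's standard select handler and the numFound value in its JSON response.

```python
from urllib.parse import urlencode


def count_query_url(base_url: str, query: str) -> str:
    """Build a select URL that returns only the hit count (rows=0)."""
    params = {"q": query, "rows": 0, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


def num_found(solr_response: dict) -> int:
    """Read the hit count from a parsed Solr JSON response."""
    return solr_response["response"]["numFound"]


def populated_share(populated: int, total: int) -> float:
    """Fraction of records in which the field is populated."""
    return populated / total if total else 0.0


# Placeholder core URL; adjust to the instance being checked.
BASE = "http://localhost:8983/solr/vlo-index"

print(count_query_url(BASE, "_suggester:*"))  # records with _suggester filled
print(count_query_url(BASE, "*:*"))           # all records
```

Fetching both URLs and comparing the two numFound values shows how many records actually received content via the copyField from text.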
Ok, let's revert 4fd0ec8 for the 4.4 release. The expected difference in index size was not achieved anyway. Indeed it would be nice to know what the reason for the partial population of that field is. Perhaps these are remnants of fairly old imports (I don't think the index has been flushed since that change was applied).
That should not be the case. In my test environment I shut down Solr and delete the complete data directory every time before starting a new import to eliminate those effects. An unstored text field always has the same consequences, affecting exactly the same number of files every time (tested with queries like "hzsk", "hamatac" or "creator" across multiple imports).
Reverted in 6af18d45366677fe6b2c7e7a7a3b44a8bad68ade. Will try a new import on alpha shortly.
Most facet fields currently have type string. For search (ranking) purposes, it is probably preferable to change the type of these to text_general, which implies tokenisation and preparing for case insensitive matching. Proposed fields to change the type for:
collection
keywords
resourceClass
subject
genre
modality
projectName
organisation
nationalProject
country
_languageName already is type text_general.
This may increase the index size and time somewhat.
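As an illustration of the proposed change, the sketch below generates replace-field commands for Solr's Schema API from the field list above. It assumes a managed schema; if the schema file is edited by hand, the equivalent is changing the type attribute of each field definition. The indexed, stored and multiValued values are assumptions and would have to be copied from the existing field definitions, since replace-field requires the full definition.

```python
import json

# Fields proposed above for the string -> text_general change.
FACET_FIELDS = [
    "collection", "keywords", "resourceClass", "subject", "genre",
    "modality", "projectName", "organisation", "nationalProject", "country",
]


def replace_field_commands(fields, new_type="text_general"):
    """Build a Schema API payload that redefines each field with new_type.

    The indexed/stored/multiValued values are assumed, not taken from the
    actual VLO schema.
    """
    return {
        "replace-field": [
            {
                "name": name,
                "type": new_type,
                "indexed": True,
                "stored": True,
                "multiValued": True,  # assumption
            }
            for name in fields
        ]
    }


payload = replace_field_commands(FACET_FIELDS)
print(json.dumps(payload, indent=2))
```

The payload would be POSTed to the core's schema endpoint, followed by a full re-import, because existing documents are not re-analysed when a field type changes.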