kaladay commented 1 year ago

Description

This commit represents the results of the investigation described in issue #515.

There are several things to consider. This change takes a compatible approach that focuses on re-using as much of the existing design as possible. Taking this approach carries some inconvenience and cost.

The use of df (default field) is not implemented with these changes but is strongly recommended for future Sage releases. This does not update the documentation (such as the GitHub documentation). That documentation needs to be updated following this set of changes.

Source Code Changes

Change the default query from *:* to just *. This restricts the searches to just the fields being searched rather than searching on everything. This can be further improved by providing a df (default field). When multiple fields are to be searched, then the df may not be a good idea.

Searching using a selected field now provides a wild card when the search field value is empty. This allows for default searches without values to work correctly.

The filter query and wild card were being improperly added when there are multiple filters. The wild card filter search query was also (improperly) being added when specific search values are specified. Change the logic to only add the wild card filter search query when there are no filters selected.
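As a rough illustration of the three query rules above (this is not the actual Sage code; the function name, the `WILDCARD` constant, the field names, and the exact shape of the wild card filter are all assumptions), the intended behavior can be sketched as:

```python
WILDCARD = "*"  # assumed wild card token for both the query and the filter


def build_search_params(field=None, value="", filters=None):
    """Return Solr-style parameters for one search request (illustrative only)."""
    value = (value or "").strip()

    if field:
        # A selected field with an empty value falls back to the wild card.
        query = f"{field}:{value or WILDCARD}"
    else:
        # The default query is "*" rather than "*:*".
        query = value or WILDCARD

    params = {"q": query}

    if filters:
        # Specific filters were selected: add only those, never the wild card.
        params["fq"] = [f"{name}:{term}" for name, term in filters]
    else:
        # No filters selected: the only case that gets the wild card filter.
        params["fq"] = [WILDCARD]

    return params
```

For example, `build_search_params(field="title_text")` yields the query `title_text:*`, while passing a list of `(name, term)` filters yields only those filters in `fq`. Escaping of special characters is deliberately left out of this sketch.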

Always enable sow (Solr's split-on-whitespace query parameter) by default. Ideally this should be a configurable option and future versions should allow for this to be configured. This is very important to have when passing strings. This may need to be turned off when intentionally adding quoted searches, and it should be explicitly investigated for regressions. Additional logic can be added to dynamically disable this if quoted strings are passed but this behavioral change is outside the scope of this change set.
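To make the possible follow-up concrete, here is a minimal sketch of what dynamically disabling sow for quoted searches might look like. This is not part of this change set, which always sends sow as enabled. The helper name and the quote-detection rule are assumptions.

```python
def with_sow(params, query):
    """Return a copy of params with the sow parameter set (hypothetical helper)."""
    updated = dict(params)
    # Assumed rule: two or more double quotes means a quoted phrase is present.
    # This change set itself always enables sow and does not do this check.
    has_quoted_phrase = query.count('"') >= 2
    updated["sow"] = "false" if has_quoted_phrase else "true"
    return updated
```

A real implementation would also need to decide how to treat escaped quotes and unbalanced quotes, which this sketch ignores.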

Solr Core Changes

We should be using the managed schema with dynamic fields. This allows for fine-tuning the behavior on an as-needed basis on read and/or write. Much of the needed functionality to do this is actually present by design. This is strong evidence of the good quality of the planning and the simplicity of the original Sage design. However, there are additional changes that can be made to polish this interface. Such changes are outside the scope of this change set and are left for future Sage releases.

This implementation is a complete rewrite from scratch of the previous design. It favors dynamic fields while simultaneously providing backwards and forwards compatibility fields.

Some of the Core Fields are preserved mostly as-is. Aside from the Core Fields, Sage should avoid the base field types and exclusively use the custom dynamic fields. Simple names are chosen for each of the dynamic fields but future versions of Sage should review and provide a more thorough plan on the naming scheme. Unfortunately, Solr Core has design limitations that prevent this from operating in an ideal manner.

The dynamic fields have a now standardized sub-type naming scheme. The default behavior is now always multi-valued. To have single valued type, the sub-type _si must be at the very end of the field name. To have the field get copied into the _text_ field, commonly known as the "Everything" field, the sub-type _t must be at the very end of the field name. These two sub-types are the only combinable types for design simplicity and practicality reasons.

Due to a design limitation of Solr, complex regex combinations of, say, _si and _t cannot be used. To work around this problem, all combinations have to be manually created. To avoid having to write all permutations, the order of specific sub-types is explicitly defined. The _si must come before _t and the _t must always be the last part (necessary for the copy field for _t to automatically work). For example, a single valued title field that is copied into _text_ should be named title_si_t. This can be done by adding this as a new metadatum. Not all combinations are supported for practicality reasons (and some might not make sense when combined).
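The ordering rule can be sketched as a pair of small helpers. These are illustrative only and are not part of Sage. The function names are assumptions, and a name such as title_text_si_t (a type followed by sub-types) is an assumption about how the type and sub-types combine.

```python
def dynamic_field_name(name, field_type=None, single_valued=False, copy_to_text=False):
    """Compose a field name with sub-types in the required order."""
    parts = [name]
    if field_type:
        parts.append(field_type)
    if single_valued:
        parts.append("si")  # _si must come before _t
    if copy_to_text:
        parts.append("t")  # _t must always be the very last part
    return "_".join(parts)


def split_sub_types(field_name):
    """Return (stem, single_valued, copy_to_text) by reading the suffixes."""
    copy_to_text = field_name.endswith("_t")
    stem = field_name[:-2] if copy_to_text else field_name
    single_valued = stem.endswith("_si")
    if single_valued:
        stem = stem[:-3]
    return stem, single_valued, copy_to_text
```

Here `dynamic_field_name("title", single_valued=True, copy_to_text=True)` gives title_si_t, matching the example above. Reading back a name with the sub-types in the wrong order, such as title_t_si, does not report the _t sub-type, which mirrors why the order has to be fixed.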

The most notable dynamic fields are:

*_text (and its sub-types): All text fields and may have _si or _t.
*_string (and its sub-types): A dynamic field compatible variation of the base string type and may have _si or _t.
*_ws (and its sub-types): All text fields explicitly using the white-space splitting tokenizer and may have _si or _t.
*_facet (and its sub-types): All text fields that are designed to be used during faceting and must not have _si or _t.
*_whole (and its sub-types): All text fields being stored as a single field and the entire string must match, probably should never have _si or _t.
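Because every combination has to be written out by hand, it can help to enumerate the patterns these rules imply. The sketch below is only an illustration: the rule table is an assumption derived from the list above (treating _whole as not combinable is a reading of "probably should never"), and it does not cover any other sub-types the schema may define.

```python
# Whether each notable type may carry the _si and _t sub-types (assumed from the list).
ALLOWS_SI_AND_T = {
    "text": True,
    "string": True,
    "ws": True,
    "facet": False,
    "whole": False,
}


def dynamic_field_patterns(base_type):
    """List the dynamic field patterns implied for one type."""
    patterns = [f"*_{base_type}"]
    if ALLOWS_SI_AND_T[base_type]:
        # Fixed order only: _si before _t, and _t last, so no *_t_si variant.
        patterns += [
            f"*_{base_type}_si",
            f"*_{base_type}_t",
            f"*_{base_type}_si_t",
        ]
    return patterns
```

With a fixed order, each combinable type needs four patterns rather than one per permutation.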

The original field schema structure is recreated utilizing these dynamic fields for compatibility reasons. Future versions of Sage ideally should not have most of these or should have none of these.

More complex "good to have" types such as URI are omitted. A URI type would be good to have because it would allow for partial matches against the URI (which could be a URN or a URL). For example searching for tamu.edu should match a full URL like https://library.tamu.edu/. At this time, these URI paths are using the _whole text type.

The research has revealed that searching using the white-space tokenizer results in more natural and more accurate searches. This is not well tested and there may be problems with this that are simply unknown due to my inexperience with the data and its intended daily or normal uses. The _ws Dynamic Field is available to give the Manager control over switching between the standard and the white-space tokenizer algorithms. This has a notable downside of adding even more fields to select from. This includes, perhaps, adding even more metadata.

Several of the problems with searching are solved by strongly separating the index and the query analyzers. Most of the time both the index and the query analyzers must be lower-case to match. This introduces a problem with the Faceting design provided by Sage. The tokenized results that are needed for both queries and indexes to operate in a proper or ideal manner interfere with the desired design of the faceting. The _facet Dynamic Field is added to address this problem. All facets should select the _facet Dynamic Field. This field is manually created using Copy Fields for compatibility reasons and to help avoid needing to create metadata for every single type to be faceted. This is an important design area that needs more design considerations for future Sage releases.
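The conflict between tokenized search fields and faceting can be shown with a simplified model. This is not how Solr's analyzers are actually implemented: `analyze` is a stand-in that only lower-cases and splits on white space, and the untokenized branch stands in for whatever the _facet type does in practice. All names and sample values are made up for illustration.

```python
from collections import Counter


def analyze(value):
    """Stand-in analyzer: lower-case, then split on white space."""
    return value.lower().split()


def facet_counts(values, tokenized):
    """Count facet buckets over either analyzed tokens or whole values."""
    counts = Counter()
    for value in values:
        counts.update(analyze(value) if tokenized else [value])
    return counts


subjects = ["Texas History", "Texas History", "Texas Maps"]
by_token = facet_counts(subjects, tokenized=True)
by_value = facet_counts(subjects, tokenized=False)
```

In this model `by_token` has the buckets texas, history, and maps, which is what a search field needs but not what a facet list should display, while `by_value` keeps the buckets Texas History and Texas Maps. Keeping a separate copy of the value for faceting lets both behaviors coexist.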

The _en_split text type is added to be compatible with the older Solr core design. I do not know if this is needed and future development should consider the removal or further adoption of this.

Many of these changes create a strong desire to improve the Sage UI and UX to make handling all of the sub-types and Dynamic Fields more practical. Such changes are considered outside the scope of this issue and are ignored. This is another important area that needs more technical design planning.

The solrconfig.xml must be changed due to the re-write of the schema. The changes are kept to a bare minimum and are primarily focused around Field Name changes. The _str Copy Fields in the solrconfig.xml are entirely removed.

Basic stop words are added as an example. This needs to be properly set up. The _facet Dynamic Field utilizes these stop words. There may need to be more than one stop word file, such as stopwords-facet.txt, if the facets need special handling beyond the normal stop words.

With this change, the system can have Dynamic Fields added through Sage itself. That means the Solr core does not need to be directly touched.

Be sure to rename managed-schema to managed-schema.xml when Solr Core gets updated to more recent versions.

Fixes #515

Type of change

Please delete options that are not relevant.

[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[x] This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

[x] Manually through SAGE's UI.
[x] Manually directly through Solr.
[x] Through unit tests.

Checklist:

[x] My code follows the style guidelines of this project
[x] I have performed a self-review of my code
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[x] My changes generate no new warnings
[x] New and existing unit tests pass locally with my changes

coveralls commented 1 year ago

Coverage: 45.835% (-0.03%) from 45.866% when pulling a094c79ba8849a32351a0df71b30524d7faaf480 on 515-solr_schema-redesign into ac124474e4790c9a79f5cd083f42ae8c320ac23c on staging.

jeremythuff commented 1 year ago

I am interested to know how these changes might or might not impact existing discovery views. What, if any, impact do you think it would have?

kaladay commented 1 year ago

> I am interested to know how these changes might or might not impact existing discovery views. What, if any, impact do you think it would have?

I built this schema based on reading documentation and doing live data testing (using Sage and its Discovery Views). So I know this works. There are details that do have an impact on the UX for a manager when it comes to selecting the field or possibly adding additional metadata to utilize some of the custom fields.

Having to understand the difference between the normal and the _ws (white-space) types is going to be a UX concern.

TAMULib / SAGE

Issue 515: Redesign Solr search process, particularly the Solr Core. #523
