Closed: kevwalsh closed this 3 months ago
@joagnitti @BerniXiongA6 this is the epic around KB search. There's also an issue about integrating a new search option. We should bring this back out of the icebox for Q4 goal of KB improvements.
@edmund-dunn
It sounds like Marisa plans on adding to the taxonomy, which would help improve the current search, which seems to be based on a view. View-based searches can get a bit wonky once you add too many fields, though. Depending on how many search elements we end up with, it may be better to switch to something more robust, like a Solr-based search.
@anantais I once heard (maybe right or wrong) that Solr had a cost. Is that true? You're not the first person (or even second) to mention Solr as an option so I'm very curious.
@EWashb Solr is open source, so it should be free. I have not implemented it on a site yet, but I have heard good things from other devs. I have experience with something similar, Elasticsearch, but I would never recommend it.
Once Jake is onboard and has full access, he will be taking over this work.
@JakeBapple I would like you to focus on this work for sprint 8.
Schedule pre-refinement/next steps meeting next week. Jake will do discovery work and research in the meantime.
Setting up Apache Solr may require more than just dev work (additional servers, permissions, etc.), so I would love to pull in @edmund-dunn to see if he has any opinions/experience with implementing this. We need to confirm what we need out of this search first and then decide on our best solution, including whether we can get away with an out-of-the-box search built on just a view.
Rescope to be a ticket:
For Jake, this will be a historical discovery (dig into previous tickets to understand the history of the search) and then look for quick technical enhancements for the KB search.
Some discovery notes: We are using the Drupal database as the source, with Knowledge Base having its own index that covers ONLY knowledge base articles.
The index warns of some performance impacts of automatic indexing of content for larger sites, and I'm not sure if this is something already looked into or not:
Search processors we are running:
Content access: Adds content access checks for nodes and comments.
Entity status: Exclude inactive users and unpublished entities (which have a "Published" state) from being indexed.
Highlight: Adds a highlighted excerpt to results and highlights returned fields.
HTML filter: Strips HTML tags from fulltext fields and decodes HTML entities. Use this processor when indexing HTML data – for example, node bodies for certain text formats. The processor also allows to boost (or ignore) the contents of specific elements.
Ignore case: Makes searches case-insensitive on selected fields.
Stemmer: Stems search terms (for example, talking to talk). Currently, this only acts on English language content. It uses the Porter 2 stemmer algorithm. For best results, use after tokenizing.
Processors we are not running:
Ignore characters: Configure types of characters which should be ignored for searches.
Index hierarchy: Allows the indexing of values along with all their ancestors for hierarchical fields (like taxonomy term references).
Number field-based boosting: Adds a boost to indexed items based on the value of a numeric field.
Reverse entity references: Allows indexing of entities that link to the indexed entity.
Role-based access: Adds an access check based on a user's roles. This may be sufficient for sites where access is primarily granted or denied based on roles and permissions. For grants-based access checks on "Content" or "Comment" entities the "Content access" processor may be a suitable alternative.
Stopwords: Allows you to define stopwords which will be ignored in searches. Caution: Only use after both 'Ignore case' and 'Tokenizer' have run.
Tokenizer: Splits text into individual words for searching.
Transliteration: Makes searches insensitive to accents and other non-ASCII characters.
Type-specific boosting: Adds a boost to indexed items based on their datasource and/or bundle.
The processors we aren't running that may be worth looking into in my opinion are:
Index hierarchy - more description: This processor is mainly used in conjunction with hierarchical taxonomy vocabularies. If you have such a vocabulary, you usually want searches for a high-level category to also return results tagged with lower-level terms – for instance, filtering for "Europe" should also return content from "Denmark". This processor will facilitate this behavior by indexing, for every encountered taxonomy term, all its parent/ancestor terms, too. This also works for fields of other types that reference entities of the same type. If you have such a setup and want hierarchy functionality for that, too, you can also use this processor.
Tokenizer: Splits indexed text into individual words. As dedicated search backends, like Apache Solr or OpenSearch, typically do a very good job in this regard, it is mainly meant for use with the Database backend, for which it offers more control over the tokenization process. The processor works in the following way when indexing a piece of text (say, a node’s body):
At search time, the keywords entered by the user are processed in a similar manner to ensure they match as expected.
Stopwords: Keeps certain (configured) words from being indexed, usually very common words that don't add much meaning. This can be used to make matching and scoring more accurate, and also improve performance (by keeping the fulltext index smaller). For best results, this should be used alongside (and after) "Tokenizer".
Ignore characters (maybe): Allows you to remove certain characters from indexed field values and search keywords.
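To make the four candidates above a bit more concrete, here is a rough Python sketch of what they do conceptually at index time. This is not how the Drupal Search API implements them; the character set, the stopword list, the taxonomy mapping, and every function name below are hypothetical and only there for illustration.

```python
import re

# Hypothetical settings; none of these values come from the site's real config.
IGNORED_CHARS = re.compile(r"[-_.,'\"]")
STOPWORDS = {"a", "an", "the", "in", "of", "to"}
# Hypothetical taxonomy: child term -> parent term.
TERM_PARENTS = {"Denmark": "Europe", "Europe": "World"}


def tokenize(text):
    """Lowercase the text, remove ignored characters, then split on whitespace."""
    cleaned = IGNORED_CHARS.sub("", text.lower())
    return cleaned.split()


def remove_stopwords(tokens):
    """Drop configured stopwords; meant to run after tokenize()."""
    return [token for token in tokens if token not in STOPWORDS]


def with_ancestors(term):
    """Return the term followed by all of its parent/ancestor terms."""
    chain = [term]
    while chain[-1] in TERM_PARENTS:
        chain.append(TERM_PARENTS[chain[-1]])
    return chain


def index_item(body, term):
    """Build the (toy) index entry for one article."""
    return {
        "words": remove_stopwords(tokenize(body)),
        "terms": with_ancestors(term),
    }


entry = index_item("The broken links report", "Denmark")
print(entry)
```

One thing the sketch makes visible: with a stopword list that contains "in", a search for "log in" would be reduced to just "log", so the stopword list would need to be chosen with the use cases in this issue in mind.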
To fully test these options in concert or individually, I'll need some search context for what users are frustrated with to see if I can get this to work as expected.
Listing current behavior for the use cases listed in this issue.
Searching for "log in": This seems to be what a user would expect.
Searching for "system health service": This is not what would be expected.
Searching for "session 3": This seems to be what a user would expect.
Searching for "pdf": This seems acceptable given how little context the search term provides.
Searching for "broken links": This also seems acceptable given what is being searched.
Only #2 (searching for "system health service") appears to be an issue among the use cases from this ticket.
Implementing indexing for these settings:
Here are the changes in results:
Searching for "log in": Minor changes in lower results, but encouraging that it's looking more for "log in" than just "log" and "in"
Searching for "system health service": Minor changes in 2nd and lower results ranking where "health system service" is given more weight.
Searching for "session 3": Worse results for session 3. This is most likely because numbers are not being weighted at all, while the word "session" is seen more times on pages 5 and 1, so they are ranked higher.
Searching for "pdf": No change
Searching for "broken links": No change
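The suspected cause of the "session 3" regression can be illustrated with a toy scorer. This is only a sketch of the hypothesis (numeric keywords contributing nothing to the score), not the Search API's actual ranking; the page titles, page text, and function names are all made up.

```python
# Hypothetical page contents, already split into tokens.
PAGES = {
    "Session 1": "session 1 session recap".split(),
    "Session 3": "session 3 agenda for day 3".split(),
}


def score(query_tokens, page_tokens, skip_numbers):
    """Sum how often each query token appears on the page."""
    total = 0
    for token in query_tokens:
        if skip_numbers and token.isdigit():
            continue  # the numeric keyword adds nothing to the score
        total += page_tokens.count(token)
    return total


def rank(query, skip_numbers):
    """Return page titles ordered from best to worst match."""
    tokens = query.lower().split()
    return sorted(
        PAGES,
        key=lambda title: score(tokens, PAGES[title], skip_numbers),
        reverse=True,
    )


print(rank("session 3", skip_numbers=True))
print(rank("session 3", skip_numbers=False))
```

In this toy setup, ignoring the number puts the page that repeats "session" most often on top, while counting the "3" brings the Session 3 page back to first place. If the real index behaves similarly, the fix would be about how numbers are tokenized or weighted rather than about the other processors.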
Great work, @JakeBapple! I am going to grab time for us to regroup on this work next week to determine next steps.
After the pre-refinement, this is how we will move forward:
Background
A previous iteration of the CMS team looked into improvements to the knowledge base search. The current search is limited, and editors have complained that it doesn't provide the results that they need. As we are redesigning the knowledge base, we want to look at the history of the search and at how previous CMS teams have proposed improvements, and look for opportunities for quick fixes now and a more long-term approach to fixing the knowledge base search.
User Story or Problem Statement
The CMS team needs to understand how the original knowledge base search was created and identify opportunities for both short-term and long-term solutions.
Previous Team's Proposed Solutions
In #7012, we found some concrete ways to improve search in the short term, including more search-friendly content, and better indexing of titles.
We also identified a number of ideas to address some of the user stories identified.
Relevant Links
Affected users and stakeholders
Acceptance Criteria