department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Further improve Knowledge Base search #8136

Closed kevwalsh closed 3 months ago

kevwalsh commented 2 years ago

Background

A previous iteration of the CMS team looked into improving the knowledge base search. The current search is limited, and editors have complained that it doesn't return the results they need. As we redesign the knowledge base, we want to review the history of the search and how previous CMS teams proposed to improve it, and to look for both quick fixes we can make now and a longer-term approach to fixing the knowledge base search.

User Story or Problem Statement

The CMS team needs to understand how the original knowledge base search was created and what opportunities exist for both short-term and long-term solutions.

Previous Team's Proposed Solutions

In #7012, we found some concrete ways to improve search in the short term, including more search-friendly content, and better indexing of titles.

We also identified a number of ideas that address some of the user stories identified.

  1. Add another sort (title ASC) to the results view. This will improve readability for serialized articles and make it easier to find results.
  2. Add tagging by category (curated taxonomy).
  3. Add tagging by keyword (open taxonomy).
  4. Expose filters for category and product in the results view.
  5. Add clickable keyword tags to the search results and the full page view.
  6. Potentially add the rate module so users can rate what was helpful, and use that rating as a weighting criterion.
  7. Switch to a Solr-based search, which can further improve search accuracy, string matching, and the weighting options we have available.

Relevant Links

Affected users and stakeholders

Acceptance Criteria

EWashb commented 1 year ago

@joagnitti @BerniXiongA6 this is the epic around KB search. There's also an issue about integrating a new search option. We should bring this back out of the icebox for the Q4 goal of KB improvements.

gracekretschmer-metrostar commented 8 months ago

@edmund-dunn

anantais commented 7 months ago

It sounds like Marisa plans on adding to the taxonomy, which would help improve the current search, which seems to be based on a view. View-based searches can get a bit wonky once you add too many fields, though. Depending on how many search elements we end up with, it may be better to switch to something more robust like a Solr-based search.

EWashb commented 7 months ago

@anantais I once heard (maybe right or wrong) that Solr had a cost. Is that true? You're not the first person (or even second) to mention Solr as an option so I'm very curious.

anantais commented 7 months ago

@EWashb Solr is open source, so it should be free. I have not implemented it on a site yet, but I have heard good things from other devs. I have experience with something similar, Elasticsearch, but I would never recommend it.

gracekretschmer-metrostar commented 7 months ago

Once Jake is onboard and has full access, he will be taking over this work.

gracekretschmer-metrostar commented 6 months ago

@JakeBapple I would like you to focus on this work for sprint 8.

gracekretschmer-metrostar commented 6 months ago

Schedule pre-refinement/next steps meeting next week. Jake will do discovery work and research in the meantime.

JakeBapple commented 3 months ago

Setting up Apache Solr may require more than just dev work (additional servers, permissions, etc.), so I would love to pull in @edmund-dunn to see if he has any opinions/experience with implementing this. We also need to confirm what we need from this search first and then decide on the best solution, including whether we can get away with an out-of-the-box search built on just a view.

gracekretschmer-metrostar commented 3 months ago

Rescope to be a ticket:

For Jake, this will be a historical discovery (dig into previous tickets to understand the history of the search) and then look for quick technical enhancements for the KB search.

JakeBapple commented 3 months ago

Some discovery notes: We are using the Drupal database as the source, and Knowledge Base has its own index covering ONLY knowledge base articles.

The index warns of some performance impacts from automatically indexing content on larger sites, and I'm not sure whether this has already been looked into: [image]

Search processors we are running:

  1. Content access: Adds content access checks for nodes and comments.
  2. Entity status: Exclude inactive users and unpublished entities (which have a "Published" state) from being indexed.
  3. Highlight: Adds a highlighted excerpt to results and highlights returned fields.
  4. HTML filter: Strips HTML tags from fulltext fields and decodes HTML entities. Use this processor when indexing HTML data – for example, node bodies for certain text formats. The processor also allows to boost (or ignore) the contents of specific elements.
  5. Ignore case: Makes searches case-insensitive on selected fields.
  6. Stemmer: Stems search terms (for example, talking to talk). Currently, this only acts on English language content. It uses the Porter 2 stemmer algorithm (More information). For best results, use after tokenizing.
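To make the effect of two of the running processors concrete, here is a minimal Python sketch of what "HTML filter" and "Ignore case" are described as doing to a field before it is indexed. It is an illustration only, not the Search API implementation: the function names are hypothetical, the tag stripping is a naive regular expression, and boosting, highlighting, and stemming are left out.

```python
import html
import re


def html_filter(text):
    """Strip HTML tags and decode HTML entities (naive, regex-based)."""
    # Replace each tag with a space so adjacent words do not get glued together.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    decoded = html.unescape(without_tags)
    # Collapse the extra whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", decoded).strip()


def ignore_case(text):
    """Lowercase the text so matching is case-insensitive."""
    return text.lower()


def prepare_field(text):
    """Apply the two illustrative processors in sequence."""
    return ignore_case(html_filter(text))


print(prepare_field("<p>Forms &amp; <em>PDF</em> uploads</p>"))
```

Because the same preparation would be applied to the search keywords, a query for "PDF" and a query for "pdf" end up matching the same indexed text in this sketch.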

Processors we are not running:

  1. Ignore characters: Configure types of characters which should be ignored for searches.
  2. Index hierarchy: Allows the indexing of values along with all their ancestors for hierarchical fields (like taxonomy term references).
  3. Number field-based boosting: Adds a boost to indexed items based on the value of a numeric field.
  4. Reverse entity references: Allows indexing of entities that link to the indexed entity.
  5. Role-based access: Adds an access check based on a user's roles. This may be sufficient for sites where access is primarily granted or denied based on roles and permissions. For grants-based access checks on "Content" or "Comment" entities the "Content access" processor may be a suitable alternative.
  6. Stopwords: Allows you to define stopwords which will be ignored in searches. Caution: Only use after both 'Ignore case' and 'Tokenizer' have run.
  7. Tokenizer: Splits text into individual words for searching.
  8. Transliteration: Makes searches insensitive to accents and other non-ASCII characters.
  9. Type-specific boosting: Adds a boost to indexed items based on their datasource and/or bundle.

The processors we aren't running that may be worth looking into in my opinion are:

  1. If enabled, rudimentary CJK handling is applied.
  2. Numbers only separated by punctuation (like dates, telephone numbers, etc.) are merged to a single string of digits, such that it is possible to find them even when formatted in a slightly different way.
  3. The configured “ignored characters” are handled, if any: occurrences of two or more consecutive “ignored characters” are replaced by spaces, then all remaining ones are removed from the text.
  4. The text is then split into tokens, taking the configured “whitespace characters” as the separators.
  5. Finally, all tokens that are shorter than the configured “minimum word length” are removed.

At search time, the keywords entered by the user are processed in a similar manner to ensure they match as expected.
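Steps 2 through 5 above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the actual processor code: the function name, the default ignored character (an apostrophe), the separator set, and the minimum word length of 3 are all hypothetical values chosen for the example, and the CJK handling in step 1 is skipped.

```python
import re


def tokenize(text, ignored="'", separators=" \t\n.,;:!?()/-", min_length=3):
    """Illustrative tokenizer following steps 2-5 (no CJK handling)."""
    # Step 2: digits separated only by punctuation are merged into one string.
    text = re.sub(r"(?<=\d)[^\w\s]+(?=\d)", "", text)

    if ignored:
        ignored_class = "[" + re.escape(ignored) + "]"
        # Step 3a: runs of two or more ignored characters become a space.
        text = re.sub(ignored_class + "{2,}", " ", text)
        # Step 3b: any remaining single ignored characters are removed.
        text = re.sub(ignored_class, "", text)

    # Step 4: split on the configured separator characters.
    tokens = re.split("[" + re.escape(separators) + "]+", text)

    # Step 5: drop tokens shorter than the minimum word length
    # (this also discards empty strings produced by the split).
    return [token for token in tokens if len(token) >= min_length]


print(tokenize("Call 555-123-4567 re: VA.gov log in"))
# ['Call', '5551234567', 'gov', 'log']
print(tokenize("session 3"))
# ['session']
```

Note that with the assumed minimum word length of 3, short tokens such as "in" and "3" are dropped entirely, so the minimum word length is one setting worth checking when testing queries like "log in" or "session 3".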

To fully test these options in concert or individually, I'll need some search context for what users are frustrated with to see if I can get this to work as expected.

JakeBapple commented 3 months ago

Listing current behavior for the use cases listed in this issue.

  1. Searching for "log in": [image] This seems to be what a user would expect.

  2. Searching for "system health service": [image] This is not what would be expected.

  3. Searching for "session 3": [image] This seems to be what a user would expect.

  4. Searching for "pdf": [image] This seems acceptable given how little context the search term provides.

  5. Searching for "broken links": [image] This also seems acceptable given what is being searched.

Of the use cases from this ticket, only #2 (searching for "system health service") appears to be an issue.

JakeBapple commented 3 months ago

Implementing indexing for these settings: [image]

Here are the changes in results:

  1. Searching for "log in": [image] Minor changes in lower results, but it's encouraging that it's now looking more for "log in" than just "log" and "in".

  2. Searching for "system health service": [image] Minor changes in the ranking of the 2nd and lower results, where "health system service" is given more weight.

  3. Searching for "session 3": [image] Worse results for "session 3". This is most likely because numbers are not being weighted at all, while the word "session" appears more times on pages 5 and 1, so they are ranked higher.

  4. Searching for "pdf": No change

  5. Searching for "broken links": No change
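The "session 3" result in #3 can be illustrated with a toy term-frequency scorer. This is a simplified model for reasoning about the observation, not how the database search backend actually scores results; the page contents, the `score` and `rank` helpers, and the weight values are all made up for the example.

```python
def score(query_tokens, doc_tokens, weights=None):
    """Sum of (term weight x number of occurrences) for each query term."""
    weights = weights or {}
    return sum(weights.get(term, 1.0) * doc_tokens.count(term) for term in query_tokens)


def rank(query_tokens, docs, weights=None):
    """Return page titles ordered by descending score (ties broken by title)."""
    scored = [(score(query_tokens, tokens, weights), title) for title, tokens in docs.items()]
    return [title for _, title in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical pages: sessions 1 and 5 repeat the word "session" more often.
docs = {
    "Session 1": "session 1 recap session notes session recording".split(),
    "Session 3": "session 3 recap".split(),
    "Session 5": "session 5 recap session notes session recording session slides".split(),
}

query = ["session", "3"]

# If the number carries no weight, only occurrences of "session" count.
print(rank(query, docs, weights={"3": 0.0}))
# ['Session 5', 'Session 1', 'Session 3']

# If the number is boosted (hypothetical weight), the exact page wins.
print(rank(query, docs, weights={"3": 5.0}))
# ['Session 3', 'Session 5', 'Session 1']
```

Under these assumptions, a page that mentions "session" often outranks the page the user actually wants whenever the "3" contributes nothing to the score.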

gracekretschmer-metrostar commented 3 months ago

Great work, @JakeBapple! I am going to grab time for us to regroup on this work next week to determine next steps.

gracekretschmer-metrostar commented 3 months ago

After the pre-refinement, this is how we will move forward:

  1. Jake will work on improving the indexing and tokenizing of knowledge base search.
  2. Jake will bring in the design layout from Marisa's redesign for knowledge base landing page.
  3. Marisa will partner with Jake on building out a taxonomy for the knowledge base.
  4. Jake will use the taxonomy to improve the search.
  5. The CMS team will then explore using Elasticsearch or Solr search.