emory-libraries / blacklight-catalog


Rebuild and restore Collection filter to Advanced Search form based on prior spike #619

Closed tmiles2 closed 2 years ago

tmiles2 commented 3 years ago

This ticket follows prior work to enhance the page load time on the Advanced Search form, which was previously very slow when the Collections facet/filter was enabled.

In spike #510, several options were proposed to resolve this issue by changing how the collection facet is loaded and searched. Notes are copied below:


There have been multiple suggestions on how to handle facets with large numbers of values. The suggestions fall into two categories:

  1. live searching of the problem fields via Solr, producing results limited to a number chosen by stakeholders, and
  2. AJAX querying of a cache of all values that is updated whenever a reindex occurs.

These suggestions can be applied to all facet fields or just a selection, but bear in mind that when we carve out just a couple of the fields to work differently from the others, it increases the amount of code to write (more time spent) and may lead to user confusion.

Live Searching Solr

Pros:

  1. We are not duplicating already existing data from Solr in our Rails application.
  2. We are querying a single point of truth.
  3. The amount of new code needed would be slightly less (create an API action tied to an AJAX input tag that waits until the user has typed the needed number of characters and paused).

Cons:

  1. Possible flood of queries flowing in from multiple users utilizing the Advanced Search page at once. We should examine the usage of Lux's Advanced Search page and scale upwards to see whether the likely query volume would cause a problem.
  2. If we selectively apply this to certain fields, we can limit the number of queries to the Solr instance, but the user would probably be greeted with placeholder text on how to produce suggestions, which would be a break from all other fields that list all of the available options in a pull-down as well as a search. If we applied the new behavior to all fields, users would face repetitive searches until they have made all of their faceting choices.
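A minimal sketch of the live-search option above, assuming the facet lives in a field like `collection_ssim`; the field name, the 20-value limit, and the helper name are illustrative, not taken from the app:

```ruby
# Build the Solr parameters an AJAX endpoint could send to look up
# facet values matching what the user has typed. `facet.contains`
# filters the facet list server-side; `rows` of 0 skips document results.
def collection_facet_params(term, limit: 20)
  {
    'rows'          => 0,
    'facet'         => true,
    'facet.field'   => 'collection_ssim',
    'f.collection_ssim.facet.contains'            => term,
    'f.collection_ssim.facet.contains.ignoreCase' => true,
    'f.collection_ssim.facet.limit'               => limit
  }
end

# The API action would hand these params to the Solr connection
# (e.g. via RSolr) and render the matching facet values as JSON
# for the typeahead input on the Advanced Search form.
```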

Cached Searching

Pros:

  1. The needed values are closer (stored on the same server) and, therefore, faster.

Cons:

  1. The truth of the values comes into question. There is a potential for omissions/additions depending on how caching is implemented. The downsides of each caching method available to us are discussed in its breakdown below.

Caching Methods

  1. Rails Caching

    Pros:
    • Practically built in. The number of lines to change could be fewer than 10.
    • Can be applied to the whole controller action, but also to just the smallest fragment of data.

    Cons:
    • Since this caches data only after a page load, some users will still see the long load times. For example, since OAI publishing happens four times a day, we can expect to set the caches to expire every six hours. That means at least four users per day would see the same page load times we have now. If the page sees very little traffic, it is possible that all users in a day would see the same slow load times.
    • There will be variations in values based upon the random times at which users initialize the caches. Indexing running while a user opens the page could cause the most discrepancies, but there could be other situations that call the truth of the values into question.
  2. Dynamically-created Authority Document

    Pros:
    • This would allow minimal code to be written using Samvera's Questioning Authority (which could deliver all values or only searched-for ones).
    • It would also provide a human-readable document (YAML) that could be downloaded directly from the servers, giving us a reference in debugging or QA situations.

    Cons:
    • If we create/replace this document on a timed schedule, we will see gaps in the truth of the values. This could be minimized by tying the replacement of the document to re/indexing.
    • Generating this document involves all of the common problems of scraping data from objects and writing it out to a document (formatting, "garbage" characters, etc.), which would need to be debugged.
  3. Dynamically-created Authority Database

    Pros:
    • This could also be used by Samvera's Questioning Authority, with all of its built-in functionality.
    • It creates the ability to edit the "source of truth" quickly via the command line or possible admin-page actions.
    • It would be best to tie the syncing of this database to indexing, querying the Solr values after the records have updated.

    Cons:
    • None come to mind.
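A minimal sketch of method 1 (Rails caching): the in-memory store below is a stand-in for `Rails.cache.fetch` with a six-hour `expires_in`, and the Solr call is stubbed for illustration:

```ruby
# Stand-in for Rails.cache: the first caller pays the slow Solr cost;
# later callers within the expiry window read the cached copy.
CACHE = {}

def cache_fetch(key)
  CACHE.fetch(key) { CACHE[key] = yield }
end

def collection_values_from_solr
  # Placeholder for the slow facet query that makes the page load slowly.
  ['Raymond Danowski Poetry Library', 'Ted Hughes papers']
end

values = cache_fetch('advanced_search/collections') { collection_values_from_solr }
# In the app this would be roughly:
#   Rails.cache.fetch('advanced_search/collections', expires_in: 6.hours) { ... }
```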
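Method 2 could dump Solr's facet values into a Questioning Authority local-authority YAML file. A sketch of the generation step, where the `terms`/`id`/`term` layout follows QA's local YAML authority format and everything else (file location, trigger) is an assumption:

```ruby
require 'yaml'

# Turn a list of facet values into the YAML body of a QA local
# authority file (e.g. config/authorities/collections.yml). A rake
# task tied to re/indexing could write this file after each reindex,
# closing the "gaps in truth" a purely timed schedule would leave.
def authority_yaml(values)
  { 'terms' => values.map { |v| { 'id' => v, 'term' => v } } }.to_yaml
end
```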
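And a sketch of the sync step for method 3: after a reindex, compute which database rows to insert and delete so the table of collection names matches Solr. The table itself would be an ActiveRecord model in the app; only the diff logic is shown here:

```ruby
# Given the names currently in the database and the names Solr reports
# after a reindex, return the minimal changes to bring them in sync.
def sync_plan(db_names, solr_names)
  {
    insert: solr_names - db_names,  # new collections since the last sync
    delete: db_names - solr_names   # collections no longer in the index
  }
end
```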
eporter23 commented 2 years ago

@devanshu-m I tried testing against a few known collections and they don't seem to be coming up in the results (example: Raymond Danowski or Ted Hughes). Is this lookup pointed to the collection_ssim field?

Screen Shot 2021-07-23 at 11.10.51 AM.png

Screen Shot 2021-07-23 at 11.10.58 AM.png

devanshu-m commented 2 years ago

@eporter23 do we have collection names that are more than 255 characters? Names are saved in a database table as strings. If we have names more than 255 characters, I am guessing the save threw an error and stopped processing rest of the collection names.

eporter23 commented 2 years ago

@devanshu-m looks like we do have some long titles there. Here's an example of one with more than 255 chars: https://blackcat-test.library.emory.edu/catalog/990031862080302486

1976: Mental health statistical note / U.S. Department of Health, Education and Welfare, Public Health Service, Alcohol, Drug Abuse and Mental Health Administration, National Institute of Mental Health, Division of Biometry and Epidemiology, Survey and Reports Branch ; no. 144

devanshu-m commented 2 years ago

@eporter23 ok I can change columns to text and see if that helps.
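The fix described here would be a one-line migration; a sketch, assuming hypothetical table/column names (`collection_names`.`name`), with a small stub standing in for Rails so the snippet parses on its own:

```ruby
# Stub so the sketch is self-contained; in the app this is ActiveRecord.
module ActiveRecord
  class Migration
    def self.[](_version); self; end  # real Rails returns a versioned class
  end
end

# Widen the column from Rails' string default (VARCHAR(255) on MySQL)
# to TEXT so titles longer than 255 characters save without error.
class ChangeCollectionNamesToText < ActiveRecord::Migration[6.1]
  def change
    change_column :collection_names, :name, :text
  end
end
```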

lovinscari commented 2 years ago

@eporter23 I have reviewed the overall functionality, but I would like you to confirm that the work done to correct the formatting of the long titles meets your expectations. Please close once you have reviewed, assuming all is working as expected.

eporter23 commented 2 years ago

Discussed some further changes with @devanshu-m and @bwatson78 to optimize page load time. Suggestions include switching the lookup to use emory_collection_tesim (Emory-named collections only) and loading only the first 100 or so titles in the drop-down until a user starts typing.
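A sketch of the suggested behavior, with a hypothetical helper: preload only the first 100 titles, then switch to substring filtering once the user types. The threshold of 100 comes from the comment above; the helper name and filter logic are assumptions:

```ruby
# Return the options to show in the collection drop-down. With no input
# yet, show just the first `preload` titles; once the user types,
# filter the full list case-insensitively.
def collection_options(all_titles, term = nil, preload: 100)
  return all_titles.first(preload) if term.nil? || term.strip.empty?
  all_titles.select { |t| t.downcase.include?(term.downcase) }
end
```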

lovinscari commented 2 years ago

@eporter23 will create a new ticket for work on optimization of page load times.