avalonmediasystem / avalon

Avalon Media System – Samvera Application
http://www.avalonmediasystem.org/
Apache License 2.0

Prevent / support deep pagination #5838

Closed elynema closed 2 months ago

elynema commented 6 months ago

Description

It is a known issue with Solr / Blacklight that deep pagination into either facet sets or search results causes significant performance problems. We fairly regularly see requests in the MCO logs that go past the 100-page mark. These are normally one-off requests rather than sequential paging (e.g., requesting page 651, 652, etc.), but more targeted paging does sometimes occur and may be contributing to sudden slowdowns in Solr and CPU spikes on the Solr server.

How can we better handle these situations?

Done Looks Like

joncameron commented 6 months ago

There could be suggestions from the community on Blacklight and pagination performance to add to this issue.

cjcolvar commented 3 months ago

Some quick links to Blacklight discussion about deep pagination for me to look at later:

https://github.com/projectblacklight/blacklight/pull/3094
https://github.com/projectblacklight/blacklight/issues/1665
https://code4lib.slack.com/archives/C54TB5WDQ/p1564656616035000
https://code4lib.slack.com/archives/C0B3ELJQM/p1631023853055000

The first one is interesting since it provides a possible solution: just remove the last pages from the pagination bar at the bottom of search results, so it becomes 1 2 3 4 5 ... instead of 1 2 3 4 5 ... 10,825 10,826. Bots and users could still dig that deep, but this would make it less likely.

masaball commented 2 months ago

Thanks for the links! I had found some of the code linked from the slack discussions and issue 1665, but there was some good additional info in the other threads.

PR 3094 is an interesting one and does not look like it is in any v7 releases, so would be another impetus for pushing to Blacklight 8 once it supports Rails 7.2. Stanford homebrewed a similar approach 6-7 years ago, so neat to see comparable behavior incorporated upstream.

From reading through all of the links from @cjcolvar, the most common approach among other Blacklight institutions has been to limit deep pagination.

Other site search behavior:

Blacklight does not support cursorMarks because Solr cursors only move forward, not bi-directionally, which could break paging back and forth through results. They also would not help with bots/users jumping to arbitrary pages. If we wanted to investigate them anyway, we would have to roll our own implementation, and given the limitations on the Solr side I am not confident how much real benefit there would be.
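To illustrate the forward-only limitation, here is a minimal sketch of how a client walks a Solr cursor. It is not Blacklight code. `each_cursor_page` and the `fetch` callable are hypothetical names, and `fetch` is assumed to take a params hash and return the parsed Solr JSON response. The `cursorMark`, `nextCursorMark`, and `sort` parameters are Solr's own.

```ruby
# Walks a Solr result set with cursorMark, yielding one page of docs at a time.
# `fetch` is a hypothetical callable: params Hash in, parsed Solr response Hash out.
def each_cursor_page(fetch, query:, rows: 100, unique_key: "id")
  cursor = "*" # "*" asks Solr to start a new cursor
  loop do
    response = fetch.call(
      "q" => query,
      "rows" => rows,
      # Cursors require a sort that includes the uniqueKey field as a tie-breaker.
      "sort" => "score desc, #{unique_key} asc",
      "cursorMark" => cursor
    )
    yield response.fetch("response").fetch("docs")

    next_cursor = response.fetch("nextCursorMark")
    # Solr returns the same mark again once the results are exhausted.
    break if next_cursor == cursor

    cursor = next_cursor
  end
end
```

Each request needs the mark returned by the previous one, so there is no way to ask for "page 651" directly or to step backwards without remembering earlier marks. That is why cursors do not map cleanly onto a numbered pagination bar.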

There is some discussion of sitemaps and schemas in the code4lib conversations. People seemed to be saying that these can reduce bot traffic, but that deep pagination requests are heavy enough that a single request can sometimes cripple a site with a large enough dataset. I do not think our dataset is quite that large. Sitemaps seem beneficial in general, but they would not necessarily have a direct effect on this issue.

So at this time, it seems like the main way forward would be to limit how deep users can paginate, and maybe upgrade to blacklight 8 to get the configurable pagination bar.
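As a rough sketch of what limiting pagination depth could look like, the helper below converts a page number into the Solr `start` offset and flags requests past a cutoff. The module name, method names, and the 100-page default are all hypothetical. This is not an existing Blacklight or Avalon API.

```ruby
# Hypothetical helper; the names and the 100-page default are illustrative only.
module DeepPagingGuard
  DEFAULT_MAX_PAGE = 100

  # Solr is asked for start = (page - 1) * per_page and has to rank
  # start + per_page documents to answer, so cost grows with the page number.
  def self.solr_start(page, per_page)
    ([page.to_i, 1].max - 1) * per_page.to_i
  end

  # True when the requested page is beyond the allowed depth.
  # nil or non-numeric input is treated as page 0 and therefore allowed.
  def self.too_deep?(page, max_page: DEFAULT_MAX_PAGE)
    page.to_i > max_page
  end
end
```

In a Rails controller this could be called from a `before_action` that checks `params[:page]` and responds with an error page or a redirect to the last allowed page. Which response is appropriate would be a product decision.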

elynema commented 2 months ago

Presumably a future release of Blacklight 8.* will support Rails 7.2. Current Blacklight 8 isn't there yet.

elynema commented 2 months ago

Propose discussing first at Backlog Refinement, then we can schedule more time for discussion if needed.

joncameron commented 2 months ago

Looking at log data could be helpful as well to see what the requests are like in practice.
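One way to get a first look is sketched below. It assumes the application log contains the default Rails `Started GET "..." for <ip> at ...` request lines and that the search results page number arrives as a `page` query parameter. The method name and the 100-page threshold are hypothetical.

```ruby
# Matches the request line Rails writes by default, e.g.
#   Started GET "/catalog?page=651" for 203.0.113.7 at 2024-05-01 12:00:00 -0400
STARTED_LINE = /Started GET "(?<target>[^"]+)" for (?<ip>\S+) at/

# Returns a Hash of client IP => number of requests beyond `threshold` pages.
def deep_page_counts(lines, threshold: 100)
  counts = Hash.new(0)
  lines.each do |line|
    match = STARTED_LINE.match(line)
    next unless match

    query = match[:target].split("?", 2)[1]
    next unless query

    # Only a parameter named exactly "page" counts; "facet.page" is not matched.
    page = query[/(?:\A|&)page=(\d+)/, 1].to_i
    counts[match[:ip]] += 1 if page > threshold
  end
  counts
end
```

Running it with `deep_page_counts(File.foreach("log/production.log"))` (the path is an assumption) would show whether deep requests cluster on a few addresses or are scattered one-offs.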

elynema commented 2 months ago

I asked Digital Collections and IUCAT folks if they are doing anything about this. Digital Collections said no.

David Elyea said about IUCAT:

I think you could still use Rack Attack to limit paging if what you're seeing is every couple seconds the same IP Number (or user agent possibly) is requesting a new page of results. Here's a link to show how one app attempted this: https://github.com/mastodon/mastodon/blob/a021dee64214fcc662c0c36ad4e44dc1deaba65f/config/initializers/rack_attack.rb#L93

I've done A LOT with Rack Attack for IUCAT if you have any questions or need help. It's helped us a lot with bot issues. I think you might even be able to put a custom response page up in case any actual user accidentally gets throttled for your "deep pagination" rule.

joncameron commented 2 months ago

Putting Blacklight 8.x on the roadmap would be a good next step in that part of the investigation.

Next step: look at the logs and retrieve service statistics. Examining how much of an issue this is for us can be part of that; we don't need to fully block deep paging if it's not a large performance issue in our case. If the logs don't point to human users, best practice is to disable deep paging unless we can say it's not a problem.

Ideally we would not disable it. In practice, though, real users are unlikely to page deep into search results regularly.

Others report using https://github.com/rack/rack-attack successfully. See https://github.com/mastodon/mastodon/blob/a021dee64214fcc662c0c36ad4e44dc1deaba65f/config/initializers/rack_attack.rb#L93 for an example throttle configuration using this library.
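A minimal sketch of what a deep-pagination throttle might look like with rack-attack follows. `Rack::Attack.throttle` and `Rack::Attack.throttled_responder` are part of the library. Everything else is an assumption for illustration: the module name, the `/catalog` path prefix, the 100-page threshold, and the limit of 5 requests per minute.

```ruby
# Sketch of config/initializers/rack_attack.rb; names and limits are assumptions.
module DeepPaginationThrottle
  THRESHOLD = 100 # hypothetical: pages beyond this count as "deep"

  # Returns the throttle key (the client IP) for deep search requests, else nil.
  # rack-attack skips a throttle rule when its block returns nil.
  def self.discriminator(path, page_param, ip, threshold: THRESHOLD)
    return nil unless path.start_with?("/catalog")

    page_param.to_s.to_i > threshold ? ip : nil
  end
end

# Guarded so the file still loads where the rack-attack gem is not installed.
if defined?(Rack::Attack)
  # Allow each IP at most 5 deep-page requests per 60 seconds.
  Rack::Attack.throttle("deep_pagination/ip", limit: 5, period: 60) do |req|
    DeepPaginationThrottle.discriminator(req.path, req.params["page"], req.ip)
  end

  # Response sent to throttled clients. Note this applies to every throttle rule.
  Rack::Attack.throttled_responder = lambda do |_request|
    body = "Too many deep-page requests. Please narrow your search.\n"
    [429, { "content-type" => "text/plain" }, [body]]
  end
end
```

Shallow requests return `nil` from the block, so they never count toward the limit. Only repeated deep paging from one address would be slowed down, which fits the "ideally we would not disable it" goal.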

Also: what is the current level of throttling at the proxy level? We can check in about the current production status of how this is handled in our server architecture.

joncameron commented 2 months ago

@joncameron to write a new investigation issue to carry on the work here: how to measure the real-world load from deep paging and what we could do to mitigate the performance impact.

joncameron commented 2 months ago

Follow-on issue: #6038