WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Dead link validation Redis key size optimisation #2400

Open sarayourfriend opened 1 year ago

sarayourfriend commented 1 year ago

Problem

Dead link validation uses the following template for Redis keys: valid:{result.url}. The URLs for works can be very long, especially from providers like Wikimedia and Rawpixel (both of which are quite common). Even shorter, common URLs, like those from Flickr, can easily be optimised.

These keys could be optimised to save an estimated 1 to 1.5 GB in our production Redis.

Description

Change dead link validation keys to use the UUID of the work with the hyphens stripped. This would reduce the relatively small Flickr keys to 90 B rather than 120 B. It would reduce Wikimedia keys from 300 B to 90 B.

Cumulatively, this would bring a massive reduction in total key size.
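As a rough illustration of the proposed format (the helper name and example UUID below are hypothetical, not actual Openverse code), stripping the hyphens from a work's UUID yields a fixed 32-character key suffix regardless of how long the original URL was:

```python
# Hypothetical sketch of the proposed key construction.

CURRENT_KEY_TEMPLATE = "valid:{url}"          # URL-based keys can exceed 300 B for Wikimedia works
PROPOSED_KEY_TEMPLATE = "valid:{identifier}"  # UUID-based keys are fixed length


def proposed_validation_key(identifier: str) -> str:
    """Build a dead-link validation key from a work's UUID with the hyphens stripped."""
    return PROPOSED_KEY_TEMPLATE.format(identifier=identifier.replace("-", ""))


# Example with a made-up UUID:
# proposed_validation_key("0d5e9c2a-1b34-4f6e-9a7b-3c8d2e1f0a9b")
# -> "valid:0d5e9c2a1b344f6e9a7b3c8d2e1f0a9b"  (always 6 + 32 characters)
```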

Caveat: We need to avoid the performance penalty that would come from switching the key format without preserving the existing validation data. There are two approaches to this: one is faster but more complicated, and the other is much slower but far simpler. We will follow the slower approach (described below) because:

  1. The performance management/throttling needs of the fast version are unclear to me and I don't know how to estimate how "fast" processing 26 million keys in this way would actually be.
  2. Our Redis memory usage will go up as a result of #1969. The fast version is only worth it if we can reduce data usage quickly enough so that we can downgrade our Redis instance and save money before #1969 is implemented. After that IP is implemented, we will almost certainly need to upgrade our Redis instance size again.
  3. It isn't even that "slow" in the context of our project.
  4. It is significantly simpler, which makes it much harder for us to introduce bugs or cause problems in our app.

Decided approach: Configure check_dead_links to read cached validation from both key formats but only save to the new one. 120 days after that change is deployed, all keys in the previous format will have expired. At that point we can remove any logic that queries from them. This avoids the complications of the fast version but takes 120 days.
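A minimal sketch of the transitional dual-read logic, assuming a plain redis-py client; the function names, constants, and the exact shape of check_dead_links here are hypothetical, not the actual Openverse implementation:

```python
import redis  # assumption: plain redis-py; the API's actual Redis access layer may differ

redis_client = redis.Redis()

OLD_KEY = "valid:{url}"         # legacy format, read-only during the transition
NEW_KEY = "valid:{identifier}"  # new format, both read and written

# Validation results expire after 120 days (hypothetical constant name).
VALIDATION_TTL_SECONDS = 60 * 60 * 24 * 120


def cached_statuses(urls: list[str], identifiers: list[str]) -> list[bytes | None]:
    """Read cached validation, preferring the new UUID keys and falling back to the URL keys."""
    with redis_client.pipeline() as pipe:
        for identifier in identifiers:
            pipe.get(NEW_KEY.format(identifier=identifier.replace("-", "")))
        for url in urls:
            pipe.get(OLD_KEY.format(url=url))
        results = pipe.execute()

    new_values = results[: len(identifiers)]
    old_values = results[len(identifiers):]
    # Prefer the new-format value; fall back to the legacy one if present.
    return [new if new is not None else old for new, old in zip(new_values, old_values)]


def save_statuses(identifiers: list[str], statuses: list[int]) -> None:
    """Write validation results only under the new key format."""
    with redis_client.pipeline() as pipe:
        for identifier, status in zip(identifiers, statuses):
            key = NEW_KEY.format(identifier=identifier.replace("-", ""))
            pipe.setex(key, VALIDATION_TTL_SECONDS, status)
        pipe.execute()
```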

An alternative approach that may be faster was considered, but it was rejected for the reasons above:

Switching the key format will come with a potentially significant search penalty due to reprocessing links for which we already have validation information, so we would need to port existing keys to the new format:

  1. The URL is not indexed in the API database, so we need to query Elasticsearch using url.keyword to get each work's identifier.
  2. This requires iterating through the Redis keys, which must be done using SCAN to prevent performance issues.
  3. Each key's TTL should be copied to the new key, using the TTL command to retrieve it.
  4. The operation needs to be throttled to avoid a performance impact on production Redis and Elasticsearch.
  5. A new key prefix needs to be used to prevent reprocessing of already-converted keys and to keep the SCAN from going on forever. This also allows the check_dead_links function to temporarily check both prefixes and key formats during the transition, and to start saving new validation data under the new format immediately, without suffering a penalty during the transition and without adding new data that the key re-write process would need to manage.
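For reference, a rough sketch of what this rejected fast migration could have looked like, assuming plain redis-py and elasticsearch-py 8.x clients; the index name, the valid-id: prefix, and the batch/throttle values are all hypothetical:

```python
import time

import redis
from elasticsearch import Elasticsearch

# Assumptions: plain redis-py and elasticsearch-py 8.x clients; the index name,
# prefixes, and throttle values below are illustrative only.
redis_client = redis.Redis()
es = Elasticsearch("http://localhost:9200")

OLD_PREFIX = "valid:"     # legacy keys: valid:{url}
NEW_PREFIX = "valid-id:"  # hypothetical prefix so the SCAN never revisits converted keys
INDEX = "image"           # assumption: the Elasticsearch index holding the works


def migrate_keys(batch_size: int = 500, pause_seconds: float = 0.5) -> None:
    """Port legacy URL-based keys to UUID-based keys, preserving TTLs and throttling the scan."""
    cursor = 0
    while True:
        cursor, keys = redis_client.scan(cursor=cursor, match=f"{OLD_PREFIX}*", count=batch_size)
        for key in keys:
            url = key.decode()[len(OLD_PREFIX):]
            # url is not indexed in the API database, so look the work up in
            # Elasticsearch by exact match on url.keyword.
            hits = es.search(index=INDEX, query={"term": {"url.keyword": url}}, size=1)["hits"]["hits"]
            if not hits:
                continue
            # Assumption: the work's identifier is present in the document source.
            identifier = hits[0]["_source"]["identifier"].replace("-", "")

            ttl = redis_client.ttl(key)  # remaining TTL, carried over to the new key
            value = redis_client.get(key)
            if value is None or ttl <= 0:
                continue
            redis_client.setex(f"{NEW_PREFIX}{identifier}", ttl, value)

        if cursor == 0:
            break
        time.sleep(pause_seconds)  # throttle to protect production Redis and Elasticsearch
```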

zackkrida commented 1 year ago

I think the slow version sounds perfectly acceptable, and not even that slow relative to this project. 👍

obulat commented 1 year ago

Lowered the priority to medium because we presently have sufficient storage capacity (based on the weekly developer chat discussion).