Open sarayourfriend opened 1 year ago
I think the slow version sounds perfectly acceptable, and not even that slow relative to this project. 👍
Lowered the priority to medium because we presently have sufficient storage capacity (based on the weekly developer chat discussion).
Problem
Dead link validation uses the following template for Redis keys: `valid:{result.url}`. The URLs for works can be very long, especially from providers like Wikimedia and Rawpixel (both of which are quite common), and even the shorter, common URLs like Flickr's leave room for savings. These keys could be optimised to save an estimated 1 to 1.5 GB in our production Redis.
Description
Change dead link validation keys to use the UUID of the work with the hyphens stripped. This would reduce the relatively small Flickr keys from 120 B to 90 B, and Wikimedia keys from 300 B to 90 B.
Cumulatively, this would yield a substantial reduction in the Redis memory these keys consume.
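As a rough illustration of the savings, the sketch below compares the current URL-based key with a UUID-based key for a hypothetical Wikimedia work. The URL and UUID are made up; note that the raw new key is only 38 bytes, so the ~90 B figures above presumably include Redis's per-key overhead.

```python
import uuid

# Hypothetical work; the URL and UUID below are illustrative, not real records.
work_uuid = uuid.UUID("4bf2a4a5-8a54-4d6e-b743-eeaa2a04c0c2")
work_url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/"
    "Example_of_a_very_long_descriptive_Commons_file_name.jpg/"
    "640px-Example_of_a_very_long_descriptive_Commons_file_name.jpg"
)

old_key = f"valid:{work_url}"       # current format: key grows with URL length
new_key = f"valid:{work_uuid.hex}"  # proposed format: `hex` strips the hyphens

print(len(old_key.encode("utf-8")), len(new_key.encode("utf-8")))
```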
Caveat: we need to avoid the performance penalty that would come from switching the key format without preserving existing validation data. There are two approaches to this: one is faster but more complicated, the other much slower but far simpler. We will follow the slower, simpler approach described below.
Decided approach: configure `check_dead_links` to read cached validation data from both key formats but save only to the new one. 120 days after that change is deployed, all keys in the previous format will have expired; at that point we can remove the logic that reads them. This avoids the complications of the fast version but takes 120 days.

A faster alternative was considered but rejected for the reasons above; it is described below for reference.
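A minimal sketch of that dual-read, single-write behaviour (the function shape and the `identifier`/`url` attribute names are assumptions, not the actual Openverse implementation):

```python
from datetime import timedelta

CACHE_TTL = timedelta(days=120)  # assumed TTL matching the 120-day window above

def _new_key(result):
    # Proposed format: the work's UUID with hyphens stripped.
    return f"valid:{result.identifier.replace('-', '')}"

def get_cached_validity(redis, result):
    """Read cached dead-link status, preferring the new key format."""
    cached = redis.get(_new_key(result))
    if cached is None:
        # Fall back to the legacy URL-based key; once all legacy keys have
        # expired (120 days after deployment), this branch can be deleted.
        cached = redis.get(f"valid:{result.url}")
    return cached

def cache_validity(redis, result, is_valid):
    """Write only to the new key format, never the legacy one."""
    redis.set(_new_key(result), int(is_valid), ex=int(CACHE_TTL.total_seconds()))
```

Because writes only ever target the new format, the legacy keys age out naturally and no bulk migration is needed.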
Switching the key format would otherwise come with a potentially significant search penalty, because links for which we already have validation information would be reprocessed. We would need to port existing keys to the new format. URL is not indexed in the API database, so we would need to query Elasticsearch using `url.keyword` to look up the identifier. This requires iterating through Redis keys, which must be done using `SCAN` to prevent performance issues. Each key's TTL should be carried over to the new key, using the `TTL` command to retrieve it. The whole operation needs to be throttled to avoid impacting production Redis and Elasticsearch. A new key prefix is needed to prevent reprocessing already-converted keys and to keep the `SCAN` from running forever. The new prefix also allows the `check_dead_links` function to temporarily check both prefixes and key formats during the transition, and to start saving new validation data under the new format without incurring a penalty and without creating new data that the key re-write process would need to manage.
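For reference, one pass of the rejected migration might have looked roughly like the sketch below. The `valid-id:` prefix, the Elasticsearch index and field names, and the throttle values are all assumptions; a real implementation would also need batching, error handling, and careful rate limiting.

```python
import time

LEGACY_PREFIX = "valid:"
NEW_PREFIX = "valid-id:"  # hypothetical new prefix marking converted keys

def migrate_keys(redis, es, batch_size=100, pause_seconds=1.0):
    """One throttled migration pass (assumes a str-keyed Redis client).

    Iterates legacy keys with SCAN (never KEYS), looks up each work's
    identifier in Elasticsearch via url.keyword, and copies the value and
    remaining TTL to a key under the new prefix.
    """
    cursor = 0
    while True:
        cursor, keys = redis.scan(cursor, match=f"{LEGACY_PREFIX}*", count=batch_size)
        for key in keys:
            url = key[len(LEGACY_PREFIX):]
            # Hypothetical query; the index name and source field are guesses.
            hits = es.search(
                index="image",
                query={"term": {"url.keyword": url}},
            )["hits"]["hits"]
            if not hits:
                continue  # work not found; leave the legacy key to expire
            identifier = hits[0]["_source"]["identifier"].replace("-", "")
            ttl = redis.ttl(key)  # remaining lifetime, carried over to the new key
            if ttl > 0:
                redis.set(f"{NEW_PREFIX}{identifier}", redis.get(key), ex=ttl)
        if cursor == 0:
            break
        time.sleep(pause_seconds)  # throttle to protect production Redis and ES
```

The distinct `valid-id:` prefix is what lets the `SCAN` pattern match only unconverted keys, so the pass terminates even while new-format keys are being written.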