archlinux / archweb

Arch Linux website code
https://archlinux.org
GNU General Public License v2.0

Cache package pages intelligently #514

Open jelly opened 1 month ago

jelly commented 1 month ago

Our package pages can be quite slow without a cache because package.get_requiredby() is expensive. We could cache every package view's metadata, either under a common cache key or under package-$pkgname-metadata, and bust the cache once reporead has detected changes.

I don't know how often we update packages, so maybe this isn't practical and the cache would just be busted all the time...

If we want to cache per package, this is tricky: if I add a new package which depends on my cached package, it won't show up. So either we write a smart cache-busting algorithm or we don't.

Also other things influence the package page:

andrewSC commented 1 month ago

> If we want to cache per package, this is tricky: if I add a new package which depends on my cached package, it won't show up. So either we write a smart cache-busting algorithm or we don't.

I'm trying to think this through (apologies in advance if the logic/wording isn't clear lol).

So if I'm understanding correctly, the concern is if we have a scenario where:

1. Existing, available package is cached

2. New package is introduced that the existing cached package should list under "Required By" in the web ui, but doesn't, because it's cached (and hasn't been cache busted yet)

The concern is the existing cached package wouldn't show the new package under "Required By" in the web ui?

Can we just write two functions that:

1. if a new package is uploaded, checks/gets its dependencies

2. if the dependent package exists (probably safe to assume?), cache bust it so the next pull/page load from whomever looks at the page shows the new "Required By"'s?

Am I understanding the problem correctly?

jelly commented 1 month ago

> > If we want to cache per package, this is tricky: if I add a new package which depends on my cached package, it won't show up. So either we write a smart cache-busting algorithm or we don't.

> I'm trying to think this through (apologies in advance if the logic/wording isn't clear lol).

> So if I'm understanding correctly, the concern is if we have a scenario where:

> 1. Existing, available package is cached

> 2. New package is introduced that the existing cached package should list under "Required By" in the web ui, but doesn't, because it's cached (and hasn't been cache busted yet)

> The concern is the existing cached package wouldn't show the new package under "Required By" in the web ui?

Yes. We have Package.get_requiredby(), which for glibc uses 100 SQL queries, and I am not sure we can optimise that any further. glibc is probably the heaviest one, but other packages also use 10-100 queries for a package view. So caching this information would be beneficial.

> Can we just write two functions that:

> 1. if a new package is uploaded, checks/gets its dependencies

> 2. if the dependent package exists (probably safe to assume?), cache bust it so the next pull/page load from whomever looks at the page shows the new "Required By"'s?

> Am I understanding the problem correctly?

Yes. The best option would be to cache the metadata of a package, because that only depends on Package updates. I tried this before and it was fast, but my cache key was bogus, so other pages showed the wrong information. The cache key for that should be:

$pkgname-$arch-$repo-$pkgver. We have to verify that fetching name/arch/repo for the key doesn't itself cause too many SQL queries. Having the version in there would nicely invalidate the cache on every update. I hope memcached drops the stale entries by default.

Otherwise skip the pkgver.
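
A minimal sketch of the versioned key described above, to show why a new pkgver never hits the old entry. A plain dict stands in for memcached, and `metadata_cache_key`, `get_metadata` and `expensive_lookup` are hypothetical names for illustration, not existing archweb functions. The version strings are made up.

```python
from typing import Callable, Dict


def metadata_cache_key(pkgname: str, arch: str, repo: str, pkgver: str = "") -> str:
    """Build $pkgname-$arch-$repo-$pkgver; pkgver is left out when empty."""
    parts = [pkgname, arch, repo]
    if pkgver:
        parts.append(pkgver)
    return "-".join(parts)


def get_metadata(cache: Dict[str, dict], key: str, compute: Callable[[], dict]) -> dict:
    """Return the cached value for key, computing and storing it on a miss."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]


cache: Dict[str, dict] = {}  # stand-in for memcached
calls = []                   # records how often the expensive path runs


def expensive_lookup() -> dict:
    # Placeholder for the real queries behind the package view.
    calls.append(1)
    return {"required_by": ["example-pkg"]}


old_key = metadata_cache_key("glibc", "x86_64", "core", "1.0-1")
get_metadata(cache, old_key, expensive_lookup)  # miss: runs the lookup
get_metadata(cache, old_key, expensive_lookup)  # hit: served from the cache

# After an update the key changes, so the old entry is simply never read again.
new_key = metadata_cache_key("glibc", "x86_64", "core", "1.0-2")
get_metadata(cache, new_key, expensive_lookup)  # miss again under the new key
```

Note that the old entry stays in the cache until something evicts it, and that a version bump only refreshes the page of the package that was updated, not the pages of the packages it depends on.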

For Django templates we can create a unique cache key and destroy this cache when reading the repository metadata:

https://stackoverflow.com/questions/10778988/how-do-i-delete-a-cached-template-fragment-in-django

As required_by is the reverse of depends/makedepends/checkdepends, we can handle this when package A updates: destroy A's cache, then iterate over all of its depends/makedepends/checkdepends and destroy their caches too.

Obviously reading the repository db will become a bit slower then.
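
Roughly what that busting step could look like, as a sketch only. It assumes the package-$pkgname-metadata key from the first comment, uses a dict in place of the real cache, and represents a package as a plain dict; `names_to_bust` and `bust_for_update` are hypothetical helpers, not part of reporead.

```python
from typing import Dict, List

# The three relations that required_by is the reverse of.
DEP_FIELDS = ("depends", "makedepends", "checkdepends")


def names_to_bust(pkg: dict) -> List[str]:
    """Names whose cached pages go stale when pkg is added or updated."""
    names = [pkg["name"]]
    for field in DEP_FIELDS:
        for dep in pkg.get(field, []):
            if dep not in names:  # a package can appear in several fields
                names.append(dep)
    return names


def bust_for_update(cache: Dict[str, object], pkg: dict) -> None:
    """Drop the cached metadata of pkg and of everything it depends on."""
    for name in names_to_bust(pkg):
        # pop() with a default is a no-op for packages that were never cached.
        cache.pop(f"package-{name}-metadata", None)


# Made-up cache contents and package, for illustration only.
cache = {
    "package-glibc-metadata": "...",
    "package-gcc-metadata": "...",
    "package-bash-metadata": "...",
}
new_pkg = {"name": "foo", "depends": ["glibc"], "makedepends": ["gcc", "glibc"]}
bust_for_update(cache, new_pkg)
# Only the bash entry is left; glibc and gcc get rebuilt on their next page load.
```

A dependency that was dropped in the update would also need its entry busted, since its "Required By" list shrinks. This sketch only looks at the new dependency lists and does not handle that case.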