canonical / lxd


Images failing to build for a long time get evicted from `images:` #13634

Closed simondeziel closed 1 week ago

simondeziel commented 1 week ago

It seems Debian sid was last built successfully two weeks ago. Since stale images are evicted after a few days, `images:debian/sid` is no longer available for download: https://images.lxd.canonical.com/images/debian/sid => 404.

Considering that we are failing to address build failures swiftly and that experimental/unstable distros often fail to build, we should revisit the pruning policy to either fire after a much longer period or not fire as long as the image in question is supposed to be built (not EOL/removed from git).
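The two options above could be expressed as a single predicate. This is a minimal sketch for discussion, not lxd-imagebuilder's actual pruning code; `should_prune`, `max_age`, `still_defined` and `keep_while_defined` are all hypothetical names.

```python
from datetime import datetime, timedelta


def should_prune(last_build: datetime, now: datetime, max_age: timedelta,
                 still_defined: bool, keep_while_defined: bool) -> bool:
    """Decide whether a stale image should be evicted.

    Hypothetical policy sketch; not lxd-imagebuilder's real logic.
    """
    if keep_while_defined and still_defined:
        # Option 2: never evict an image that is still supposed to be built
        # (not EOL, not removed from git).
        return False
    # Option 1: evict only once the last good build is older than max_age,
    # where max_age would be set to a much longer period than today.
    return now - last_build > max_age
```

With `keep_while_defined=False` the behaviour depends only on `max_age`; with `keep_while_defined=True` an image is pruned only after it has been dropped from the build definitions.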

tomponline commented 1 week ago

> we should revisit the pruning policy to either fire after a much longer period or not fire as long as the image in question is supposed to be built (not EOL/removed from git).

Not keen on this. We don't want to be serving stale or out-of-date images. If we can't address the build failures in a timely manner, then in my view we shouldn't be committing to provide them.

So more time should be allocated to address these issues, at least to ascertain what the issue is and log a GitHub issue.

tomponline commented 1 week ago

@simondeziel are you able to take a look at the Debian sid issue?

tomponline commented 1 week ago

@simondeziel also the URL in your report is for linuxcontainers.org, which isn't managed by the LXD team.

setharnold commented 1 week ago

imho Debian sid or Ubuntu devel are "special" and serving the most recent successful build, even if it's a month old, is the right approach: transitions can sometimes take a while. Having easy access to a pre-broken image can help people find the causes of upgrade failures and fix them.

Of course there's the alternate view that if the image generation is broken, it'll be easier to spot if the images are discarded after a week. But maybe this can be addressed with a warning, "warning: the image is %d days old" or something.
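A rough sketch of what such a warning could look like follows. The function name `staleness_warning` and the 7-day threshold are assumptions made for illustration; neither LXD nor lxd-imagebuilder is known to provide this.

```python
from datetime import datetime
from typing import Optional


def staleness_warning(built_at: datetime, now: datetime,
                      warn_after_days: int = 7) -> Optional[str]:
    """Return a warning string for an old image, or None if it is fresh.

    Hypothetical helper; the threshold is an assumed default.
    """
    age_days = (now - built_at).days
    if age_days > warn_after_days:
        return "warning: the image is %d days old" % age_days
    return None
```

A client could print the returned string before launching an instance from the image, leaving the download itself unaffected.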

tomponline commented 1 week ago

That's a fair point, but the lxd-imagebuilder doesn't yet have the capability to define per-image eviction policies. Nor does it have the ability to emit warnings as suggested. A warning in an automated workflow is likely to go unnoticed too.

But my main concern is around publishing stale images in the simplestreams index with security vulnerabilities in them.

We saw with the xz incident that the vulnerable versions landed in the edge builds first, so serving stale images can be risky even on edge builds.

I feel like there is a potential middle ground here whereby we could update the lxd-imagebuilder pruning logic to not delete stale images, but rather remove them from the simplestreams index. This way normal consumers using the `lxc` tool will no longer be able to consume them until they are fixed and updated. However, users who want access to the pre-broken image files for analysis can still find and download them via https://images.lxd.canonical.com/ for manual import.
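To make the middle ground concrete, here is a minimal sketch of "delist but don't delete". It is not the real lxd-imagebuilder code: the `Image` type, `partition_for_index` and `max_age` are hypothetical, and the actual simplestreams index generation is not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Image:
    name: str
    last_build: datetime


def partition_for_index(images: List[Image], now: datetime,
                        max_age: timedelta) -> Tuple[List[Image], List[Image]]:
    """Split images into (indexed, delisted) without deleting any files.

    Fresh images stay in the index; stale ones are left out of it but
    remain on disk for manual download and import.
    """
    indexed: List[Image] = []
    delisted: List[Image] = []
    for img in images:
        if now - img.last_build > max_age:
            delisted.append(img)  # hidden from index consumers only
        else:
            indexed.append(img)
    return indexed, delisted
```

Only the first list would be written to the index; the second would simply be skipped rather than passed to a delete step.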

tomponline commented 1 week ago

I've looked at https://github.com/canonical/lxd/labels/Imagebuilder and cannot see that an issue has been reported about the specific Debian edge build failures. I'm going to follow up with @simondeziel and @MusicDin to understand whether this got missed (we have daily checks to look at build failures in our pipeline) and what can be done to fix the Debian edge builds (potentially by pinning the specific problem packages to an earlier version for now).

In the meantime, if you would like to look into the issue, the lxd-imagebuilder can be used with the Debian recipe to build an image locally.

tomponline commented 1 week ago

This PR should solve the Debian build problem: https://github.com/canonical/lxd-imagebuilder/pull/70

And we've agreed internally to increase the time that stale images are kept around to 15 days (up from 10). This gives us more time to resolve such issues without indefinitely extending (or significantly increasing) the time stale images stay published.
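As a worked example of what the change means for the Debian sid case in this issue: taking "two weeks" as exactly 14 days, and assuming eviction happens once the age strictly exceeds the retention window (the boundary behaviour is an assumption), the old and new settings give different outcomes.

```python
from datetime import timedelta

OLD_RETENTION = timedelta(days=10)  # previous window, per this thread
NEW_RETENTION = timedelta(days=15)  # newly agreed window


def is_evicted(age: timedelta, retention: timedelta) -> bool:
    # Assumed rule: evict once the last good build is older than the window.
    return age > retention


sid_age = timedelta(days=14)  # "two weeks", taken as exactly 14 days

print(is_evicted(sid_age, OLD_RETENTION))  # True: 14 days > 10 days
print(is_evicted(sid_age, NEW_RETENTION))  # False: 14 days <= 15 days
```

So a build outage of the length reported here would no longer cause eviction, but one lasting more than 15 days still would.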

Additionally, we are going to look at better ways to get notified about build failures, rather than relying on the current daily manual check (which has proven error-prone in this case).