fedora-copr / copr

RPM build system - upstream for https://copr.fedorainfracloud.org/

Fedora Asahi kernel builds are way too slow (~6 hours) #2925

Closed Conan-Kudo closed 1 year ago

Conan-Kudo commented 1 year ago

Something is wrong with the Fedora COPR builders, because they went from being able to build kernels in roughly a couple of hours to taking 6 hours to build.

For comparison of some older builds:

To current builds:

The amount of time these builds take is unreasonably long and it makes it very difficult for me to ship things in a timely fashion.

Even once these bottlenecks are figured out, it'd be great to upgrade the selected instances so that builds finish sooner.

Based on what I see for the COPR instance provisioning, it looks like we're using i4i.large for x86_64 and c7g.xlarge for aarch64.

Could we please look into bumping up to a larger instance? And maybe getting instances that have dedicated NVMe to not have I/O bottlenecks? Something like c7g.12xlarge or c7gd.4xlarge would be tremendously helpful for AArch64.

x86_64 has similar problems and could benefit from an upgrade to c7i.2xlarge.

If experiments are needed to find a happier medium, we're happy to help.

cc: @marcan @davdunc @davide125

praiskup commented 1 year ago

Thank you for the report. Can we do this in #2241?

Conan-Kudo commented 1 year ago

Sure. This is technically two issues anyway:

Conan-Kudo commented 1 year ago

Since there's been no progress on #2241, could the instance types be upgraded? It would generally get everything to move much faster if we have upgraded instances, and we skip bootstrap these days with mock 5+...

praiskup commented 1 year ago

> Since there's been no progress on https://github.com/fedora-copr/copr/issues/2241, could the instance types be upgraded?

#2241 is ready, but we still need to deploy it (define a new "on demand" pool of workers).

> could the instance types be upgraded?

We don't want to make Copr overly demanding (99% of builds would be doable on slower machines, so we could eventually decrease the power and have more builders instead, to parallelize better).

> and we skip bootstrap these days with mock 5+...

Can be done on a per-chroot/per-copr basis, sure. This will give you minimal speedup, though.

praiskup commented 1 year ago

> Something is wrong with the Fedora COPR builders, because they went from being able to build kernels in roughly a couple of hours to taking 6 hours to build.

Reading again ^^^, are you sure something changed in Copr/AWS? Isn't this a package building problem? To the best of my knowledge, we haven't changed the instance type since your last request (a1.xlarge => c7g.xlarge).

Conan-Kudo commented 1 year ago

Do we run the builds in tmpfs or on disk?

praiskup commented 1 year ago

Tmpfs (may overflow to swap and consume a lot of I/O)
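To make the tmpfs point concrete, here is a rough sketch of how one might check whether a build tree plus compiler memory fits in RAM before it spills to swap. The function names, the sample `/proc/meminfo` text, and all sizes are hypothetical illustration values, not measurements from the Copr builders.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def tmpfs_headroom_kb(meminfo, build_tree_kb, compiler_peak_kb):
    """Return MemAvailable minus estimated demand.

    A negative result suggests the tmpfs build would push pages to swap.
    """
    return meminfo["MemAvailable"] - (build_tree_kb + compiler_peak_kb)


# Hypothetical sample; on a real builder, read open("/proc/meminfo").read().
SAMPLE = """\
MemTotal:        8000000 kB
MemAvailable:    6000000 kB
SwapTotal:      16000000 kB
"""

info = parse_meminfo(SAMPLE)
# Assumed sizes: 5 GB of build tree in tmpfs, 2 GB of peak compiler memory.
headroom = tmpfs_headroom_kb(info, build_tree_kb=5_000_000, compiler_peak_kb=2_000_000)
print(headroom)
```

With these made-up numbers the headroom is negative, which is the situation where a tmpfs build starts generating swap I/O.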

Conan-Kudo commented 1 year ago

Oh, then we want memory-optimized instances. r7g.16xlarge (AArch64) and r7a.16xlarge (x86_64) are better choices.

praiskup commented 1 year ago

This still doesn't answer why the builds take 3x more now. Is this worth reporting against EC2?

Conan-Kudo commented 1 year ago

> This still doesn't answer why the builds take 3x more now. Is this worth reporting against EC2?

It takes at least double because now there are double the flavors being built. What happened is it became 2.5 hours with Rust, then I doubled the flavors because now there's 4K and 16K, and then more code got turned on, and here we are at 6~7 hours.
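As a back-of-the-envelope check of those numbers, the sketch below assumes build time scales linearly with the number of flavors (an assumption, not something measured here) and derives how much extra work the newly enabled code would have to add to get from 5 hours to the observed 6 to 7 hours. The function name is made up for illustration.

```python
def estimate_build_hours(base_hours, flavor_multiplier, extra_code_factor=1.0):
    """Estimate total build time, assuming linear scaling with flavor count."""
    return base_hours * flavor_multiplier * extra_code_factor


base_hours = 2.5          # build time with Rust enabled, from the comment above
flavor_multiplier = 2     # flavors doubled (4K and 16K variants)

doubled = estimate_build_hours(base_hours, flavor_multiplier)

# Extra-code factor implied by the observed 6 to 7 hour builds.
implied_low = 6 / doubled
implied_high = 7 / doubled

print(f"after doubling flavors: {doubled} h")
print(f"implied extra-code factor: {implied_low:.1f}x to {implied_high:.1f}x")
```

Under that linear assumption, doubling the flavors alone accounts for 5 hours, and the remaining 1 to 2 hours would correspond to roughly 20% to 40% more work per flavor.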

marcan commented 1 year ago

Just to be clear: enabling Rust in kernel builds should have a negligible impact on build times. It is an insignificant amount of code compared to the rest of the kernel, and all kernel Rust code takes less than one minute to build with -j1.

If flipping Rust on caused our COPR kernel builds to be measurably slower, my understanding is that it must be because the builders are so ridiculously undersized right now that they are already running into thrashing issues, and rustc's moderately higher peak memory usage vs. gcc is causing an even more pathological situation there.

praiskup commented 1 year ago

I think we can finally close this request as redundant.

We moved the default x86 machines from i4i.large to c7i.xlarge, which performs the x86_64 builds roughly twice as fast. But we do not use those normal EC2 builders as long as we can handle the throughput with our 4 x86 hypervisors (cloud cost saving, so "normally" nothing changes here). See the discussion in #2241, which is a duplicate of this one anyway.

We enabled the powerful builders for the Asahi project(s) in #2966, which should handle the build in less than 40 minutes. The overall build takes longer than that because of the keygen slowdown (#2757), but that is a separate issue.

Please reopen or at least feel free to comment. Happy building!