dotnet / source-build

A repository to track efforts to produce a source tarball of the .NET Core SDK and all its components
MIT License
264 stars 129 forks source link

build-server shutdown causes issues in the VMR orchestrator #4175

Open ViktorHofer opened 6 months ago

ViktorHofer commented 6 months ago

We use the build-server shutdown CLI command in the VMR orchestrator after each repo build so that we can clean the repository's artifacts (including the package cache). There are two issues with that appraoch:

  1. dotnet build-server shutdown will only shutdown the compiler server from its own install: https://github.com/dotnet/sdk/issues/10358#issuecomment-507119296. Repositories that use the toolset compiler package can't be cleaned-up correctly as the compiler assemblies in the package cache are locked.
  2. When enabling parallel build in the VMR the build-server shutdown execs could cause issues, hopefully just perf and not stability.

Is there a better way to handle this without killing all build-servers and only for the process tree spawned per repository? If not, should we disable shared compilation completely? What if we would disallow toolset compiler packages entirely and require the VMR to build with the compilers within the SDK? While that might me possible at some point, there are times when i.e. runtime depends on a newer language feature than what's in a given SDK.

I meanwhile worked around is for razor which was hitting this issue by disabling the compiler toolset package: https://github.com/dotnet/installer/pull/18776

cc @jaredpar @mmitche @dotnet/source-build-internal

mthalman commented 6 months ago

Here's some history on the use of build-server shutdown: https://github.com/dotnet/installer/pull/15488

mmitche commented 6 months ago

Repositories that use the toolset compiler package can't be cleaned-up correctly as the compiler assemblies in the package cache are locked.

This is surprising. In source-only modes it's pretty important to not use the toolset package to avoid prebuilts. I generally think we should not allow use of the toolset package in all VMR modes. Where were you seeing that?

mmitche commented 6 months ago

I think the right thing to do is probably propagate the remotes from the outer->inner

jaredpar commented 6 months ago

Soapbox Start I honestly think that the build-server shutdown approach is the wrong direction overall. What we actually desire is a dotnet build command which, upon completion, terminates all of its child processes. This type of build end event that can be used for tasks to clean up and shut down gracefully. The build-server shutdown command is instead more of a dotnet build didn't do the right thing so lets clean up after it.

Imagine for a second if we didn't need to have a bunch of places in our infra where we called dotnet build-server shutdown , kill VBCSCompiler, plumbing through -nr:false or ExitWithExitCode (then hours plumbing $ci and $prepareMachine through the build scripts and YML files). What if instead dotnet build was a self contained event? Soapbox End

I'm not sure what the design of build-server shutdown is anymore (been a while since we implemented it). If someone can point to the docs we can see why it doesn't work for the toolset compiler.

ViktorHofer commented 6 months ago

This is surprising. In source-only modes it's pretty important to not use the toolset package to avoid prebuilts. I generally think we should not allow use of the toolset package in all VMR modes. Where were you seeing that?

I.e. when enabling razor: https://github.com/dotnet/installer/pull/18776. See the conversation in there and the builds that I queued. At the end I just disabled the toolset package as I wasn't able to kill the actual process that still had a lock on the toolset compiler assemblies.

I'm not sure what the design of build-server shutdown is anymore (been a while since we implemented it). If someone can point to the docs we can see why it doesn't work for the toolset compiler.

@marcpopMSFT would you know who owns the build-server CLI frontend and the protocol to the compiler servers?

jaredpar commented 6 months ago

@baronfel may know where it is too.

baronfel commented 6 months ago

There's not really a unifying protocol for the different build servers - there's just an interface with a void Shutdown() method that we have a few different implementations of

There's not a ton of commonality.

SDK generally owns the frontend here, though I expect different teams have had the source ideas about how to shut down their own servers.

ViktorHofer commented 6 months ago

Even if we would teach Roslyn's shutdown implementation to kill PIDs based on files on disk (what razor does), we would still shutdown all of the repo build servers, right?

For the VMR, we are building multiple repositories in parallel with a single SDK (single dotnet) and only want to terminate the running PIDs that were triggered for the individual repo build.

Example: roslyn and runtime build in parallel and when roslyn finishes, we only want to kill the PIDs that were opened for the roslyn repo build and not the ones currently running for the runtime repo build. Is that somehow possible today with build-server? If not, should we disable shared compilation completely? Alternatively, what if we just look for the open handles and kill the corresponding PIDs ourselves?

baronfel commented 6 months ago

Today, unless Roslyn and Runtime were using different SDKs (different layouts on disk) then we have no way to shutdown just the servers for each individual repo using the build-server shutdown command. If this is a recurring problem we could design some extensions to build-server shutdown that would accept some discriminator for the PIDs/servers and let you specify a set to terminate I suppose, but we haven't really thought about this space in detail for a while so we don't have any designs ready-to-go.

agocke commented 6 months ago

When enabling parallel build in the VMR the build-server shutdown execs could cause issues, hopefully just perf and not stability.

Roslyn's version is perfectly safe. You can kill it an any time and it will only impact perf.

agocke commented 6 months ago

Is there a better way to handle this without killing all build-servers and only for the process tree spawned per repository?

I think the long-term solution is that we shouldn't have to be cleaning up extra artifacts. The VMR should be reasonably efficient once we get used to building the product in one big run. I don't see the need to do intermediate cleanup.

jaredpar commented 6 months ago

There is a list of files in the user's home directory that track PIDs and dll paths for existing Razor build servers

Was PID recycling considered in this design?

MSBuild delegates the shutdown entirely to the engine loaded from the current SDK.

Is there a way that tasks can hook that engine? Like can I register an IBuildServerOwner factory where I can control the shutdown in the same code that launched the compiler server?

MichaelSimons commented 6 months ago

Is there a better way to handle this without killing all build-servers and only for the process tree spawned per repository?

I think the long-term solution is that we shouldn't have to be cleaning up extra artifacts. The VMR should be reasonably efficient once we get used to building the product in one big run. I don't see the need to do intermediate cleanup.

IDK, I see this as more of a repo level problem, there is a lot of content in the individual repo artifacts directories (e.g. 34GB in a linux Source build). Distro maintainers were hitting size limits in their build systems which is what prompted the changes to optionally clean after each repo build.

agocke commented 6 months ago

As a long-term goal we probably want to build the VMR as fast as possible and generating extraneous artifacts in the repo builds probably impedes the perf goals. I expect we should get rid of that stuff through natural optimization.

rainersigwald commented 6 months ago

MSBuild delegates the shutdown entirely to the engine loaded from the current SDK.

Is there a way that tasks can hook that engine? Like can I register an IBuildServerOwner factory where I can control the shutdown in the same code that launched the compiler server?

No but I think it's reasonable to want and build this.

ViktorHofer commented 5 months ago

I honestly think that the build-server shutdown approach is the wrong direction overall. What we actually desire is a dotnet build command which, upon completion, terminates all of its child processes. This type of build end event that can be used for tasks to clean up and shut down gracefully. The build-server shutdown command is instead more of a dotnet build didn't do the right thing so lets clean up after it.

Here's another recent incident that highlights why we really need this.

When building the product from source, analyzer assemblies that target the compiler API either use the N-1 or live version of it. N-1 when the repo doesn't depend on the roslyn repo being built first and live when it does. That means that the compiler used during the build (which loads these live built analyzer assemblies) must match. Here's what happens when it doesn't:

CSC : error CS9057: The analyzer assembly '/vmr/src/sdk/artifacts/sb/package-cache/microsoft.aspnetcore.analyzers/9.0.0-preview.4.24177.3/analyzers/dotnet/cs/Microsoft.AspNetCore.Analyzers.dll' references version '4.11.0.0' of the compiler, which is newer than the currently running version '4.10.0.0'. [/vmr/src/sdk/src/WebSdk/Web/Tasks/Microsoft.NET.Sdk.Web.Tasks.csproj::TargetFramework=net9.0]

from https://github.com/dotnet/installer/pull/19221

Source-build doesn't enforce the use of the N-1 / live compiler toolset package which results in the nonmatch. If a repository doesn't explicitly use the compiler toolset package, the compiler from the SDK is used. Whenever the compiler API revs its major/minor version and the compiler toolset package isn't used, the build starts to fail.

If source-build would now starts enforcing the compiler toolset package then the current build-server shutdown approach in the VMR would stop working: https://github.com/dotnet/dotnet/blob/73e529885113639d84f71652c93a9acacb808719/repo-projects/Directory.Build.targets#L521

I wonder how difficult it would be to design the protocol to achieve what Jared mentioned above.

jaredpar commented 5 months ago

I'm very open to ideas on how to solve this. Spent a bit of time trying to reason this out but can't find a way that is sufficiently robust to make me want to use it.

The basic problem is that we need a way to identify either the VBCSCompiler processes or the location they were launched from. Ideally the processes though cause killing those is more robust. The ideas I've considered are the following:

  1. Have every server write a key to a place on disk. There are multiple keys I could use (named pipe name, location, etc ...). The build-server shutdown could then use that to kill the launched processes. Problems: a. Ask three runtime devs where you can validly write a file on Unix and you will get four different answers b. How do you avoid polluting the user file system over time?
  2. Have msbuild implement a "build start / stop" event and just kill VBCSCompiler on build stop. Basically make it a one shot deal.

Think there is potential for (1) to be a solution but can't get any kind of consensus on where a valid location for writing files is. Also it would only be a solution going forward.

akoeplinger commented 5 months ago

Have every server write a key to a place on disk

This is actually what the razor server does already: https://github.com/dotnet/sdk/blob/main/src/Cli/dotnet/BuildServer/RazorServer.cs

It uses a path in the user profile: https://github.com/dotnet/sdk/blob/main/src/Cli/dotnet/BuildServer/BuildServerProvider.cs#L72

edit sorry, just realized this was already mentioned earlier in the thread 😄

baronfel commented 5 months ago

One semi-serious proposal for this - to avoid bloating use a sqlite db for the data?

jaredpar commented 5 months ago

This is actually what the razor server does already:

They also write PIDs which means it's subject to PID recycle issues.

It uses a path in the user profile

Understood but when you ask runtime team if this is the right decision you get my "ask three people and you get four answers" problem.

akoeplinger commented 5 months ago

Right, I just wonder given the Razor precedent whether that solution is "good enough" to make progress even though it doesn't cover every obscure setup. There is an env var to override the PID file location as an escape hatch.