cps-org / cps

Common Package Specification — A cross-tool mechanism for locating software dependencies
https://cps-org.github.io/cps/

How are build systems supposed to choose configurations of transitive dependencies? #27

Open brauliovaldivielso opened 6 months ago

brauliovaldivielso commented 6 months ago

Say we have a package A whose component A:A depends on component B:B, and B:B provides two configurations, B-config1 and B-config2.

And now say that the user of a CPS-compatible build tool wants to build their project against A:A. The build tool will have to transitively process B:B, and will have to choose whether it consumes the B-config1 configuration or the B-config2 one. How do you envision that tools should do this? I can think of three different approaches:

The specification doesn't necessarily have to prescribe one of these three options. The current version allows for the three, and it would be a matter of how the ecosystem grows around it. Perhaps the three can coexist for different use-cases.

But maybe you already have something in mind, and it could be useful to provide guidance in the site.

bretbrownjr commented 6 months ago

@drodri has already been discussing the concept of "dependency maps"... the build system would take logical but perhaps imprecise requirements and turn them into a coherent set of physical dependencies to use. That is, precise CPS components to use. I'm not sure whether we need a spec for how to export and share CPS dependency maps, or whether we need one yet, but it's definitely a useful concept to frame design around.

At any rate, I think broader dependency specifications are mostly the right way to go. In CPS data and in files under source control (i.e., build configurations, package metadata, etc.), we want to depend on a broad foobar or foobar:unspecified instead of foobar:flavor1 or foobar==v1.2.3 or whatnot. It's OK to have some version ranges or conflicts declared to assist downstream users in avoiding known incorrect maps of packages, but I would consider that an optimization and not a core requirement as such.

I agree that the CPS files and the build system need to understand one another. I believe the simplest way to do that is to just use the same, broad name in all places. Though it's possible, using the "dependency map" concept, to have rules in various places to remap one name to another as needed. I expect we'll need this for use cases like zlib/libz/z and such. It's possible to model renaming with CPS files that depend on other CPS files, but we do want people to converge on consistent spelling eventually, I believe.

memsharded commented 6 months ago

Hi @bretbrownjr

Please use my @memsharded handle; it is the one I am using now. Thanks very much for the ping!

I think the main issue with configurations in the upstream is the existence of conflicts when there are diamond dependency graphs: for example, a project that depends on both pkgb and pkgc, where pkgb was built against one configuration of pkgd and pkgc against another.

For normal cases, it is typically impossible to link 2 different configurations of pkgd, so this would be a configuration conflict. My proposal is that this kind of conflict detection and resolution does not belong to the build system step at which the CPS specification is targeted. Instead it belongs to the "dependency resolution" system, which in my view of CPS could even be the user manually building things from source. When CPS files are fed into the consumer build system, they should have already been reconciled/converged/resolved to one common configuration.

I am not saying the configuration conflict problem shouldn't be addressed, but this problem is common to all build systems, and as such it is better if we don't make every build system re-implement the same logic, especially when some build systems (other than CMake, which is leading this) will probably take much longer to implement it, and that could slow down adoption of CPS a lot. Especially taking into account that if we factor in the possibility of handling different versions through open version ranges, and that different versions can (and will) have different configurations, then the problem becomes NP-hard, and this can get challenging as dependency graphs grow larger and larger.

On the other hand, if there is a reasonable mapping from CPS files to the different build systems, which we know is possible because we have prior experience with it (MSBuild .props files, Meson toolchain files, PkgConfig files, etc.), then this could help a lot toward faster adoption of CPS.

I have been trying to allocate some time to do some work in this repo, and this is one of the first discussions I wanted to open. Very good point to bring up, @brauliovaldivielso.

brauliovaldivielso commented 6 months ago

Yeah, I don't disagree that version resolution, config resolution, etc. could all belong in the package manager (or with the user, if they are crafting their own mutually-consistent library distribution manually). The package manager would just give the build system a mutually-consistent set of CPS files that does not require any smart resolution by the build system.

But just to make sure I understand how far you want to go with that idea, @memsharded, what do you think of static vs shared CPS configurations? I think one of the main use-cases for having a notion of configuration in CPS is precisely for a CPS unit to support having a static build and a shared build for users to choose. In your worldview, would it be the user that tells the package manager "I want my project to link statically against all my dependencies", and then the package manager would download/build everything it needs and create a bunch of CPS files where components do not have any configurations? (and the default configuration is set to point to a static archive).
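In that worldview, the CPS emitted for each dependency might be as simple as the following sketch (a hypothetical libdep package; field names approximate the published CPS examples rather than quoting the spec):

```json
{
  "name": "libdep",
  "cps_version": "0.13",
  "components": {
    "libdep": {
      "type": "archive",
      "location": "@prefix@/lib/libdep.a"
    }
  }
}
```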

When CPS files are fed into the consumer build system, they should have already been reconciled/converged/resolved to one common configuration.

Because if this is true, do you see any value in having a notion of configuration in CPS?

Edit: though to be fair, I could see how one could want to link against different configs (i.e. static vs. dynamic) in the same build of a repository. Say that I have a monorepo that builds an executable (./service-executable) that I deploy to my company's servers, and another executable that I distribute to my users (./user-executable). And say that both of those link against some libdependency. Because I have good control over my servers, I might want to link service-executable dynamically against libdependency, but I might prefer to make a self-contained executable for my users, so I would link user-executable statically against libdependency. There's no diamond problem in this case because service-executable and user-executable are different executables, but the package manager can't hand out to the build system a bunch of CPS files with all the configurations already unified/resolved.

memsharded commented 6 months ago

But just to make sure I understand how far you want to go with that idea, @memsharded, what do you think of static vs shared CPS configurations? I think one of the main use-cases for having a notion of configuration in CPS is precisely for a CPS unit to support having a static build and a shared build for users to choose.

This is exactly the type of conflict that will be nasty to resolve, because the binary artifacts of pkgc and pkgb created against different shared and static configurations of pkgd will be radically different, and it is too late for them to be reconciled by any a posteriori tool like a build system. The only way is something that allows the user to explicitly resolve the conflict by saying "I want pkgd as shared (or static)", and that will necessarily force installing/building/downloading the necessary variants of pkgc and pkgb.

One of the other problems with configurations is the combinatorial explosion of configurations down the graph. pkgc and pkgb would inherit and extend (in a multiplicative sense) the configurations of pkgd, so they would have configurations shared-with-pkgd-static, static-with-pkgd-static, shared-with-pkgd-shared, static-with-pkgd-shared. And that is the first level of dependencies, and just 1 dimension (shared/static). Now if we factor in the build type, multiply those by at least 2 build types per level; that is 4 configurations per package for just this example, for a total of 16 configurations at the first pkgc and pkgb level. While it is true that the problem can be simplified in some cases, like Windows for example, where only the same build type can be linked together, this is not a general rule.
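A quick sketch of that arithmetic (package and dimension names are hypothetical; this just counts the cross product):

```python
from itertools import product

linkage = ["shared", "static"]      # 1st dimension
build_type = ["Release", "Debug"]   # 2nd dimension

# pkgd's own configurations: 2 x 2 = 4
pkgd_configs = list(product(linkage, build_type))

# Each pkgc configuration also has to record which pkgd variant it was built
# against, so the first level alone already yields 4 x 4 = 16 pkgc variants.
pkgc_variants = list(product(product(linkage, build_type), pkgd_configs))

print(len(pkgd_configs))   # 4
print(len(pkgc_variants))  # 16
```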

So in our experience, the only way to manage this combinatorial explosion is the "imperative" way: the user decides. "I want my dependencies, all of them in Release mode, and all as shared libraries", and that is what gets built/installed, and the correct CPS files with the correct information are passed to the build system.

Because if this is true, do you see any value in having a notion of configuration in CPS?

Absolutely yes. CPS files are to be used by the whole ecosystem. The fact that build systems shouldn't be the ones dealing with conflict resolution for configurations doesn't mean that the config information shouldn't be there. Packages can have configuration information in them. The problem is trying to have multiple configurations inside the same CPS, which in my opinion could be extremely problematic to make work at scale.

Edit: though to be fair, I could see how one could want to link against different configs (i.e. static vs. dynamic) in the same build of a repository. Say that I have a monorepo that builds an executable (./service-executable) that I deploy to my company's servers, and another executable that I distribute to my users (./user-executable). And say that both of those link against some libdependency

You have 2 dependency graphs in this case:

service-executable -> libdependency (shared)
user-executable -> libdependency (static)

In the same way you can define different target_link_libraries(service ... pkga pkgb...), target_link_libraries(userexe ... pkgj pkgk...), the things that are linked into an executable can be different, and different CPS files can be given to the different built targets. This is more or less where I think a mapping is necessary, because that is the only way it could scale with the flexibility needed to cover these not-so-straightforward cases.

brauliovaldivielso commented 6 months ago

Packages can have configuration information in them. The problem is trying to have multiple configurations inside the same CPS, which in my opinion could be extremely problematic to make work at scale.

Well, to the best of my understanding, CPS's notion of configuration involves putting multiple configurations inside the same CPS. See this example, where the foo component has two configurations (Static and Shared) and the foo-static component also has two configurations (Release and Debug) and all of those are part of the same CPS. (Though putting different configurations in separate CPS files is also allowed).
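For reference, a rough sketch of a single CPS file carrying two configurations of one component (illustrative only; field names follow the CPS examples approximately, see the linked example for the exact schema):

```json
{
  "name": "foo",
  "cps_version": "0.13",
  "components": {
    "foo-static": {
      "type": "archive",
      "configurations": {
        "release": { "location": "@prefix@/lib/libfoo.a" },
        "debug": { "location": "@prefix@/lib/libfoo_d.a" }
      }
    }
  }
}
```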

In the same way you can define different target_link_libraries(service ... pkga pkgb...), target_link_libraries(userexe ... pkgj pkgk...), the things that are linked into an executable can be different, and different CPS files can be given to the different built targets.

So in your view, my package manager would give my build system two different CPS files for pkgd. One of those CPS files would be for the static pkgd, and the other CPS file would be for the shared pkgd. And in my build system, I would have to spell out which of the two I want for each executable that I'm building?

I don't immediately see how this is better than having two configurations (static/shared) inside the same CPS file that describes pkgd. It'd be interesting to see a more fleshed-out description of how this mapping would work, and how it relates to the problems above.

memsharded commented 6 months ago

Well, to the best of my understanding, CPS's notion of configuration involves putting multiple configurations inside the same CPS.

Yes, this is our view from our experience, not necessarily aligned with the current CPS state in the repo. We still need to find some time to contribute and start discussing these points with everyone. I think @bretbrownjr mentioned me because of this.

all of those are part of the same CPS. (Though putting different configurations in separate CPS files is also allowed).

This is more or less our point, that they should be in different CPS files, not in the same CPS. It might look like a small style thing on the surface, but in our experience it is not.

So in your view, my package manager would give my build system two different CPS files for pkgd. One of those CPS files would be for the static pkgd, and the other CPS file would be for the shared pkgd. And in my build system, I would have to spell out which of the two I want for each executable that I'm building?

I don't immediately see how this is better than having two configurations (static/shared) inside the same CPS file that describes pkgd.

Let me forward the question to you: how would a CMakeLists.txt that defines such linkage look? In particular, for this problem:

service -> pkgc -> pkgd:static
userexe -> pkgc -> pkgd:shared

pkgc doesn't have any components; let's say it is always a shared library. Having target_link_libraries(service pkgc) and target_link_libraries(userexe pkgc) will not be good enough. Defining target_link_libraries(service pkgc pkgd::static) might unnecessarily introduce over-linking.

Let me also propose a very similar yet different scenario. You can perfectly well have a monorepo for service and userexe where, for management, historical, business, or other reasons, you need service to use pkgd/1.0 and userexe to use pkgd/1.3.

I know it is less than ideal, but I have seen this many times in many orgs, from small ones to some of the largest C++ companies in the world. There is also no reason, in my opinion, to limit the capabilities of a CPS specification so that it cannot support this case, and in practice, from the implementation point of view, it is basically the same problem as managing different configurations.

Bringing versions into the picture also shows the combinatorial problem of configurations. Let's consider this:

service -> pkgc (shared) -> pkgd:1.0 (static)
userexe -> pkgc (shared) -> pkgd:1.3 (static)

Even if we don't consider the case of a monorepo with a single build, let's just consider service and userexe now as separate, independent projects that we are building on the same machine, same compiler, same everything. Then we have a problem with pkgc, because we need two different binaries (two different builds from source) of it, one linking pkgd:1.0 and the other linking pkgd:1.3. Are we going to have 2 configurations in pkgc called with-pkgd-1.0 and with-pkgd-1.3?

So in my opinion, at this point it is better to leave this outside of the CPS specification and focus on the most basic, well-known, and yet massively useful things: being able to define include paths, lib paths, link libraries, preprocessor definitions, etc.

The mapping, conceptually, would be the result of defining the dependency graph: if we have the above, then we are defining 2 sets of inputs (user inputs; it is the end user that defines what they want):

service: pkgc (shared), pkgd/1.0 (static)
userexe: pkgc (shared), pkgd/1.3 (static)

And each definition of inputs would define a mapping that can later be used in the build scripts.

I think it is too early to try to give this mapping a more thorough description; it is better to share the potential issues, problems, and proposals, discuss with everyone, and then probably try to put together some proof of concept in collaboration.

brauliovaldivielso commented 6 months ago

Let me forward the question to you: how would a CMakeLists.txt that defines such linkage look? In particular, for this problem:

service -> pkgc -> pkgd:static
userexe -> pkgc -> pkgd:shared

OK, so if I understand your mapping proposal correctly, the process would be something like (roughly):

  1. The user tells their package manager to install pkgc, pkgd:static and pkgd:shared.
  2. The package manager does all the builds/downloads it needs and generates/gets three CPS files, one for each package.
  3. Either the user tells their build system where the generated CPS files are located, or the package manager generates a toolchain file for the build system, or in some other way the build system becomes aware of where it has to look for the CPS files.
  4. The user, in their CMakeLists.txts, just writes target_link_libraries(service pkgc) and target_link_libraries(userexe pkgc).
  5. However, crucially, the user also has to provide a mapping that goes from their (in this case) applications to a list of the transitive dependencies of those applications. In pseudojson, the mapping would be something like { service: [pkgc, pkgd:static], userexe: [pkgc, pkgd:shared] } (spelled out a bit more after this list).
  6. The build system will also be aware of this mapping, and will use it to find the relevant CPS files to get the right artifact locations, flags, ... when building each of the executables.
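Spelled out a bit more, and purely as a hypothetical format (the thread has not settled on any mapping syntax), the mapping from step 5 might look like:

```json
{
  "service": ["pkgc", "pkgd:static"],
  "userexe": ["pkgc", "pkgd:shared"]
}
```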

Is this what you have in mind?

Let me also propose a very similar yet different scenario. You can perfectly well have a monorepo for service and userexe where, for management, historical, business, or other reasons, you need service to use pkgd/1.0 and userexe to use pkgd/1.3.

Yes, to be fair, the way I thought about it was that a given package manager would create, say, a directory full of CPS files that are known to be mutually compatible. So that folder would have a CPS file for pkgd/1.3 or pkgd/1.0, but not both, and so we would only need a single build of pkgc (but possibly also static and dynamic configurations, or other configurations, to the extent they don't lead to ABI issues). This requirement of consistency doesn't just apply to versioning, but also, for instance, to the preprocessor definitions that pkgd and pkgc were built with.

The build system would be pointed to that mutually-compatible CPS directory curated by the package manager, and the build system would only look in it. So if you have a CPS file for the other version of pkgd somewhere on your disk, it wouldn't matter, because your build system is not looking there. (Something has to be done for system libraries; maybe the package manager creates CPS stubs for them in that mutually-compatible CPS directory. I don't know, but it's not super relevant right now.)

If the user wants to build service (transitively) with pkgd/1.3, and userexe (transitively) with pkgd/1.0, then they'd need two different invocations of the package manager to create two different mutually-compatible CPS distributions/folders (one with a pkgc that was built with pkgd/1.3; the other one with a pkgc that was built with pkgd/1.0). And then they'd do two different build system invocations, one for service where they point their build system to the pkgd/1.3 CPS distribution, and one for userexe where they point their build system to the pkgd/1.0 CPS distribution.
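Concretely, those two package manager invocations might produce two separate, self-consistent CPS folders, for example (layout purely illustrative):

```
deps-service/cps/    # resolved with pkgd/1.3
    pkgc.cps         # describes the pkgc build made against pkgd/1.3
    pkgd.cps         # describes pkgd/1.3
deps-userexe/cps/    # resolved with pkgd/1.0
    pkgc.cps         # describes the pkgc build made against pkgd/1.0
    pkgd.cps         # describes pkgd/1.0
```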

Now I know this may be too rigid in some cases (I remember an example you brought up where some people might want to link both versions of pkgd into the same executable, as long as they are careful that no objects from one version are passed to the other one, and the visibility of the symbols is configured correctly). My mutually-consistent view of CPS "distributions"/"directories" is not flexible enough to support that use-case. (The CPS specification as it is is flexible enough to support something like this, I believe, but I don't know what the workflow would be like for the user)

memsharded commented 6 months ago

Is this what you have in mind?

Yes, I guess something along those lines. I am not even thinking yet about the mapping format and syntax, but yes. Maybe just a minor clarification about "The package manager does all the builds/downloads it needs and generates/gets three CPS files, one for each package": I think it might actually be 4 CPS files, as there will be 2 CPS files for pkgc, one for each different build.

Put simply, I think that every from-source build of a library for a given configuration should output its binaries, like pkgc.lib or libpkgc.a, in a different folder, so that builds of pkgc against different versions or configurations of dependencies don't collide and overwrite each other. And it should similarly output a CPS file for that build that describes those binaries.
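As a hypothetical layout (folder and file names invented for illustration), that could look like:

```
builds/pkgc-with-pkgd-static/
    libpkgc.so    # pkgc built against the static pkgd
    pkgc.cps      # CPS file describing exactly this build
builds/pkgc-with-pkgd-shared/
    libpkgc.so    # pkgc built against the shared pkgd
    pkgc.cps      # CPS file describing exactly this build
```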

Then the mapping is responsible for giving the build system the right folder: for one executable it will give the folder with one build of pkgc (and its CPS file), and for the other executable it should use the other binaries of pkgc with their own CPS file.