bazelbuild / rules_python

Bazel Python Rules
https://rules-python.readthedocs.io
Apache License 2.0
538 stars 542 forks source link

Generate one Bazel repo per dependency #2370

Open mering opened 2 weeks ago

mering commented 2 weeks ago

πŸš€ feature request

Relevant Rules

pip.parse() module extension

Description

When the same pip package is (transitively) depended upon multiple times from different requirements.txt files, it is undefined which version will be chosen (both are added to the PYTHONPATH and the order of deps decides which one takes precedence). This is especially problematic for transitive dependencies of module dependencies where one only has limited to no influence of which version is chosen. This is especially problematic for fundamental libraries like Protobuf which gets pulled in transitively in many requirements.txt. The way this is handled in golang, is that every dependency is its own Bazel repo and as such takes part in version resolution. This makes sure that every package only exists once within a given Bazel workspace.

Describe the solution you'd like

Generate one Bazel repo per pip dependency.

Describe alternatives you've considered

For the transition period, rules_python could generate both, one repo for all of requirements.txt and one repo per dependency. This way, users can easily migrate from one approach to another. In a future major version, the old functionality could be removed.

groodt commented 2 weeks ago

I think this is working as expected. To confirm, can you please create a reproduction repository to demonstrate the issue?

mering commented 2 weeks ago

@groodt What do you mean by "working as expected"? The way it currently works what is expected from the current architecture, but the current architecture is problematic.

mering commented 2 weeks ago

Example:

Module A:

Module B:

Module `C:

The Protobuf version available in @C//:c depends on the order of its dependencies. This becomes even more confusing when there are additional levels of transitive dependencies involved.

groodt commented 2 weeks ago

The recommendation is to produce a single lockfile for your dependencies and register that in your hub repository. See https://rules-python.readthedocs.io/en/latest/pypi-dependencies.html

Mixing dependencies across dependency closures as you have in your example is indeed problematic and not supported. You may wish to explore other rules that might offer the behaviour that you desire.

mering commented 2 weeks ago

The recommendation is to produce a single lockfile for your dependencies and register that in your hub repository. See https://rules-python.readthedocs.io/en/latest/pypi-dependencies.html

Mixing dependencies across dependency closures as you have in your example is indeed problematic and not supported. You may wish to explore other rules that might offer the behaviour that you desire.

This is not possible for dependencies added from BCR which add their own Python code and are not available on pypi.org. We do not have control over these dependencies. I can find at least 170 such package versions currently: https://github.com/search?q=repo%3Abazelbuild%2Fbazel-central-registry%20pip.parse&type=code. This problem will grow every day as more packages migrate to Bzlmod and are uploaded to BCR.

rickeylev commented 2 weeks ago

Unfortunately, a canonical target for every dependency doesn't Just Work, either.

Going back to the Module A/B/C example, lets pretend there is a single repo for each dependency. We know that there are different requirements.txt files, which means there are different "concrete" implementions for a dependency (in the example, protobuf).

Which concrete implementation to use depends on the final consumer (e.g. the binary / requirements.txt), not the point of need (module A's target needing protobuf). However, an upstream target can't know what its downstream targets's constraints are. i.e., module A's target can't know whether it's module A, module B, or module C that is consuming it. Thus, the point-of-need can't specify what concrete implementation to use. Similarly, the final consumer can't arbitrarily override what every point-of-need says.

re: "[libraries] takes part in version resolution": part of how requirements.txt works is you can arbitrarily pin a dependency to a version. It's then up to the resolver to find a closure with that constraint. Solving that is outside the scope and ability of Bazel (solving a dependency closure's constraints can take several minutes sometimes). My point here is that picking which concrete implementation to use isn't simply decided based on python version or OS. It can be arbitrary and out-of-band information.

The only way to make something resembling this work is something like:

  1. Every dependency has a flag to select which concrete implementation to use
  2. Binaries accept an arbitrary number of flags they transition on to affect (1)

This isn't a very tenable design at scale, though. It's fragile, unwieldy, and leads to configuration state explosion.

The way this is handled in golang, is that every dependency is its own Bazel repo and as such takes part in version resolution. This makes sure that every package only exists once within a given Bazel workspace.

I think this is the basic point of misunderstanding -- rules_python supports multiple dependency closures in a single build. This is based on user request -- not every repo is a monorepo with a strict one-version policy as enforced within Google.

As a general rule, you can't mix "bazel dependency management" (e.g. implicit deps on library rules) with "external dependency management" (e.g. requirements.txt) when it comes to user dependencies. The two methods have separate dependency closures and aren't aware of each other, and thus can't safely interoperate.

This is not possible for dependencies added from BCR which add their own Python code and are not available on pypi.org. We do not have control over these dependencies.

In this situation, the best a BCR module can do is:

A mediocre middle ground might be to do something like:

This is, effectively, similar to the deprecated bind() function from workspace; it's a form of late binding. However, as I described above, given a repo with multiple requirements.txt, there can be conflicting answers for what the correct and neccesary binding is. (I'll note this is somewhat already possible today by having the implicit dep point to a user-defined flag, or code generated by workspace/bzlmod configuration).

aignas commented 2 weeks ago

Related issue - #2094

groodt commented 2 weeks ago

Yes, this comes up from time to time.

I'd still value a realistic reproduction in a realistic repo if thats possible please @mering? Most of the examples you've highlighted in the BCR search seem to be related to grpc? So perhaps there is a nicer way to solve this in rules_proto or similar that we can then document. The only requirement for grpc generated code and the runtime in a userspace build is that the protobuf versions are compatible. It doesn't really require that the runtime code is using a grpc dependency provided by the grpc BCR module. This is why Im continuing to ask for a realistic example that demonstrates the concrete problem, rather than discussing problems in an abstract way.

The confusion and conflation of build dependencies that are used by bzlmod modules vs application dependencies that become part of a build output is often a point of confusion. The way that build dependencies that happen to use python as part of their extensions, is that they should be isolated as @aignas points out in #2094

It often gets compared to golang dependencies, but thats really just because golang and bzlmod happen to use the same dependency resolution algorithms. Java and Python application dependency resolvers do differ from golang and indeed there isn't a reason for all languages to adopt the golang approach.

mering commented 2 weeks ago

Which concrete implementation to use depends on the final consumer (e.g. the binary / requirements.txt), not the point of need (module A's target needing protobuf). However, an upstream target can't know what its downstream targets's constraints are. i.e., module A's target can't know whether it's module A, module B, or module C that is consuming it. Thus, the point-of-need can't specify what concrete implementation to use. Similarly, the final consumer can't arbitrarily override what every point-of-need says.

I agree but would argue that the final consumer is in the best position to decide as it has all information available.

re: "[libraries] takes part in version resolution": part of how requirements.txt works is you can arbitrarily pin a dependency to a version. It's then up to the resolver to find a closure with that constraint. Solving that is outside the scope and ability of Bazel (solving a dependency closure's constraints can take several minutes sometimes). My point here is that picking which concrete implementation to use isn't simply decided based on python version or OS. It can be arbitrary and out-of-band information.

While it is true that the Bazel and the pip dependency solvers use slightly different mechanics, they are compatible enough in most cases. While not being a perfect fit, I see more problems solved when generating one Bazel repo per dependency as opposed to one Bazel repo per dependency closure.

The way this is handled in golang, is that every dependency is its own Bazel repo and as such takes part in version resolution. This makes sure that every package only exists once within a given Bazel workspace.

I think this is the basic point of misunderstanding -- rules_python supports multiple dependency closures in a single build. This is based on user request -- not every repo is a monorepo with a strict one-version policy as enforced within Google.

One of the main reasons to use Bazel in the first place is to support big (mono-)repos. Otherwise, the overhead is usually not worth it. So I would argue that most users of rules_python (including us) actually want a single dependency closure in a single build. Why would you want multiple dependency closures within a single workspace? Why don't you just create another workspace if you want separate ones?

As a general rule, you can't mix "bazel dependency management" (e.g. implicit deps on library rules) with "external dependency management" (e.g. requirements.txt) when it comes to user dependencies. The two methods have separate dependency closures and aren't aware of each other, and thus can't safely interoperate.

This is not true, see for example the golang example. People are usually using Bazel for a reason and explicitly chose to follow Bazels way of managing dependencies but then need some way to add dependencies from other ecosystems. So they should merge into a single dependency closure in order to safely interoperate.

This is not possible for dependencies added from BCR which add their own Python code and are not available on pypi.org. We do not have control over these dependencies.

In this situation, the best a BCR module can do is:

  • Expose a e.g. py_library that contains the module's code
  • These targets do not depend on (e.g. have py_library.deps edges) to code not owned by the module
  • Expose a requirements.txt snippet, or document the pypi packages needed, that a consumer needs to put in their requirements.txt

I would argue that any target should completely list its dependencies. Requiring addition of transitive packages into an unrelated consumers is an anti-pattern.

A mediocre middle ground might be to do something like:

  • Always generate e.g. an @pypi repo.
  • Allow e.g. pip.override to "register" aliases into it that point elsewhere.
  • Something like grpc then depends on @pypi//protobuf, as a way to say "I need this, but let someone else point the label to something".

This is, effectively, similar to the deprecated bind() function from workspace; it's a form of late binding. However, as I described above, given a repo with multiple requirements.txt, there can be conflicting answers for what the correct and neccesary binding is. (I'll note this is somewhat already possible today by having the implicit dep point to a user-defined flag, or code generated by workspace/bzlmod configuration). Related issue - #2094

This would require the root workspace to know about all the dependencies to manually pick a specific version (similar to the WORKSPACE model). People didn't like it and created Bzlmod as a result.


I'd still value a realistic reproduction in a realistic repo if thats possible please @mering? Most of the examples you've highlighted in the BCR search seem to be related to grpc? So perhaps there is a nicer way to solve this in rules_proto or similar that we can then document. The only requirement for grpc generated code and the runtime in a userspace build is that the protobuf versions are compatible. It doesn't really require that the runtime code is using a grpc dependency provided by the grpc BCR module. This is why Im continuing to ask for a realistic example that demonstrates the concrete problem, rather than discussing problems in an abstract way.

This problem comes up every other week in our big monorepo and usually requires days to figure out the right order of dependencies to work around if possible at all. The last occurrences have been about a transitive dependency to protobuf which hasn't even been specified in our requirements.in file but only generated via pip-compile. One dependency closure or a transitive dependency defined an older version of Protobuf which wasn't compatible with what other parts of our code base assumed. This problem appeared out of nowhere just by adding or removing a dependency or re-ordering dependencies.

The confusion and conflation of build dependencies that are used by bzlmod modules vs application dependencies that become part of a build output is often a point of confusion. The way that build dependencies that happen to use python as part of their extensions, is that they should be isolated as @aignas points out in https://github.com/bazelbuild/rules_python/issues/2094

One can argue at length on what people should be doing but I guess we have to face reality and see how users are actually using it. The more closely we align to the way Bazel handles things, the more intuitive it is for users and the more likely it is that users will use the library the way it is intended.

It often gets compared to golang dependencies, but thats really just because golang and bzlmod happen to use the same dependency resolution algorithms. Java and Python application dependency resolvers do differ from golang and indeed there isn't a reason for all languages to adopt the golang approach.

I think it doesn't bring value to discuss the advantages and disadvantages of various dependency resolvers here. But maybe the Bazel one is good enough for most use cases and the few exceptions could be covered by override mechanisms.


Why not having both? rules_python could continue to generate fixed dependency closures in a repo per dependency closure. Additionally, it could create one repo per dependency which could then use Bazel version resolution scheme which is not perfect but probably good enough for most cases. Users could then pick which repo they want to depend on. WDYT?

aignas commented 2 weeks ago

Just to add my personal take here as I have worked a lot on 3rd party dependency handling.

As @groodt mentioned, I think having a small reproducible example which highlights the short comings of rules_python would be good. So far I've only heard about short comings of protobuf bazel integration - the way python sources are provided could be the main problem here. Is there an XY problem here ref?

EDIT: there is also work by aspect in aspect_rules_py here where they have a way to resolve the dependency tree based on what the terminal targets (i.e. py_binary and py_test) specify, so this is another knob that is missing from rules_python right now, but could technically be implemented if there is willingness from the community to help drive this. See the resolutions attribute in the said rules.