bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Consider removing bind() #1952

Closed jart closed 4 years ago

jart commented 7 years ago

What is the use case for bind()? I can't think of one. Even in situations where there are multiple ABI-compatible implementations of a library (e.g. OpenSSL, BoringSSL, etc.) this problem could still be solved by using vanilla externals.

Most Bazel projects don't seem to use bind(). For the ones that do, it seems to have caused problems.

For example, the protobuf repository, rather than defining a protobuf_repositories() function, simply uses //external:foo for every single target upon which it depends, thereby punting to its users the burden of defining bind() rules not only for every single external, but for every target within those externals.

As a result, the TensorFlow workspace.bzl file has developed a cargo cult pattern where superfluous bindings will be added, because I don't think people really understand what bind() does.

What is especially suboptimal is that the bind() namespace overlaps with the external repository namespace. We can't name externals like "six" to be "@six" because the protobuf BUILD file asked us for //external:six. So we don't have a choice. We have to name it @six_archive, which hurts readability. It would have been better if the protobuf BUILD file had just asked for @six//:six.

It would be nice if we could retire bind() and help projects like protobuf migrate to the foo_repositories() model that official Bazel projects use. We could recommend as a best practice the technique that is employed by the Closure Rules repositories.bzl file.

def closure_repositories(
    omit_foo=False,
    omit_bar=False):
  if not omit_foo:
    foo()
  if not omit_bar:
    bar()

def foo():
  native.maven_jar(name = "foo", ...)

def bar():
  native.maven_jar(name = "bar", ...)

This gives dependent Bazel projects the power to schlep in transitive Closure Rules dependencies using either a whitelist or blacklist model. For example, one project that uses Closure Rules has the following in its WORKSPACE file:

http_archive(
    name = "io_bazel_rules_closure",
    sha256 = "7d75688c63ac09a55ca092a76c12f8d1e9ee8e7a890f3be6594a4e7d714f0e8a",
    strip_prefix = "rules_closure-b8841276e73ca677c139802f1168aaad9791dec0",
    url = "http://bazel-mirror.storage.googleapis.com/github.com/bazelbuild/rules_closure/archive/b8841276e73ca677c139802f1168aaad9791dec0.tar.gz",  # 2016-10-02
)

load("@io_bazel_rules_closure//closure:defs.bzl", "closure_repositories")

closure_repositories(
    omit_gson = True,
    omit_guava = True,
    omit_icu4j = True,
    omit_json = True,
    omit_jsr305 = True,
    omit_jsr330_inject = True,
)

It omits those because it directly depends on those transitive dependencies and wants to specify them on its own.

I think this is a much more desirable and flexible pattern than bind().
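To make the whitelist/blacklist idea concrete, here is a minimal Python sketch of the same control flow. It is a toy model, not Bazel code: `declared`, `declare`, and the version strings are hypothetical stand-ins for repository rules, and the duplicate check is an assumption of the model (added only to show why a caller would want to omit something), not a statement about how Bazel itself treats redefinitions.

```python
# Hypothetical stand-in for WORKSPACE state: repository name -> attributes.
declared = {}


def declare(name, **attrs):
    """Records a repository; this toy model rejects a second declaration."""
    if name in declared:
        raise ValueError("repository %r declared twice" % name)
    declared[name] = attrs


def closure_repositories(omit_foo=False, omit_bar=False):
    """Declares every transitive dependency the caller did not opt out of."""
    if not omit_foo:
        declare("foo", version="1.0")
    if not omit_bar:
        declare("bar", version="2.0")


# The dependent project pins its own "foo", then lets the macro supply the rest.
declare("foo", version="1.1")
closure_repositories(omit_foo=True)
```

Calling closure_repositories() with no arguments is the blacklist model (take everything, omit a few); passing omit_*=True for most entries approximates a whitelist.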

philwo commented 7 years ago

I personally also find bind(...) confusing and am not sure how to correctly use it.

Pinging @damienmg and @lberki for some input here, I think they know more about how this is supposed to work.

lberki commented 7 years ago

bind was indeed invented for selecting between e.g. different implementations of SSL or e.g. for different versions of GSON/Guava/... in the Closure case. I'm not sure how much use it sees, so I'd not be that trigger-happy with removing it.

damienmg commented 7 years ago

I think all bind use cases can be replaced with alias, but it does not strike me as high priority. bind() has that weird thing that it creates an //external package that does not really exist, so for that alone I would say +1.

jart commented 7 years ago

It might be worth putting a quick change in the documentation saying that bind() is or will likely be deprecated and that alias should be used instead, or simply vanilla external repository names. That should hopefully lessen any refactoring we'll have to do in the future, if it is removed.

lberki commented 7 years ago

Considering that I'm not really sure we want to do that, I'd rather not deprecate bind() just now.

jart commented 7 years ago

Then would Bazel at the very least be open to a documentation change dissuading people from using it?

robertcrowe-zz commented 7 years ago

It looks like this could be related to a problem I'm trying to solve. When I try to build my Tensorflow Serving client I'm getting this:

bazel build //FFv1:client
ERROR: /home/rcrowe/.cache/bazel/_bazel_rcrowe/b1867dd6bfd6249a885c91482eccde46/external/org_tensorflow/tensorflow/workspace.bzl:80:3: no such package '@six_archive//': In new_http_archive rule //external:six_archive the 'build_file' attribute does not specify an existing file (/home/rcrowe/serving/six.BUILD does not exist) and referenced by '//external:six'.
ERROR: Analysis of target '//FFv1:client' failed; build aborted.
INFO: Elapsed time: 1.286s

Any idea how I can fix this? I'm on Ubuntu 14.04

jart commented 7 years ago

@robertcrowe Did you modify the workspace.bzl file? Can you start a separate issue for this and CC me?

robertcrowe-zz commented 7 years ago

@jart - Thanks, I created #1963

kchodorow commented 7 years ago

+1 to removing bind, updating the docs in the meantime seems like a good idea.

jart commented 7 years ago

I reviewed the code of rules_web (https://github.com/bazelbuild/rules_web) yesterday, which was recently introduced and makes extensive use of bind(). It uses bind() to allow the user to override executable attributes on its Skylark rules (https://github.com/bazelbuild/rules_web/blob/master/web/internal/web_test_config.bzl#L63) without having to repeat them every time the rule is used. I had a discussion with @DrMarcII about this. We both came to the conclusion that it would be better if those attributes were public and the user defined macro wrappers that customize the attribute.

I'm glad we're building consensus around removing bind(). Right now it's the first rule listed in the documentation, but it's actually the last rule we'd want to use. The same is true for git_repository(), which is also listed first, but has the biggest negative impact on performance for the project in question, and all dependent projects. Many users make use of these rules without considering the alternatives, like grabbing the snapshot tarball from GitHub. Whatever we can do to help the user make the correct choices, especially if we make the wrong choices impossible, is going to go a long way to fostering a healthy and lightning fast build ecosystem spanning many GitHub projects that all reference each other.

damienmg commented 7 years ago

The order of the rules is a bit random; I don't think it matters.

Anyway, IMO:

  1. We should advertise bind as deprecated and point users to alias, or to using the repository name instead.
  2. We should start advertising maven_jar and git_repository as deprecated and point users to the http* rules (for source distributions) and the Skylark implementations (for those for whom git / Maven support is better).
  3. We should publish a best practice document about the good use of external repositories. A blog post maybe?

steven-johnson commented 7 years ago

On Thu, Oct 20, 2016 at 12:54 AM, Damien Martin-Guillerez < notifications@github.com> wrote:

  1. We should publish a best practice document about the good use of external repositories.

+1

A blog post maybe?

Sure, but please also mirror it into the official docs somehow; people arriving later may never see the blog post.

ittaiz commented 7 years ago

@damienmg why are maven_jar and git_repository being deprecated? Additionally, the whole "new" prefix is a bit misleading IMHO, since from this doc (https://www.bazel.io/versions/master/docs/external.html#depending-on-non-bazel-projects) I got the impression that local_repository and http_archive have a different use case than new_local_repository and new_http_archive, rather than the new_* rules simply replacing them.

damienmg commented 7 years ago

They are being replaced with skylark implementation from @bazel_tools//tools/build_defs/repo:git.bzl and @bazel_tools//tools/build_defs/repo:maven_rules.bzl

new_* is indeed not the replacement version

aj-michael commented 7 years ago

I don't see it mentioned in this thread, so I thought I should mention it: the AndroidSdkRepositoryFunction makes use of bind so that android_binary can depend on the android_sdk generated by AndroidSdkRepositoryFunction without needing to know the name of the android_sdk_repository rule.

kchodorow commented 7 years ago

Which I think is a mistake, as I think I've mentioned before. The android_sdk rule still depends on the external name, which is no better than depending on the repository name.

aj-michael commented 7 years ago

Sorry, what do you mean by "the android_sdk rule still depends on the external name"? Do you mean that the name of the android_sdk rule is the name of the android_sdk_repository rule? Or that the "bound" name refers to the external name?

kchodorow commented 7 years ago

Whoops, I mistyped, I meant the android_binary rule. The android_binary rule depends on something like //external:android_sdk, it would be better for it to depend on @android_sdk//jar or something.

aj-michael commented 7 years ago

Hmmm, why would that be better? That would require every developer to name their android_sdk_repository "android_sdk". The current advantage of using //external:android/sdk is that android_binary still works no matter what you name your android_sdk_repository.

I agree with you that android_binary depending on an external bind is not ideal. Maybe this is not the right place to discuss this, but could we remove the name attribute of android_sdk_repository? If we could hardcode the name of android_sdk_repository to something, then we could stop using bind in Android land.

kchodorow commented 7 years ago

You are already requiring every developer to declare an android_sdk_repository and bind it to a certain name: this eliminates one of those steps.

Also, you could make it a macro and set the name in the macro, e.g.,

def my_android_sdk():
  native.android_sdk(
    name = 'android_sdk',
    ...
  )

aj-michael commented 7 years ago

I don't think we are. The developer does not have to use a bind() in their WORKSPACE. We create the bind under the hood: https://github.com/bazelbuild/bazel/blob/0f0f383e8c5b51258157f7c34368f2209d870854/src/main/java/com/google/devtools/build/lib/bazel/rules/android/AndroidSdkRepositoryRule.java#L57.

E.g., the following works:

$ cat WORKSPACE
android_sdk_repository(
    name = "foobar",
    path = "/home/ajmichael/Sdk",
    api_level = 25,
    build_tools_version = "25.0.0",
)
$ cat BUILD
android_binary(
    name = "app",
    manifest = "AndroidManifest.xml",
    custom_package = "com.example",
    srcs = glob(["**.java"]),
)
$ bazel build //:app

I guess if we remove the user-visible bind() function but keep the underlying functionality for native repository rules, android would be fine.

I'm going to open another issue to address your comment about the macro, because I like that idea a lot. :smile:

johnynek commented 7 years ago

I'm -1 on removing or deprecating bind. Suppose you have two external bazel repository dependencies, A and B, each of which needs dependency foo. One of them expects it at @foo_label_in_a and the other expects it at @foo_label_in_b. Then I have two equivalent dependencies that look different to bazel. There is no way, nor even a standard convention as far as I can see, to name dependencies as a function of their semantic identity.

I think the right way to go here is that A expects a binding to say a_bind_of_foo and B expects b_bind_of_foo then when I depend on both, I set up the bindings appropriately.
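A rough way to picture this proposal is as a lookup table from symbolic names to concrete labels. The Python sketch below is only a toy model of that indirection: `bindings`, `bind`, `resolve`, and the label "@foo//jar" are hypothetical names chosen for illustration, and nothing here reflects how Bazel implements bind() internally.

```python
# Toy model: symbolic //external names map to whatever label the workspace picks.
bindings = {}


def bind(name, actual):
    """Points the symbolic name //external:<name> at a concrete label."""
    bindings["//external:" + name] = actual


def resolve(label):
    """Follows bindings until reaching a label that is not itself bound."""
    seen = set()
    while label in bindings:
        if label in seen:
            raise ValueError("binding cycle at %s" % label)
        seen.add(label)
        label = bindings[label]
    return label


# The consuming workspace satisfies both A's and B's expectation with one repo.
bind("a_bind_of_foo", "@foo//jar")
bind("b_bind_of_foo", "@foo//jar")
```

In this model, A and B never need to agree on a repository name; only the workspace that depends on both has to know that the two symbolic names mean the same thing.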

I wish the bazel core team had someone whose role was explicitly to advocate for the many-external-repos use case (hopefully many of them bazel). That is the use case of the vast numbers of people using most existing tools (and migration to a monorepo as a precondition to using bazel is not so workable, especially in open source: what would that even mean?).

jart commented 7 years ago

All the projects I maintain use the following algorithm for standardizing the naming convention of Maven repositories: https://gist.github.com/jart/41bfd977b913c2301627162f1c038e55

If two projects can't agree on how to name a repository, then wouldn't alias() be a viable workaround? Introducing bind() would mean you now have two problems: two projects can't agree on two parallel names for the same dependency.

johnynek commented 7 years ago

@jart I agree that when you are dealing with your own code, using one format is fairly straightforward. The problem is sharing with folks who don't agree. For instance: bazel just uses guava at a certain path. If I want to use a bazel target, bazel targets are already expecting //third_party:guava as the target. How do I alias //third_party:guava to my //3rdparty/jvm/com/google/guava? Can you show how? It looks like I can point my target to bazel, but since bazel is an external project, I can't repoint the versions it is on, so I either have to accept that version or have two versions on the classpath.

This problem becomes more pronounced when composing larger numbers of repositories.

If instead, bazel expected //external/jvm/com/google/guava to be a binding of the currently in play jar, and then you bind that to your local //thirdparty/guava in your workspace, then in my workspace, I could bind whatever version of guava I am using to the same location, and the bazel targets would see those and use them.

My actual use case is for scala thrift generation: scrooge (https://github.com/twitter/scrooge). The rules have a run-time dependency on some version, but that dependency is super weak and almost any version will do in practice. The user should be able to configure the version of scrooge they need, the bazel plugin should not set the version. I don't see a clean way currently to do this without bind. Am I missing something?

jart commented 7 years ago

The only way to have dependencies be composable across Bazel repositories that reference each other is to write files that look like Nomulus' repositories.bzl file. Then other projects can depend on Nomulus using the same technique Nomulus' WORKSPACE uses to depend on Closure Rules. You'll notice that Closure Rules' repositories.bzl has an overlapping set of dependencies with Nomulus. Hence all that omit_foo boilerplate.

I've been planning to write a piece for the Bazel blog explaining this best practice, as well as convert more projects to it, but haven't had the time. It's also quite verbose. But I guarantee you that when builds get written that way, they'll be faster and more reliable than anything else. But maybe someday we can add a feature to Bazel that makes this best practice require fewer lines.

As for twitter/scrooge, that project appears to be using Pants. Are you planning to rewrite their build in Bazel?

johnynek commented 7 years ago

The scala rules (which I help maintain) have scrooge support: https://github.com/bazelbuild/rules_scala/blob/master/twitter_scrooge/twitter_scrooge.bzl#L9

notice we have a number of versions set. That is not where they should be set. The consumer of the plugin should set the version, not the plugin itself ideally.

I don't think the only way to have composable dependencies is to lock them to one central repositories.bzl. I think you could have something like this:

the scala rules could say, I need //external:io_bazel_rules_scala_scrooge_jar etc... I list the symbolic dependencies that consumers of the rule need to set up. Then in the callers, they can bind that name to their local choice of what jar they need.

It is very challenging to satisfy all the dependency requirements in a large repo to begin with. By having external bazel plugins force more constraints it could become almost impossible.

As for using external maven deps with the same naming convention, at Stripe we are using a tool I wrote: https://github.com/johnynek/bazel-deps

It is working great. You declare a list of dependencies and it generates a lock-file of all the shas and the maven coordinates you need. So, all our local repos can work together. The friction point comes when we have external rules and deps we don't control.

To me, bind is perfectly suited to this use case. I don't see why it should be removed since it seems like it will actually work, but nothing else is there to replace it except the suggestion to centralize how dependencies are done, which I don't think is realistic in the OSS world.

jart commented 7 years ago

That repositories.bzl design is not centralized or locked in. Please take another look at the files I linked in my previous comment.

Assume I want my Bazel project to depend on Nomulus. I can do this with a whitelist or blacklist model.

If I want to use Guava in my own build rules, I just say deps = ["@com_google_guava"] and it links Guava along with its transitive dependencies. It's a global name. Every java_library in the world that wants to depend on Guava should do so using that exact same name, because the goal is to have a one version policy across repositories.

There's no need for delegate build files like this.

There's no need for bind().

pcj commented 7 years ago

The system I've been using employs a require function in conjunction with a list of deps encoded as dict objects (https://github.com/pubref/rules_maven/blob/master/maven/internal/require.bzl).

The require function tests for the existence of the rule via native.existing_rule. If it is registered, assert that the requested version matches the pre-existing value.

This way the rule can be called by multiple repositories without having to say omit_*, but the versions must still match. Otherwise, one can omit it completely with an exclude clause, or override specific fields by virtue of the dict + operator (https://bazel.build/versions/master/docs/skylark/lib/dict.html).
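The behavior described above can be sketched in a few lines of Python. This is not the code from require.bzl; it is a toy model under stated assumptions: `existing_rules` is a plain dict standing in for the lookup that native.existing_rule performs, the dep dicts carry only hypothetical "name" and "version" keys, and `{**a, ...}` stands in for the Skylark dict + override.

```python
# Toy stand-in for the rules already declared in a WORKSPACE.
# In Bazel the lookup would go through native.existing_rule(); here it is a dict.
existing_rules = {}


def require(dep, exclude=()):
    """Registers dep unless it is excluded or already present.

    Returns True if the dependency was newly registered, False if it was
    skipped. Raises ValueError when an existing entry has a different version.
    """
    name = dep["name"]
    if name in exclude:
        return False
    existing = existing_rules.get(name)
    if existing is None:
        existing_rules[name] = dict(dep)
        return True
    if existing["version"] != dep["version"]:
        raise ValueError(
            "%s: requested version %s but %s is already registered"
            % (name, dep["version"], existing["version"])
        )
    return False


GUAVA = {"name": "guava", "version": "20.0"}

require(GUAVA)                          # first caller registers it
require(GUAVA)                          # second caller: same version, no-op
pinned = {**GUAVA, "version": "19.0"}   # override one field of the dict
```

Passing `pinned` to require() at this point raises, because "guava" is already registered at a different version; that is the "versions must still match" rule.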

Whether this implementation is the optimal one is an open question, but it would be nice to have a similar blessed function within the bazel repo itself for others to use.

I don't have a strong opinion on bind, but I do think @ekuefler's setup of rules_gwt is a good example of bind done well.

The non-standardized approach to workspace names and dependencies needs clarity though, and will slow bazel adoption.

johnynek commented 7 years ago

@jart Perhaps I am not being precise enough. When I say centralized I mean this:

If you want to use Nomulus, then you accept their naming scheme for maven artifacts. It is not clear exactly what that is. Can you point to any documentation on that? That is the centralization I speak of: if we all centrally agree (and hopefully validate with tooling) on some convention, this will work. By contrast, bind allows us to retain a distributed view of the bazel target id -> maven coordinate mapping.

What exactly are the issues?

  1. The main one is that nomulus seems somewhat ad-hoc (as I imagine many systems will have ad-hoc rule violations). For instance, in nomulus the google guava project is "@com_google_guava", yet the maven coordinate is "com.google.guava:guava". Is there some rule about dropping repeated strings on the end? There are several cases where the naming is not some simple find and replace of the maven coordinate (for instance "com.google.api-client:google-api-client"). I am sorry, I didn't take the time to catalog the entire list of these, but many of them have some slight deviation from the maven coordinate naming. If you have two such repos you depend on that behave as nomulus does, what is the best practice? Just wire the jars in multiple times (duplicate external repos, once for each name)? bind allows you to work around this, since each repo does not assume its maven names are valid global identifiers.

  2. There is no standard, safe mapping from maven coordinate to bazel repository name. What people sometimes do is map any special character (like - or :) to _, but this has the potential to create collisions in the namespace. For instance "com.foo.bar:baz" is mapped to the same bazel name as "com.foo:bar-baz". This may rarely be an issue, but it can be. This should not be good enough to meet bazel's high build correctness standards. bind can help you work around since you can then drop the assumption of a global identifier.

  3. Repository naming in general, beyond just speaking of maven repos, is a challenge. It has been proposed to name WORKSPACES with the jvm style: com_google_foo_bar etc. Suppose from that same project you wish to publish a jar "com.google:foo-bar". Now we have created a collision between the maven coordinate and the bazel repo. URLs attempt to solve this problem with namespacing and ports. We have not seen anything like this for bazel that I know of. One can imagine some standard prefixes: bazel_ for a bazel repo, maven_jar_ for a jar with a name derived from a maven coordinate, etc... It may be that the restricted set of strings bazel uses for external repo names should be expanded somewhat.
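The collision described in item 2 is easy to reproduce. The Python sketch below assumes the simplest possible mapping (every character outside letters, digits, and underscore becomes an underscore); `naive_repo_name` and `find_collisions` are hypothetical helper names, not part of any Bazel tooling.

```python
import re


def naive_repo_name(coordinate):
    """Maps a Maven 'group:artifact' coordinate to a repo name, naively."""
    return re.sub(r"[^0-9A-Za-z_]", "_", coordinate)


def find_collisions(coordinates):
    """Groups coordinates by generated name and keeps names used more than once."""
    by_name = {}
    for coordinate in coordinates:
        by_name.setdefault(naive_repo_name(coordinate), []).append(coordinate)
    return {name: cs for name, cs in by_name.items() if len(cs) > 1}


collisions = find_collisions(["com.foo.bar:baz", "com.foo:bar-baz", "junit:junit"])
```

Here both "com.foo.bar:baz" and "com.foo:bar-baz" land on the name com_foo_bar_baz, so a tool that generates names this way would need a check of this kind, or a human, to catch the clash.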

As @pcj comments, the approach here: https://github.com/bazelbuild/rules_gwt/blob/master/gwt/gwt.bzl#L425 sidesteps these issues.

Lastly, I want to make the analogy to filesystems. Correct me if I am wrong, but bind is basically like a symbolic link on a filesystem. Strictly speaking, of course, symbolic links are not required. You can copy files or everyone can agree on canonical locations. But I would argue that symbolic links give a lot of flexibility to the design space at a fairly minimal cost. In the absence of any evidence that we can get naming globally correct (we certainly have not seen it), I really don't understand the motivation to remove this tool (bind) for flexibility.

Maybe after we have solved the global naming issues, bind will truly be unneeded. For instance, I have never needed anything personally like this in the maven world, but in maven the naming problem is pretty much solved (modulo some corner cases about non-uniqueness of class -> artifact).

ekuefler commented 7 years ago

I'm not particularly happy with the use of bind in the GWT repo. It works but is very all-or-nothing: everything is easy if you're fine with the defaults and can call gwt_repositories(), but if you want to change the version of GWT or any of its dependencies you're out of luck and need to manage every transitive dependency yourself via bind. I believe this also means that, by default using gwt_repositories(), you'll have a duplicate copy of any of those dependencies you happen to use yourself in your own project, which must be downloaded separately and might lead to exciting classpath issues (though maybe Bazel does some caching and de-duping to mitigate this).

I like the end result of what @jart is illustrating with Nomulus. The blacklist/whitelist approach seems like an improvement over the bind strategy in the GWT rules, though there are potentially some naming issues. The central issue seems to be how to reliably map maven coordinates to Bazel repo names, and how to fit in artifacts that aren't in maven. Maybe it's a matter of establishing strong conventions or maybe tooling that takes maven artifacts should generate names somehow to enforce consistency. And the amount of boilerplate required to implement the Nomulus strategy seems prohibitive; it would be cool if Bazel provided some dedicated features for this.

Using bind to manage the dependencies of a shared library-like project is one thing, but how bind is used in internal projects that won't be used by others is another (probably less important) issue. The strategy at my company is to disallow the use of @repo-style deps anywhere in application code. Any external artifacts that are to be used by code must be referenced via the external namespace as defined by a bind rule, so //external:guava is bound either to the guava jar if it has no dependencies or to a dummy java_library that exports guava and declares its dependency as runtime deps.

I actually somewhat dislike this since it means we have a very large WORKSPACE and top-level BUILD file containing all external dependencies and their transitive dep information, which people need to touch whenever they're modifying external dependencies. Were I to do this again I would probably make a third_party directory with a subdirectory per dependency containing only a BUILD file defining that dependency's transitive deps to better modularize things. So the result would look the same to the application except they'd refer to //third_party:guava instead of //external:guava. So overall I think there are better options for all of my current (fairly extensive) usages of bind, and I'd be alright with deprecating it.

johnynek commented 7 years ago

Yes, I agree that bind is not elegant, or beautiful, but it seems workable.

@ekuefler we also don't use @foo names in the BUILD files (outside of 3rdparty directly, which has targets that set up the correct runtime and compile-time classpaths that people need in order to depend on external targets). So they use //3rdparty/jvm/com/google/guava:guava, for instance. The path is the maven group, and the target in the build is the maven id.

To describe our use case at Stripe: we have many maven dependencies (several hundred now) with complex interdependencies (you can't use hadoop deps without a giant web of interdependencies of apache projects). We also have a DAG of bazel repos that depend on each other, and want to share maven dependency names. Internally, we can use the same names for things with tooling, but we don't use the same as nomulus, for instance (since we just naively compute the name from the maven name to try to minimize collisions).

It really feels to me that we need more work on the external repo story. It is also related to publishing, since publishing targets is kind of the dual problem: creating external dependencies.

jart commented 7 years ago

@johnynek Nomulus, Closure Rules, etc. have adopted the following naming algorithm for Maven artifacts: https://gist.github.com/jart/41bfd977b913c2301627162f1c038e55

var CLEANSE_CHARS_ = new RegExp('[^_0-9A-Za-z]', 'g');

/**
 * Turns Maven group and artifact into Bazel repository name.
 *
 * <p>This algorithm works by turning illegal characters into underscores and
 * then eliminating redundancy. For example:
 *
 * <ul>
 * <li>com.google.guava:guava becomes com_google_guava
 * <li>commons-logging:commons-logging becomes commons_logging
 * <li>junit:junit becomes junit
 * </ul>
 *
 * @param {string} group Maven group ID.
 * @param {string} artifact Maven artifact ID.
 * @return {string} Recommended name for Bazel external repository.
 */
function getName(group, artifact) {
  var left = group.replace(CLEANSE_CHARS_, '_');
  var right = artifact.replace(CLEANSE_CHARS_, '_');
  var p = -1;
  while (p < right.length) {
    p = right.indexOf('_', p + 1);
    if (p == -1) {
      p = right.length;
    }
    var chunk = right.slice(0, p);
    if (left == chunk) {
      return right;
    }
    chunk = '_' + chunk;
    if (left.slice(-chunk.length) == chunk) {
      left = left.slice(0, -chunk.length);
      break;
    }
  }
  return left + '_' + right;
}

You are correct that it is possible for two different Maven artifacts to end up with the same name when run through this algorithm. You have a keen eye for spotting edge cases. I believe this is an acceptable tradeoff in order to make the names look nice. I feel that the solution is human review. These names would be created by a tool that generates a Bazel config by crawling the Maven repository. Then the developer would tune things accordingly.
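For readers who would rather experiment outside a browser, here is a line-for-line Python port of the getName function above. The port is a sketch made for this thread, not something taken from the gist; the name `get_name` is an assumption, and the last two calls reuse the colliding coordinates raised earlier in the discussion.

```python
import re

_CLEANSE_CHARS = re.compile(r"[^_0-9A-Za-z]")


def get_name(group, artifact):
    """Turns a Maven group and artifact into a Bazel repository name."""
    left = _CLEANSE_CHARS.sub("_", group)
    right = _CLEANSE_CHARS.sub("_", artifact)
    p = -1
    while p < len(right):
        # Grow the candidate prefix of `right` one underscore-separated word at a time.
        p = right.find("_", p + 1)
        if p == -1:
            p = len(right)
        chunk = right[:p]
        if left == chunk:
            # The artifact name already starts with the whole group name.
            return right
        chunk = "_" + chunk
        if left.endswith(chunk):
            # Drop the part of the group that the artifact name repeats.
            left = left[: -len(chunk)]
            break
    return left + "_" + right


examples = [
    get_name("com.google.guava", "guava"),
    get_name("commons-logging", "commons-logging"),
    get_name("junit", "junit"),
]
clash = (get_name("com.foo.bar", "baz"), get_name("com.foo", "bar-baz"))
```

The three entries in `examples` reproduce the names listed in the doc comment, and both elements of `clash` come out identical, which is the edge case acknowledged above.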

I've created a website that will generate these configs automatically. It looks like this: http://i.imgur.com/MuIzgcG.png Right now I'm going through the red tape required to launch it. This tool is going to make life so much better for Bazel Java users. You're going to be so happy.

You'll notice that the generated code uses java_import_external rather than maven_jar. That repository rule is defined here: https://gist.github.com/jart/70bdc88e662a5078a7d8682e5411ae8c This rule will be contributed to the Bazel codebase soon. But you can start using it today if you copy that code into your codebase. The reason why you should consider using this rule is because it allows the WORKSPACE file to define the dependency relationships. That way you don't need delegate BUILD files. Check out how much code Nomulus was able to delete thanks to java_import_external: https://github.com/google/nomulus/commit/734130aa73cba9e82bca51075224a4bdc89a74a0

As for multiple Bazel projects agreeing on naming, I suspect this is mostly a diplomacy problem.

My goal right now is to build consensus in the Bazel community. In order to eventually build a consensus, I've been putting so much work into improving Bazel core so we can have an incredible way to define external dependencies. I modified Bazel so @foo//:foo can be written as @foo in https://github.com/bazelbuild/bazel/commit/61affe77f9f5e5e7fcbbba7d8c2bcf8fbab776d5. I added exponential backoff retries to downloads in https://github.com/bazelbuild/bazel/commit/7f8e0456efe58711aae98c95c6a9dfb57824b9c2. I added a feature for redundant failover URLs in https://github.com/bazelbuild/bazel/commit/ed7ced0018dc5c5ebd6fc8afc7158037ac1df00d.

The reason why I want to build consensus around not using bind() is because Google designed Blaze to have O(n) dependencies. (Cf. NPM where dependencies are O(n^2).) Google accomplishes this through one version policy and company-wide cooperation in a single monolithic repository. It's challenging for us to all cooperate in this shared environment, where upgrading a third party dependency means potentially breaking the builds of hundreds of teams. But it saves a lot of collective effort at the end of the day. I'm not sure how much of this model will be able to carry over into the Bazel world. But I'm trying the best I can to make that the case, because I feel like this is how Blaze (and therefore Bazel too) was designed to be.

johnynek commented 7 years ago

@jart thanks for taking the time for the detailed reply.

I'm glad you are working on the developer experience for external deps. It seems your tool does something very similar to my bazel-deps tool: https://github.com/johnynek/bazel-deps

the difference is that we check in the input to that tool: https://github.com/johnynek/bazel-deps/blob/master/dependencies.yaml

and do a transitive resolve anytime there is an update. The resulting versions are checked into one file: https://github.com/johnynek/bazel-deps/blob/master/3rdparty/workspace.bzl

And the third party dependencies are automatically generated and not edited by humans: https://github.com/johnynek/bazel-deps/blob/master/3rdparty/jvm/com/google/guava/BUILD

Users need to add a line to that yaml file in any repo using this tool to change the deps. Then we run a full transitive walk again to commit to a single version for each jar. Of course, a small change can change many dependencies (see, for instance, "Dependency hell is NP-complete": https://research.swtch.com/version-sat).

Will your tool support scala? That is, how will it deal with preventing different scala version jars from landing on the classpath together? How can you upgrade the whole repo from one version to the next? (Manually updating all versions can be a massive pain.) At Stripe we are primarily using bazel for scala (although there is some small amount of legacy java).

I'm 100% behind having one version of each dependency (O(n) vs O(n^2) as you say). I don't see that as exactly related to bind. I want a single version of each jar. That is why I want bind. Otherwise, we have to make all tools agree on the naming scheme (as we have discussed). For instance, yours trades some precision for some ergonomics. I would not make that trade. External dependencies change rarely enough that I would prefer to retain the invariant that no two maven coordinates are mapped to the same target. A little extra verbosity would not be the first such case in a bazel build (which is already FAR more verbose than most maven or sbt setups would be).

jart commented 7 years ago

https://github.com/johnynek/bazel-deps

This tool is a very good implementation of the generated code approach. Your yaml file is elegant and maintainable. The best part of your design is that it doesn't have to download and crawl hundreds of pom.xml files as part of the build process, as a tool like Maven would do. So your builds go very fast.

We considered the generated code approach. We decided against it because we always found ourselves needing to tweak the generated definitions.

We started off by using that website I mentioned to generate the config. We needed to tune the generated code to make sure neverlink, testonly, and licenses are all correct. When the website resolved diamond dependencies, it would bump up versions for some App Engine libraries, but not others, so we had to make sure those all ended up being the same. The biggest post-generation tuning was for annotation processor libraries, e.g. Google Auto. We had to add all sorts of stuff to those rules, which a tool wouldn't be able to infer from pom.xml files. I understand a lot of this stuff could be incorporated into the tool and defined in the yaml file. But we felt it would lead us down a path of diminishing returns where coding had to be indirect with a progressively more complicated tool.

We would rather just have the freedom to tune the files by hand, and then write additional tools to maintain it. For example, we have scripts that run through our repositories.bzl file to make sure all those mirror URLs exist and checksums are correct. We're eventually going to write CI tooling that goes in there and bumps up versions automatically if tests pass. It's more of an assisted coding approach than a generated approach.
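
A minimal sketch of the kind of maintenance script described above, assuming the checksum and mirror URLs of each entry have already been extracted from repositories.bzl. The names `Dependency`, `check_dependency`, and the injected `fetch` callable are hypothetical and not part of any Bazel tooling; the sketch only shows the check itself: every mirror must be reachable and must serve bytes matching the recorded sha256.

```python
import hashlib
from typing import Callable, List, NamedTuple


class Dependency(NamedTuple):
    """One external artifact (hypothetical shape): expected checksum plus mirrors."""
    name: str
    sha256: str
    urls: List[str]


def check_dependency(dep: Dependency, fetch: Callable[[str], bytes]) -> List[str]:
    """Return human-readable problems; an empty list means every mirror is healthy."""
    problems = []
    for url in dep.urls:
        try:
            payload = fetch(url)
        except OSError as error:  # urllib's URLError and HTTPError subclass OSError
            problems.append(f"{dep.name}: {url} is unreachable ({error})")
            continue
        actual = hashlib.sha256(payload).hexdigest()
        if actual != dep.sha256:
            problems.append(f"{dep.name}: {url} has unexpected sha256 {actual}")
    return problems
```

In a real script, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`; injecting it keeps the check easy to exercise without network access.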

We also felt the convenience of having a website to assist us in creating these definitions was pretty cool.

Will your tool support scala?

I don't use Scala but I want to do everything I can to make sure that my efforts serve the interests of Scala developers. For example, I've contributed pull requests to rules_scala in the past and I hope to do so again in the future.

I'm 100% behind having one version of each dependency (O(n) vs O(n^2) as you say). I don't see that as exactly related to bind. I want a single version of each jar.

Touché. But I'd still argue that it's kind of related to the same concept. If each dependency exists once, then why not have exactly one name for it?

johnynek commented 7 years ago

@jart yes, I know what you mean about the exceptions. Scala does not do so much with annotation processing, but macros are similar, and dealing with that has caused some exceptions. So far, we deal with that by flags in the yml, but having the BUILD language might be nice. (side note: it would be cool to use skylark/build language as a library from other jvm programs).

So it sounds to me like we are somewhat close to agreement:

  1. we both agree it would be preferable to have 1 canonical way to refer to maven jars (and indeed each kind of external dependency).

  2. given the above, we have no need for bind because we all agree about the names.

Where we disagree maybe is:

A. whether bind should be removed before a formalized way to refer to maven jars is adopted as standard.

B. whether that naming scheme must avoid collisions in the namespace.

I'd love to see a design doc on naming schemes for external repos. I argue that they should be universal and collision-free. I think using URL/URI as inspiration we could make good progress. As a strawman, I would also suggest we could go beyond the normal bazel names for repository names. Something like "<maven:com.google.guava:guava>" might work: a URI inside <>, with a prefix that runs up to the first colon (maven in this case, but it might also be bazel-repo, npm, etc.). Each repository type would have its own rules for encoding the rest of the string.
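
To make the strawman concrete, here is a small sketch of how such a name could be split into its repository type and a type-specific remainder. Nothing here is an existing Bazel feature: `ExternalName`, `parse_external_name`, and `parse_maven` are hypothetical names, and the maven rule (group and artifact separated by a colon, version omitted) is an assumption made for illustration.

```python
from typing import NamedTuple, Tuple


class ExternalName(NamedTuple):
    kind: str  # repository type, e.g. "maven", "npm", "bazel-repo"
    rest: str  # remainder, interpreted by rules owned by that repository type


def parse_external_name(text: str) -> ExternalName:
    """Split "<kind:rest>" at the first colon."""
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError(f"expected a name wrapped in <>: {text!r}")
    kind, colon, rest = text[1:-1].partition(":")
    if not (kind and colon and rest):
        raise ValueError(f"expected <kind:rest>: {text!r}")
    return ExternalName(kind, rest)


def parse_maven(name: ExternalName) -> Tuple[str, str]:
    """Assumed maven rule: the remainder is "group:artifact"."""
    if name.kind != "maven":
        raise ValueError(f"not a maven name: {name!r}")
    group, colon, artifact = name.rest.partition(":")
    if not (group and colon and artifact):
        raise ValueError(f"expected group:artifact, got {name.rest!r}")
    return group, artifact


name = parse_external_name("<maven:com.google.guava:guava>")
print(name.kind, parse_maven(name))
```

Because only the prefix is interpreted generically, each repository type stays free to define its own encoding for the remainder.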

jart commented 7 years ago

If you're interested in having the Skylark language be spun out into its own Java library which can be used independently of Bazel, I would recommend filing a new issue requesting that feature and then CC the Lord of Skylark @laurentlb.

ittaiz commented 7 years ago

+1 for that

johnynek commented 7 years ago

#2367 added. Thanks!

abergmeier-dsfishlabs commented 7 years ago

I think all bind use cases can be replaced with alias, but it does not strike me as high priority. bind() has that weird thing that it creates a //external package that does not really exist, so for that alone I would say +1

Trying to use alias in WORKSPACE file with latest master...

alias cannot be in the WORKSPACE file

kchodorow commented 7 years ago

You can't use it in the WORKSPACE file, but you could create an alias in a BUILD file that referred to an external target.

alexeagle commented 7 years ago

FYI I just used bind() today because the docs made it look shiny. +1 for updating the docs now to avoid more and more usages.

johnynek commented 7 years ago

Today I had a use case for bind:

In the scala rules the binary version of scalatest is configured with bind (the rules expect the target of the jar for scalatest to be at a particular bind location). I was able to set that to scalatest 3.0.1 and build the code without others having to take that change.

Note, 3.0.1 has different transitive dependencies, so I could bind to a java_library that has exports, and everything worked great.

I still don't see a super clear way to do this otherwise without also having me pass in a bunch of runtime and compile time deps into some function that becomes a repo-rule that each repo using the scala rules sets up.

jart commented 7 years ago

I was able to set that to scalatest 3.0.1 and build the code without others having to take that change. Note, 3.0.1 has different transitive dependencies

@johnynek If rules_scala was using java_import_external and the omit_foo pattern, then could what you described have been accomplished as follows?

http_archive(
    name = "io_bazel_rules_scala",
    urls = [
        "http://bazel-mirror.storage.googleapis.com/github.com/bazelbuild/rules_scala/archive/d916599d38de29085e5ca9eae167716c4f150a02.tar.gz",
        "https://github.com/bazelbuild/rules_scala/archive/d916599d38de29085e5ca9eae167716c4f150a02.tar.gz",
    ],
    sha256 = "391cae2055c9e3bebdb2a6ce06157408e4831b1846043c48c648c79380b4de66",
    strip_prefix = "rules_scala-d916599d38de29085e5ca9eae167716c4f150a02",
)

load("@io_bazel_rules_closure//closure/private:java_import_external.bzl", "java_import_external")
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_repositories")

# upstream rules_scala depends on scalatest v2.2.6
# we want to swap it with scalatest v3.0.1
scala_repositories(
    omit_org_scalatest_2_11 = True,
)

java_import_external(
    name = "org_scalatest_2_11",
    licenses = ["notice"],  # the Apache License, ASL Version 2.0
    jar_sha256 = "3788679b5c8762997b819989e5ec12847df3fa8dcb9d4a787c63188bd953ae2a",
    jar_urls = [
        "http://maven.ibiblio.org/maven2/org/scalatest/scalatest_2.11/3.0.1/scalatest_2.11-3.0.1.jar",
        "http://repo1.maven.org/maven2/org/scalatest/scalatest_2.11/3.0.1/scalatest_2.11-3.0.1.jar",
    ],
    deps = [
        "@org_scala_lang_scala_compiler",
        "@org_scala_lang_scala_library",
        "@org_scalactic_2_11", # Not a dependency of v2.2.6
        "@org_scala_lang_scala_reflect",
        "@org_scala_lang_modules_scala_xml_2_11",
        "@org_scala_lang_modules_scala_parser_combinators_2_11", # Not a dependency of v2.2.6
    ],
)

# upstream rules_scala does not define this transitive dependency
java_import_external(
    name = "org_scalactic_2_11",
    licenses = ["notice"],  # the Apache License, ASL Version 2.0
    jar_sha256 = "d5586d4aa060aebbf0ccb85be62208ca85ccc8c4220a342c22783adb04b1ded1",
    jar_urls = [
        "http://repo1.maven.org/maven2/org/scalactic/scalactic_2.11/3.0.1/scalactic_2.11-3.0.1.jar",
        "http://maven.ibiblio.org/maven2/org/scalactic/scalactic_2.11/3.0.1/scalactic_2.11-3.0.1.jar",
    ],
    deps = [
        "@org_scala_lang_scala_compiler",
        "@org_scala_lang_scala_library",
        "@org_scala_lang_scala_reflect",
    ],
)

# upstream rules_scala does not define this transitive dependency
java_import_external(
    name = "org_scala_lang_modules_scala_parser_combinators_2_11",
    licenses = ["notice"],  # BSD 3-clause
    jar_sha256 = "0dfaafce29a9a245b0a9180ec2c1073d2bd8f0330f03a9f1f6a74d1bc83f62d6",
    jar_urls = [
        "http://repo1.maven.org/maven2/org/scala-lang/modules/scala-parser-combinators_2.11/1.0.4/scala-parser-combinators_2.11-1.0.4.jar",
        "http://maven.ibiblio.org/maven2/org/scala-lang/modules/scala-parser-combinators_2.11/1.0.4/scala-parser-combinators_2.11-1.0.4.jar",
    ],
    deps = ["@org_scala_lang_scala_library"],
)

johnynek commented 7 years ago

It could have been indeed @jart, but currently I solved it this way:

# use the locally set scalatest
bind(name = 'io_bazel_rules_scala/dependency/scalatest/scalatest', actual = '//3rdparty/jvm/org/scalatest')

Since we have huge and complex dependency graphs of external code, we have to have tooling already to handle resolving them.

If you remove bind, we have to significantly retool around this (or stick on old versions of bazel until we find cycles to migrate).

johnynek commented 7 years ago

PS: if this abbreviated maven coordinate approach (removing redundancies in group and artifact) is going to be pushed, has anyone pulled a list of artifacts from maven central to see how many collisions there would be?
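
A rough sketch of how such a collision count could be run over a list of coordinates. The abbreviation rule below is only a guess at the scheme under discussion: it splits on punctuation and drops artifact tokens that repeat the end of the group, which does reproduce names seen in this thread such as org_scalatest_2_11. The function names and the org.example coordinates are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Coordinate = Tuple[str, str]  # (group, artifact), version omitted


def _tokens(text: str) -> List[str]:
    """Split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def abbreviated_name(group: str, artifact: str) -> str:
    """Assumed lossy rule: drop the artifact prefix that repeats the group suffix."""
    g, a = _tokens(group), _tokens(artifact)
    for size in range(min(len(g), len(a)), 0, -1):
        if g[-size:] == a[:size]:
            a = a[size:]
            break
    return "_".join(g + a)


def find_collisions(coordinates: Iterable[Coordinate]) -> Dict[str, List[Coordinate]]:
    """Map each abbreviated name to the distinct coordinates that share it."""
    by_name = defaultdict(set)
    for group, artifact in coordinates:
        by_name[abbreviated_name(group, artifact)].add((group, artifact))
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}


sample = [
    ("com.google.guava", "guava"),
    ("org.scalatest", "scalatest_2.11"),
    ("org.example", "example-util"),  # hypothetical coordinates
    ("org.example", "util"),          # hypothetical coordinates
]
for name, found in find_collisions(sample).items():
    print(name, found)
```

Feeding it a real dump of group and artifact IDs would give the number of colliding names directly as `len(find_collisions(...))`.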

I would much rather have an encoding from maven coordinate to bazel repo name that is lossless, even if that means using some escaping scheme or considering adding characters to the allowed repository names.
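
One possible shape for a lossless encoding, sketched under the assumption that repository names may only contain letters, digits, and underscores. Every other byte, including the underscore itself, is escaped as an underscore followed by two hex digits, so the original coordinate can always be recovered. The function names are hypothetical, and the sketch ignores any further restrictions, such as on the first character of a name.

```python
from typing import Tuple


def encode_coordinate(group: str, artifact: str) -> str:
    """Encode "group:artifact" using only letters, digits, and underscores."""
    out = []
    for byte in f"{group}:{artifact}".encode("utf-8"):
        char = chr(byte)
        if byte < 128 and char.isalnum():
            out.append(char)
        else:
            out.append(f"_{byte:02x}")  # '_' is escaped too, which keeps this reversible
    return "".join(out)


def decode_coordinate(name: str) -> Tuple[str, str]:
    """Invert encode_coordinate; assumes the group itself contains no ':'."""
    data = bytearray()
    i = 0
    while i < len(name):
        if name[i] == "_":
            data.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            data.append(ord(name[i]))
            i += 1
    group, artifact = data.decode("utf-8").split(":", 1)
    return group, artifact


encoded = encode_coordinate("com.google.guava", "guava")
print(encoded)
print(decode_coordinate(encoded))
```

The price is readability: the escapes make names longer and noisier than an abbreviated scheme, which is exactly the precision versus ergonomics trade discussed in this thread.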

jart commented 7 years ago

If //3rdparty/jvm/org/scalatest exists within this repository, and it also depends on rules_scala, then that would mean multiple versions of the same Scala jar exist within that repository.

Once again, I would advise caution. Developers at Google need to be granted an exception to one version policy (described earlier) in order to do that. We're supposed to have a single label for any given library, which must be a single version. Our lawyers keep a close eye on our //third_party folder to make sure we're doing exactly that (among other things.)

I share this information firstly because we want other companies to be successful with Bazel. I believe the best way to do that is by plainly stating what Google did, and didn't do, internally with Bazel. Secondly, as evidence that we successfully built a repository where this use case wasn't encountered.

PS: if this abbreviated maven coordinate approach (removing redundancies in group and artifact) is going to be pushed, has anyone pulled a list of artifacts from maven central to see how many collisions there would be?

We're currently doing other meta-analysis of Maven, as part of Operation Rosehub, so that's something we can look into. Thank you for the excellent suggestion.

johnynek commented 7 years ago

We have one version of the jar. What I didn't show was that we could remove some special casing in the tool that does the transitive dependency walk to force it to use the scalatest jar that came from the rules. Removing those exceptions is another nice thing we get but that is a side benefit.

So in addition to the above solution adding 1 line of code, it also deleted code that looked like this:

replacements:
  org.scalatest:
    scalatest:
      lang: java
      target: "@scalatest//jar"

which is a feature of the tool we have discussed earlier in the thread: https://github.com/johnynek/bazel-deps so you can replace a maven dependency with another target (maybe a local bazel build if you have it, or another name).

ittaiz commented 7 years ago

+1 for lossless maven coordinates to bazel workspace name translation (with of course omitting the version)

ronshapiro commented 6 years ago

I'm also +1 on lossless maven -> bazel coordinates. While it would be nice, I think it's hard to establish consistency. I'd rather have things more explicit and then create aliases within the actual project (which is what Dagger does)