bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Create a skylark repository rule for maven artifacts #1410

Closed kchodorow closed 5 years ago

kchodorow commented 8 years ago

Some FRs that have come up in the past:

kchodorow commented 8 years ago
jart commented 8 years ago

Added a proposal sorta related to this in #1733. Feel free to close it out and schlep it into this issue.

aj-michael commented 7 years ago

:+1: for adding transitive deps to this. Is there any technical reason that we're aware of that we haven't made transitive deps work yet?

jart commented 7 years ago

It depends on what you mean by transitive deps working. The biggest problem right now I feel is that maven_jar doesn't let one define the dependency relationships. I've fixed this in the java_import_external repository rule which I'll be contributing to Bazel shortly.

I've also built a web GUI which I'm currently seeking approval to launch which will make it easy for users to generate configurations for this rule. The web GUI will read the pom.xml files from the Maven server, resolve transitive and diamond dependencies, and create code that shows you exactly what's going into your project. I feel like this is the best direction for Bazel. It leads to much faster builds which are actually hermetically sealed without magic.
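A minimal sketch of what "resolving transitive and diamond dependencies" involves, assuming the pom.xml data has already been fetched and reduced to a plain dictionary. The artifact names, versions, and the `resolve` helper are hypothetical, and the "first version reached breadth-first wins" policy is only an approximation of Maven's nearest-wins mediation; it is not taken from the web GUI described above.

```python
from collections import deque

# Hypothetical, hand-written data standing in for parsed pom.xml files.
# Keys are (artifact, version); values list the direct dependencies.
POMS = {
    ("app", "1.0"): [("left", "1.0"), ("right", "1.0")],
    ("left", "1.0"): [("shared", "1.0")],
    ("right", "1.0"): [("shared", "2.0")],
    ("shared", "1.0"): [],
    ("shared", "2.0"): [],
}


def resolve(root, poms):
    """Walk the graph breadth-first, keeping one version per artifact."""
    chosen = {}                # artifact name -> version that won
    queue = deque([root])
    while queue:
        name, version = queue.popleft()
        if name in chosen:     # diamond: a version was already picked nearer the root
            continue
        chosen[name] = version
        queue.extend(poms.get((name, version), []))
    return chosen


resolved = resolve(("app", "1.0"), POMS)
for name in sorted(resolved):
    print(name, resolved[name])
```

Writing the resolved set out explicitly, rather than recomputing it on every fetch, is what lets the resulting configuration be audited and mirrored.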

aj-michael commented 7 years ago

By transitive deps working, I mean the rule fetching the dependency relationships from the Maven server and not requiring the developer to specify them.

jart commented 7 years ago

In order to do that in a repository rule, it would probably be necessary to have the rule shade all the transitive jars into the root jar. That means rewriting the transitive class names, rewriting the byte code, and then the code size increases quadratically.

aj-michael commented 7 years ago

Hmmm, I'm not sure I follow. Why would it be necessary to shade the transitive jars and rewrite class names? Perhaps I'm missing something, but the way I would expect it to work would be:

  1. Change the mvn command that we use to download the JAR to also download its transitive dependencies.
  2. Change the maven_jar_build_file_template to create a separate java_import for each of the dependency artifacts and wire these up with exports. These targets would be something like @somemavenjar//jar:dep_on_guava_21.0.
  3. Developer depends on @somemavenjar//jar which exports all of its dependencies.

I don't know how to do 1, but I assume there must be a way since other build tools do this.
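Step 2 of the list above can be sketched as a small generator that emits one java_import per dependency and a root target that exports them. This is only an illustration under stated assumptions: `target_name`, `render_build_file`, and the jar file names are hypothetical, and the sketch is not how maven_jar_build_file_template is actually implemented.

```python
def target_name(artifact, version):
    """Turn 'guava', '21.0' into a target name such as 'dep_on_guava_21.0'."""
    return "dep_on_%s_%s" % (artifact, version)


def render_build_file(root_jar, deps):
    """Return BUILD file text: one java_import per dependency, plus a root
    'jar' target that exports all of them."""
    blocks = []
    for artifact, version in deps:
        blocks.append(
            'java_import(\n'
            '    name = "%s",\n'
            '    jars = ["%s-%s.jar"],\n'
            ')\n' % (target_name(artifact, version), artifact, version)
        )
    exports = ", ".join('":%s"' % target_name(a, v) for a, v in deps)
    blocks.append(
        'java_import(\n'
        '    name = "jar",\n'
        '    jars = ["%s"],\n'
        '    exports = [%s],\n'
        '    visibility = ["//visibility:public"],\n'
        ')\n' % (root_jar, exports)
    )
    return "\n".join(blocks)


build_text = render_build_file("somemavenjar.jar", [("guava", "21.0")])
print(build_text)
```

Placed in the jar package of the external repository, the root target would be reachable as @somemavenjar//jar and the dependency as @somemavenjar//jar:dep_on_guava_21.0.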

jart commented 7 years ago

Having a single remote repository for all the maven jars required by the project, and each individual jar being its own rule within the repository, would avoid the need for shading. E.g. @closure_rules_maven_jars//:com_google_guava. Shading is only necessary if you want to have the same behavior as maven_jar where jars have a 1:1 mapping with repository names.

But doing things that way introduces another problem. What if another Bazel project depends on that Bazel project? It would have to adopt @closure_rules_maven_jars as its container for all its jars, and then redefine the whole thing, in order to put its own jars in there. If it doesn't do that, then we end up with quadratic dependencies again.

jart commented 7 years ago

There's a lot of value to not fetching transitive dependencies auto-magically. For example, with the web gui I just wrote, I generated the following config for com.google.template:soy:2016-08-25. In doing so, I was able to identify a bug in com_google_common_html_types which is depending on Guava Testing Library without declaring it as a test scoped dependency. I was also able to audit the licenses of all my transitive dependencies very easily. But most importantly, by using this config, builds are going to go insanely fast for my users, because calculating that config required downloading 150 things, e.g. pom.xml files. Furthermore, I'm able to effectively mirror my dependencies so builds can be durable and never break.

wstrange commented 7 years ago

@jart The web gui sounds awesome. Are you close to open sourcing it?

jart commented 7 years ago

Expect it at some point in the upcoming months. I need to go through the process. I've also got a lot of other stuff on my plate with TensorFlow.

davido commented 7 years ago

FWIW: the Gerrit Code Review project created its own version of the maven_jar Skylark rule and extracted it to the bazlets repository: [1]. It does not use mvn, though.

aj-michael commented 7 years ago

Do any of the skylark maven rules work on Windows?

ittaiz commented 7 years ago

Bazel has a skylark maven_jar rule which uses mvn. Isn't that what this ticket is about? Is it open as an aggregate for all the missing features? As in, we have something but it's not mature enough?

ittaiz commented 7 years ago

Is there a benchmark between the native and skylark versions? Sounds like spawning a new mvn per jar can be really expensive when talking about a repo with hundreds or thousands of external dependencies

jart commented 7 years ago

java_import_external is native and will download jars as fast as your internet connection goes. Kristina and I spent a lot of time designing Bazel's native downloader for scalability and 99.9% reliability. For example, bazel fetch on this configuration with 59 downloads happens in four seconds.

ittaiz commented 7 years ago

@jart I might have misunderstood something, but the skylark maven_jar version does not use Bazel's native downloader, right? It uses mvn. Are you simply pointing me to a more robust alternative which you trust? In any case I appreciate you taking the time :)

jart commented 7 years ago

maven_rules.bzl farms out downloads to the system mvn command, as you pointed out. native.maven_jar farms out downloads to some third party java library (see MavenDownloader) which is much faster than running the mvn command, but not as robust as Bazel's downloader.

In order to benefit from Bazel's highly advanced downloader, you have to call repository_ctx.download or repository_ctx.download_and_extract in Skylark, or use any of the native workspace rules with the exception of http_jar and maven_jar.

ittaiz commented 7 years ago

ok, that was the missing piece. Thanks!

davido commented 7 years ago

One important aspect for us is not only to accelerate the download process and/or make it more robust, but to try very hard to avoid the download and save network bandwidth in the first place. The Gerrit Code Review project has a lot of (ca. 150) third-party dependencies. There are also more than 100 Gerrit plugins. If you build all the dependent projects and all plugins, or even if you clone stable branches into their own project directories, you end up with all currently available Bazel maven_jar incarnations fetching the same artifacts into different locations hundreds of times!

The only exception is Gerrit's own maven_jar, which was originally written by Shawn (Gerrit Code Review maintainer) when Gerrit used Buck and was then ported straightforwardly to a Bazel Skylark rule. We stage all [1] downloaded artifacts in the ~/.gerritcodereview directory and hard link them into the Bazel project location. Yes, we do that using curl from a Python script. But having done this for the last 5 years, we've never had any issues with it. As a consequence, if 100 clones of the same or different projects use version 42 of artifact A, it is downloaded only once and never again.
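The staging idea can be sketched in a few lines of Python. This is a hedged illustration, not Gerrit's actual script: the `stage` helper is hypothetical, the download is an injected `fetch` callable rather than curl, and a temporary directory stands in for ~/.gerritcodereview. Hard links also assume the cache and the destination live on the same filesystem.

```python
import os
import tempfile


def stage(cache_dir, name, dest, fetch):
    """Link `dest` to a cached copy of `name`, calling `fetch` only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, name)
    if not os.path.exists(cached):
        # Write to a temporary name first so a failed download never
        # leaves a truncated file under the final cache name.
        partial = cached + ".partial"
        with open(partial, "wb") as out:
            out.write(fetch(name))
        os.replace(partial, cached)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if os.path.exists(dest):
        os.remove(dest)
    os.link(cached, dest)  # hard link: no second copy on disk
    return cached


# Usage with a fake downloader that records how often it is called.
calls = []


def fake_fetch(name):
    calls.append(name)
    return b"jar bytes for " + name.encode("utf-8")


root = tempfile.mkdtemp()
cache = os.path.join(root, "cache")
for clone in ("clone_a", "clone_b"):
    stage(cache, "A-42.jar", os.path.join(root, clone, "lib", "A-42.jar"), fake_fetch)

print(len(calls))  # prints 1: the second clone reused the cached file
```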

[1] Unfortunately this is not true any more: since we depend on Bazel's closure rules, we lost the ability to use the staging directory feature for 100% of our dependencies. That's because the closure rules depend on java_import_external, which we do not control. See this commit for more context and background.

jart commented 7 years ago

Bazel's downloader supports the HTTP_PROXY environment variable. Just set that to a Squid proxy running on your network and you're good to go.

kchodorow commented 7 years ago

As this ticket is getting a bit long and convoluted, here's a summary of the state of maven_jar:

There currently exist several options for maven_jar:

  1. The native maven_jar rule. This does not support auth and uses Maven's own libraries to download jars, which are neither as reliable nor as cacheable as the other options.
  2. The @bazel_tools//build_defs/repo/maven_rules.bzl rule that @jin implemented. Downsides are that it requires mvn to be installed and spawns one Maven process per maven_jar rule, which can be very slow. Pros are that it uses Maven directly, so it respects auth/proxy settings you have on your system.
  3. @jart's java_import_external rule: much more flexible than any other option (look at all these attributes) and uses the multiplexing downloader @jart wrote to be fast and reliable. The downside is that it won't pick up on system auth settings. I recommend using this one, if possible.

Things left to do:

jart commented 7 years ago

Thank you for the support. Note for our readers: @foo//:foo can be written as @foo and java_import_external creates a @foo//jar alias.
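As an illustration of the shorthand mentioned above, here is a small sketch that expands abbreviated labels into their explicit form. The `expand_label` helper is hypothetical and only handles absolute labels of the shapes shown; it is not part of Bazel.

```python
def expand_label(label):
    """Expand '@foo' to '@foo//:foo' and '@foo//jar' to '@foo//jar:jar'."""
    if ":" in label:
        return label  # already names an explicit target
    repo, sep, package = label.partition("//")
    if not sep:
        # Bare '@foo': root package, target named after the repository.
        return "%s//:%s" % (repo, repo.lstrip("@"))
    # '@foo//pkg': target named after the last package component.
    return "%s//%s:%s" % (repo, package, package.rsplit("/", 1)[-1])


for label in ("@foo", "@foo//jar", "@foo//:foo"):
    print(label, "->", expand_label(label))
```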

ittaiz commented 7 years ago

Great recap, thanks!

2 also has the downside of requiring Maven to be installed, doesn't it?

Any thoughts of bundling @jart's version in Bazel or in a smaller repo? Not sure I want to depend on rules_closure only for this

On Mon, 7 Aug 2017 at 22:44 Kristina notifications@github.com wrote:

As this ticket is getting a bit long and convoluted, here's a summary of the state of the maven_jar:

There currently exists several options for maven_jar:

  1. The native maven_jar rule. This does not support auth and uses Maven's own libraries to download jars, which are not quite as reliable nor cachable as the other options.
  2. The @bazel_tools//build_defs/repo/maven_rules.bzl rule that @jin https://github.com/jin implemented. Downside is that it spawns one Maven process per maven_jar rule, which can be very slow. Pros are that it uses Maven directly, so it respects auth/proxy settings you have on your system.
  3. @jart https://github.com/jart's java_import_external rule: much more flexible than any other option (look at all these attributes https://github.com/bazelbuild/rules_closure/blob/master/closure/private/java_import_external.bzl#L106-L121 and uses the multiplexing downloader @jart https://github.com/jart wrote to be fast and reliable. Downsides are that it won't pick up on system auth settings and it uses a different naming scheme than the others ( @foo//:foo, if I recall correctly, instead of @foo//jar). I recommend using this one, if possible.

Things left to do:

  • Download src jars in 1 & 2.
  • Add an option for downloading the docs jars.
  • Add sha256 as a checksumming option.
  • Support auth in 1 & 3.

kchodorow commented 7 years ago

Thanks for the feedback, updated the comment to remove the note about the different naming and added a mention that mvn has to be installed for #2.

@jart actually had a CL adding it to Bazel but I don't think it was ever submitted.

cgrushko commented 7 years ago

  1. Can 3 be used to download srcjars?
  2. Should we deprecate 1 and 2, if 3 is the recommended way?

aj-michael commented 7 years ago

Only option 2 supports AAR files. It would be great if whatever solution we settled on supported arbitrary artifact packaging types.

jart commented 7 years ago

@kchodorow I've mailed you a changelist adding java_import_external to Bazel. The community should be able to expect it soon. I've also added very helpful documentation with examples.

wstrange commented 7 years ago

Speaking as a Bazel newbie, presenting multiple solutions for maven migration is very confusing.

A single, well supported, documented and "official" maven migration solution would be really nice, and I think is key for driving bazel adoption for Java projects.

ittaiz commented 7 years ago

@jart we (scala people) need to be able to turn off ijar creation for some external jars. A current ad hoc solution is to use the native maven_jar and a custom scala_import which uses the file instead of the java_library. Will it be possible to support disabling ijars in specific cases?

jart commented 7 years ago

If the Bazel authors add an attribute to java_import that turns off ijar creation, then java_import_external will absolutely be updated, since the latter is basically the same rule with some urls attributes added.

ittaiz commented 7 years ago

Thanks! @kchodorow are you the right person to ask?

cgrushko commented 7 years ago

@jart did you end up adding java_import_external to somewhere in Bazel?

jart commented 7 years ago

@cgrushko Indeed I did. It was added to the Bazel codebase 28 days ago in https://github.com/bazelbuild/bazel/commit/062fe70189fc622285833311d241021be313680b. Judging by the baseline, it doesn't look like it made it into 0.5.4, but it's certain to make it into the next one. I hope you enjoy this rule. Usage examples can be found in Closure Rules, Nomulus, and many other places.

wstrange commented 7 years ago

@jart Does that rule support authentication to a private maven repo (Artifactory in our case)?

If not, any ETA?

or-shachar commented 7 years ago

Hey @jart Rumor has it that you also created some gui tool for converting maven coordinates to java_import_external. Is it open sourced? We'd love to check it out!

StephenAmar commented 7 years ago

Any news regarding that web tool you've been mentioning in other bugs, @jart? I kind of want to migrate my repo to java_import_external, but without something like generate_workspace to resolve transitive dependencies, it's quite a lot of work.

jart commented 7 years ago

Behold Bazel Maven Config Generator in https://github.com/bazelbuild/bazel/pull/3946 and the demo video on YouTube. @or-shachar @StephenAmar

wstrange commented 7 years ago

From a quick glance of the above PR, it looks like this does not support private Maven repos such as Artifactory?

jart commented 7 years ago

@wstrange I don't see why it wouldn't. It also depends on what you mean. For example, you can just sed "repo1.maven.org" in index.html to whatever and it'll crawl the POMs. If you want it to be able to crawl multiple POM repos, that might not be a trivial change.

Also keep in mind that java_import_external has no awareness of POM metadata. It just grabs jars from whatever URL. I'm also pretty sure Bazel's downloader can do HTTP auth using environment variables. See ProxyHelper.java. It's also probably possible to put the user:pass in the URLs themselves, although you might not want to check that into your codebase.
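Since only a jar URL is needed, the URL can be derived from a Maven coordinate using the standard repository layout (group dots become slashes, followed by artifact, version, and artifact-version.jar). A minimal sketch follows; the `maven_jar_url` helper is hypothetical, and the alternative server mentioned afterwards is a placeholder.

```python
def maven_jar_url(coordinate, server="https://repo1.maven.org/maven2"):
    """Build the jar URL for a 'group:artifact:version' coordinate."""
    group, artifact, version = coordinate.split(":")
    return "%s/%s/%s/%s/%s-%s.jar" % (
        server.rstrip("/"),
        group.replace(".", "/"),
        artifact,
        version,
        artifact,
        version,
    )


print(maven_jar_url("com.google.template:soy:2016-08-25"))
```

Passing a different `server` value, for example a hypothetical https://artifacts.example.com/libs-release, yields the same path under a private repository root.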

It's also worth mentioning that the Google Drive mirroring feature sort of magically and painlessly creates your own private Maven server on the fly. Although it just mirrors the JARs, since that's all java_import_external needs.

wstrange commented 7 years ago

[Disclaimer: I am a Bazel newbie, so the questions I am asking may not make sense ;-) ]

The way our Artifactory repo works is that there could be several different repos defined, and each has a potentially different set of credentials. So the http auth credentials used by java_import_external would vary depending on which repo the dependency is coming from.

Maven handles all of this by using the credentials defined in ~/.m2/settings.xml. It is not clear to me how to accomplish the same thing with Bazel.

jart commented 7 years ago

Is Artifactory sort of like a really robust Squid caching proxy? Reading about it, I couldn't help but notice that Artifactory Enterprise Edition offers five-nines availability. I actually have a great deal of respect for the JFrog developers, for having achieved this level of reliability. It's a level of engineering most thought only AT&T and Chubby could master. Even Google Cloud Storage, with its transcontinental redundancy, is only able to promise three-nines. However, java_import_external can actually deliver Erlang reliability. If the urls=[...] attribute has mirrors to three three-nine CDNs then you get nine-nines availability ((1-(1-99.9/100)^3)*100 = 99.9999999). If Jesus Christ used Bazel then there'd be about 63 seconds thence when builds could break on downloads. But if we consider that Bazel retries failed requests with exponential backoff for longer than that, then the reliability that spans the ages actually transcends nines and becomes 100. Bazel Community Edition can offer you this incredible level of value, not just for the low-low price of $29,500/year. No my friends, in fact, it doesn't even cost $14,750. You can have it all for the bargain basement price of zero dollars. Yes ladies and gentlemen it's free, and the source code comes included.
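The availability arithmetic above can be checked with a short sketch. It assumes mirror outages are independent of one another and counts 2017 years of 365.25 days each; both are assumptions made for the illustration, and `combined_availability` is a hypothetical helper.

```python
def combined_availability(per_mirror, mirrors):
    """Probability that at least one of `mirrors` independent mirrors is up."""
    return 1 - (1 - per_mirror) ** mirrors


three_nines = 0.999
combined = combined_availability(three_nines, 3)

# Expected unavailable time over 2017 years at that availability.
seconds_per_year = 365.25 * 24 * 60 * 60
downtime = (1 - combined) * 2017 * seconds_per_year

print("%.7f%%" % (combined * 100))
print("%.1f seconds" % downtime)
```

The result lands a little above 63 seconds, in line with the figure quoted above. If outages were correlated, for example all three mirrors sharing one upstream, the combined figure would be lower.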

But it might need improvement when it comes to that private authentication use case. It's one I haven't considered, because I mostly do open source stuff. Also internally at Google we just vendor everything in our monolithic repo.

One thing you could do is put this in your zone:

$TTL 0
artifacts    IN  A    192.168.10.4
             IN  A    192.168.10.5
             IN  A    192.168.10.6

Put this on your servers:

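import base64
import http.client
import http.server
import shutil
import socketserver
import urllib.parse


def basic(user, password):
  """Returns the value of an HTTP Basic Authorization header."""
  token = base64.b64encode(('%s:%s' % (user, password)).encode('utf-8'))
  return 'Basic %s' % token.decode('ascii')


# Credentials to inject, keyed by upstream host[:port].
AUTHORIZATIONS = {
    'maven.initech.com': basic('aladdin', 'opensesame'),
    'maven.vendoro.com': basic('aladdin', 'opensesame'),
    'localhost:5000': basic('aladdin', 'opensesame'),
}


class Handler(http.server.BaseHTTPRequestHandler):
  def go(self):
    # A proxy request line carries the absolute URL: split off the upstream
    # host and keep only the path and query for the forwarded request.
    ru = urllib.parse.urlparse(self.path)
    pu = urllib.parse.ParseResult('', '', ru.path, ru.params, ru.query, ru.fragment)
    headers = {k: v for k, v in self.headers.items() if k.lower() != 'host'}
    headers['Host'] = ru.netloc
    auth = AUTHORIZATIONS.get(ru.netloc)
    if auth:
      headers['Authorization'] = auth
    if ru.scheme == 'https':
      c = http.client.HTTPSConnection(ru.netloc)
    else:
      c = http.client.HTTPConnection(ru.netloc)
    try:
      # Host and Accept-Encoding are sent explicitly below, so tell
      # putrequest not to add its own copies.
      c.putrequest(self.command, pu.geturl(), skip_host=True,
                   skip_accept_encoding=True)
      for k, v in headers.items():
        c.putheader(k, v)
      c.endheaders()
      r = c.getresponse()
      self.send_response(r.status)
      for k, v in r.getheaders():
        # http.client already decodes chunked bodies, so do not forward
        # Transfer-Encoding along with the decoded payload.
        if k.lower() != 'transfer-encoding':
          self.send_header(k, v)
      self.end_headers()
      shutil.copyfileobj(r, self.wfile)
      self.wfile.flush()
    finally:
      c.close()
  do_GET = go
  do_HEAD = go


class ThreadedHTTPServer(socketserver.ThreadingMixIn,
                         http.server.HTTPServer):
  daemon_threads = True


# Listen on all interfaces, port 4000, and serve until interrupted.
ThreadedHTTPServer(('', 4000), Handler).serve_forever()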

Then run Bazel like this:

$ HTTP_PROXY=http://artifacts:4000 bazel build //...

And you should be good.

wstrange commented 7 years ago

So I think what you are saying is that when you are at 10 nines of availability, you have no place to go. Bazel goes to 11 nines.

Artifactory and Nexus are very common in the "enterprise" space. If Bazel is to attract hordes of Java developers (and that may not be a goal ;-) ), having first class support for private maven repositories (with authentication) is essential.

The proxy idea is super creative (I really appreciate you taking the time to put together a solution). I'll review it, but I think it will be a non-starter in my organization. The solution has to be integrated and out of the box.

I return to looking at Bazel every 6 months or so, because we desperately need something like it (maven build and test times are getting absurd). But I have to sell this internally, and the maven migration experience is just not there yet. I'll be back though ;-)

pcj commented 7 years ago

Hi Warren. I'd encourage you to file an issue on rules_maven. It uses gradle to resolve transitive deps under the hood. As gradle already factors in the settings.xml file when fetching artifacts, I'd hazard a bet that getting this to work might not be too hard. We'd just have to be able to pass in your settings.xml file as a label to the maven_repository rule so that it can be discovered. It may also require some tweaking of the repositories attribute that maps GROUP:NAME patterns to the (artifactory) url where those artifacts can be found.

jart commented 7 years ago

@wstrange I encourage you to file a feature request asking for the ability to add, say, fetch --auth user:pass@user.com to ~/.bazelrc so the downloader can do Basic Authentication (see also). It's not an unreasonable thing to ask, and wouldn't be difficult to implement. But there's the proxy solution in the interim.

I can't speak for the Bazel team or Google, but I'm sure they want nothing more than the largest number of people to benefit from Bazel as possible. While we're in the business of sharing world-class technology, we can't always be in the business of solutions, and some assembly is required. I think that's OK, because it creates opportunities for entrepreneurs to build those turn-key solutions on top of the work we're sharing.

For example, nothing would make me happier than to see someone come along, take that Apps Script I posted a few comments ago, and get rich turning it into a business. If that ends up being one of you, buy me a drink next time you're in the Bay Area.

StephenAmar commented 6 years ago

@jart Thanks a lot for the config generator. It was very useful.

A tricky question for you though. I'm having a lot of trouble using extra_build_file_content because I can't seem to be able to use non native rules there (like a rule to shade libraries, or scala specific rules).

Any ideas?

jart commented 6 years ago

I would advise against doing anything nontrivial in extra_build_file_content. You can probably do it in your main repo build files. Otherwise, you might be able to load() the appropriate skylark rules, possibly using "@//..." syntax to reference the main repo.

dslomov commented 5 years ago

All such feature requests now belong in https://github.com/bazelbuild/rules_jvm_external