support input discovery

johnynek commented 5 years ago

see this thread: https://groups.google.com/forum/#!topic/bazel-dev/oFRdGdrm8DM

here is the design doc: https://github.com/emmanuel-p/proposals/blob/b207bba34e0d5a47be478087c41396764abdff29/designs/2019-04-04-starlark-dependencies-pruning.md

it is implemented now: https://groups.google.com/d/msg/bazel-dev/oFRdGdrm8DM/k0s1vwvlCAAJ

The idea is simple: produce a list of unused files by an action. The design doc is pretty unclear about the format of the file. I assume one line per file? This commit seems to be confirming that guess: https://github.com/bazelbuild/bazel/commit/57363813d3c72129c61c35a2b03c3033b62cb49d#diff-b604907e2f234f2839600aabf9dc4498R139

Looks like a UTF-8 file, one line per file name, empty lines are ignored, lines are .trimed.

JaredNeil commented 5 years ago

Is this even possible to do correctly in Scala? Scala package paths don't have to match the file path (and a single file can declare multiple packages), so every source file must be "used" to determine if it declares a package that was imported anywhere else.

Does Scala have any kind of "strict packages" mode that behaves more like Java?

johnynek commented 5 years ago

we already have the unused dependency checker which we optionally use to error. We can use the same code and just emit the list of unused jars rather than error.

It isn’t about individual files inside a target, it is about the full target output which bazel is tracking. I don’t see why the above won’t work.

ittaiz commented 5 years ago

I haven’t read the above (and it’s exiciting) but I just need to re-iterate that from experiments I did inside Wix’s codebase the unused dependency checker was way way off often declaring used dependencies as unused.

On Thu, 27 Jun 2019 at 0:14 P. Oscar Boykin notifications@github.com wrote:

we already have the unused dependency checker which we optionally use to error. We can use the same code and just emit the list of unused jars rather than error.

It isn’t about individual files inside a target, it is about the full target output which bazel is tracking. I don’t see why the above won’t work.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bazelbuild/rules_scala/issues/777?email_source=notifications&email_token=AAKQQF5NU4347RZAAGUG2E3P4PL4LA5CNFSM4H3NLIGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYU2Z4I#issuecomment-506047729, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKQQFYIIRAU65TJMURVRQ3P4PL4LANCNFSM4H3NLIGA .

johnynek commented 5 years ago

It would be great to reproduce those. We run it turned on in almost all targets but there are a few cases where it is wrong.

With more reproductions of misfiring cases we should be able to fix the issues.

JaredNeil commented 5 years ago

You'll have the same problem at the jar level, since a jar can contain any package. You have to open the jar to know if the package you want is in there. If it was unused before, there is no guarantee it is still unused.

Contrived example ahead: https://gist.github.com/JaredNeil/a41e24ec9f1038140c36c705dae8ab58

bazel build :collision will warn about :no being unused, because it is, but the build will pass.

warning: Target '//:no' is specified as a dependency to //:collision but isn't used, please remove it from the deps.
You can use the following buildozer command:
buildozer 'remove deps //:no' //:collision
one warning found
Target //:collision up-to-date:
  bazel-bin/collision.jar

Then switch the package in no.scala from no to yes and bazel build :collision again. Now you'll get a compile error:

collision.scala:6: error: type mismatch;
 found   : Int(3)
 required: String
  def x = 3
          ^
one error found
one error found
java.lang.RuntimeException: Build failed
        at io.bazel.rulesscala.scalac.ScalacProcessor.compileScalaSources(ScalacProcessor.java:244)
        at io.bazel.rulesscala.scalac.ScalacProcessor.processRequest(ScalacProcessor.java:69)
        at io.bazel.rulesscala.worker.GenericWorker.run(GenericWorker.java:114)
        at io.bazel.rulesscala.scalac.ScalaCInvoker.main(ScalaCInvoker.java:41)
Target //:collision failed to build

But if you had produced an "unused files list" that contained no.jar during the first build, no.jar would have been ignored, there would be no changes, and the rebuild would have been skipped. Thus, a clean build would have a compile error, but an incremental build would "pass" because it wouldn't run.

johnynek commented 5 years ago

Okay. I see your objection. I believe this applies to the feature itself in bazel. Your point is that changing the contents of an unused file can make it either used or an error.

So, it is only a kind of heuristic that works in some common cases.

This seems like a good point. I’m not sure anyone raised it. It certainly applies to Java jars or zip files as well. Maybe C++ uses multiple outputs to side step this.

I guess one approach would be to output a directory of class files, but still your point about generating collisions holds.

pauldraper commented 5 years ago

For Scala (and Java) the solution is to create a new action to list classpath classes and check for duplicates. This is a fast operation.

Then add that action as a dummy input to the compilation action.

So a situation like you describe @JaredNeil would produce a failure before the compilation step. But a change in an unused class still produces a cache hit for the compilation action.

pauldraper commented 5 years ago

However, I think something significant hasn't been mentioned.

Anyone using strict deps doesn't have any unused inputs. As is, this would be an alternative to strict deps, not a performance improvement on top of it.

If you wanted to improve performance, ideally, inputs would be need to be more granular than they already are. You can't use classfiles because the inputs/outputs have to be known at analysis time, and I assume there's no special treatment of directories.

You could use heuristics to produce multiple JAR files. Examples:

At analysis time, create 0.jar, 1.jar, 2.jar, n.jar (with n being based on number of source files). At execution time, bucket the classfiles by filename into those.
At analysis time, create JARs based on the source file names, plus one for leftovers. At execution time, bucket classfiles by filename into those.

The key to produce relatively stable relationships between sources and compilation outputs, so that a change to a source file changes a minimal number of compilation outputs (which are future compilation inputs).

pauldraper commented 5 years ago

The primary obstacle -- I believe -- is actually wildcard imports. They can shadow and produce usage patterns that don't work well with input discovery.

To be correct, you need one of

Add a manifest of all classes as an used input, triggering compilation each time a class is added/removed. This drastically hurts cache hits. (And Scala produces a lot of classes.)
Track wildcard imports. Even zinc has some difficulty with this.
Disallow wildcard imports, or at least certain patterns. But wildcard imports are idiomatic/common.

bazelbuild / rules_scala

support input discovery #777