How to generate one output from many inputs (aggregate builder)?

filiph commented 7 years ago

It's not clear to me how to generate one output file that depends on many input files.

Let's say I want to generate a file that includes names of all subclasses of class X in a directory. *) These subclasses can live in various files in my /lib but I only want to build one JSON file.

It's not clear how to do this. BuildStep has only one inputId and while there's hasInputs, there is no way to get all of them (or a subset). Builder only ever gets one BuildStep.

In Transformer world, there's AggregateTransformer. If there's a similar thing in Build then it's not well documented.

*) Why would I want to do this? For easier testing, for example. When I implement a new subclass I want to make sure it's used in every place where it should be used as soon as I run unit tests (or sooner, via static errors). I use this in Python and it's really, really effective.

jakemac53 commented 7 years ago

There is no AggregateTransformer option in build today, but there isn't any technical reason it couldn't exist.

The primary reason it doesn't exist is because its usage was heavily abused in transformers in ways that caused extreme slowness, and I hadn't seen a valid use case up until this point (in practice, plenty in theory).

People tend to use this any time they need to read more than one file (not realizing they could use the Transform object to read additional files, or maybe they found it easier?), and would wrap up all codegen for an entire package in one single step instead of treating each entry point individually.

We could achieve the same functionality through two different features, a Builder with no specified entry point file (we should do this anyways I think) and an api that is equivalent to Directory#list. The internal AssetReader interface already has this directory listing capability, and we could implement that in the barback world using an AggregateTransformer. It would not be able to list your dependencies files, but that might be ok. This would have to be opt in, at least in the transformer world, as we don't want all wrapped builders to be aggregate transformers. WDYT?

jakemac53 commented 7 years ago

cc @natebosch @kevmoo

natebosch commented 7 years ago

What would be the api for configuring this? Would there be a BuildStep per directory? A single BuildStep for the entire build?

jakemac53 commented 7 years ago

What would be the api for configuring this? Would there be a BuildStep per directory? A single BuildStep for the entire build?

Definitely up for debate, but one option would be a different type of Builder, whose basic api could look something like this:

// better name tbd
abstract class NoInputBuilder {
  // Possibly supply the root package name somehow?
  Iterable<AssetId> get outputs;

  // NoInputBuildStep is a BuildStep with no `inputId`. It should probably have
  // a `package` getter though (the current package being ran on).
  Future build(NoInputBuildStep buildStep);
}

This would mean setting up these build steps globally as part of the build, as they have fixed outputs.

So for the example that prompted this issue you would have something like the following:

class SubclassCollectorBuilder implements NoInputBuilder {
  @override
  final outputs = [new AssetId('my_package', 'test/all_subclasses.json')];

  @override
  Future build(NoInputBuildStep buildStep) {
    // Lists all files under `buildStep.package`.
    var files = await buildStep.list(globs: ['lib/**/*.dart']);
    // Do whatever you want with those files, then use normal write apis.
    // Open question: How to get a resolver?
  }
}

natebosch commented 7 years ago

If we're ok with a BuildStep per package I think we can make that work.

I wonder how much we can reuse existing interfaces rather than create parallel families. The NoInputBuildStep could be a BuildStep where the inputId is an asset with a package but no path.

The Builder class should probably be separate.

jakemac53 commented 7 years ago

If we're ok with a BuildStep per package I think we can make that work.

Ya, this doesn't really shield users from the same pitfalls as AggregateTransformer unfortunately. I do think there are valid use cases for Builders which don't have a single primary input though.

I wonder how much we can reuse existing interfaces rather than create parallel families. The NoInputBuildStep could be a BuildStep where the inputId is an asset with a package but no path.

I thought about this as well but I think its going to be more error prone. It would also really suck to have path be nullable if we get non-nullable types in the future. Pretty much all code today assumes a non-null path for any AssetId.

That being said I hate the NoInputBuildStep name, I am just drawing a blank on a better one right now. Maybe PackageBuildStep (and PackageBuilder)?

matanlurey commented 7 years ago

/me chimes in

I think the confusing part for users right now is "how do I read and resolve multiple files to build a single output". It would be nice for our solution to also help here so we don't have to introduce yet-another type or breaking change in the near future.

jakemac53 commented 7 years ago

@matanlurey Yes this same solution can solve that generally, we just have to figure out the api for the resolve part but that is really only an api question.

jakemac53 commented 7 years ago

Sidenote @filiph: Are all your subclasses you care about reachable from your main entry point dart file in lib? If so you could could just have that be your primary input, and then you just resolve the world from there and crawl for all the subclasses you care about.

natebosch commented 7 years ago

I think we could also add globs and/or directory listing capabilities on the existing BuildStep if that would help.

The main question is whether the builder can be modeled as <input file> -> <output> or if it needs to be <package> -> <output>

jakemac53 commented 7 years ago

I think we could also add globs and/or directory listing capabilities on the existing BuildStep if that would help.

Yes, I wasn't super clear about that but I was proposing adding that to both BuildStep impls. The main issue with that though is that we need to have two different BuilderTransformer wrappers, one which supports that (via AggregateTransformer) and one which doesn't.

filiph commented 7 years ago

@jakemac53: Not originally, but it's a valid approach, yes.

To give more context: I've had the following happen to me many times, and not only in Dart:

There's a structure that needs many implementations. Either subclasses or functions or instances. A typical example would be a class per type of component.
I implement a new member of this set.
I run tests and everything's green.
I run the app and after way too many seconds of pondering I realize I haven't added the new member to file X.
<facepalm>

I've seen that when this is done automatically (or when the omission is reported as error early) there is a productivity boost.

The approach with a single entry-point is valid but only useful if there is file X that includes all members and I don't forget to add the new member there (to file X). In reality, I often forget even that.

jakemac53 commented 7 years ago

The approach with a single entry-point is valid but only useful if there is file X that includes all members and I don't forget to add the new member there (to file X). In reality, I often forget even that.

@filiph that is a valid point. Do you have an opinion about an api that lets you list directories versus something closer to the existing aggregate transformer api?

natebosch commented 7 years ago

There are 2 parts to this:

What is the 'anchor' for the output file - what input are we running on. Today the 'anchor' is an input file. Should we also allow a directory or a package as an 'anchor'?
How do you discover what exists if you can't crawl an import graph?

I think we're in agreement that we want a glob or directory listing api on BuildStep which handles the second point.

The question I still have is the first point. Would it be acceptable to have a file as an 'anchor' for the result?

Here's a straw man: I have a directory lib/implementations/ which has a variable number of files. I want to create a single file lib/all_implementations.dart which has an entry for each file under lib/implementations/ or each class within all of those libraries.

Is it OK for us to require the (empty) file lib/all_implementations.placeholder and then write a Builder which takes .placeholder as input and creates .dart as output, and in the middle reads all the files under lib/implementations/. It won't actually read it's input, just use it to know that an output must exist there.

jakemac53 commented 7 years ago

Is it OK for use to require the (empty) file lib/all_implementations.placeholder and then write a Builder which takes .placeholder as input and creates .dart as output, and in the middle reads all the files under lib/implementations/. It won't actually read it's input, just use it to know that an output must exist there.

I don't think we should require that, it does have the advantage of simplifying the api surface area but its also a pretty major annoyance for developers I would think.

natebosch commented 7 years ago

After some offline discussion: It would be expensive to have a new variant of Builder and all the associated config and machinery. We know we want to support the directory listing stuff either way - we'll do that and document the patterns for using the placeholder file pattern. After that we can get some feedback on the pattern and reevaluate whether we want a way to handle it without the placeholder. Being able to do it without the placeholder might be the nicer UX, but we aren't sure yet if it's worth the cost.

filiph commented 7 years ago

FWIW, my initial expectation was that I'll get a build step that has "statically" specified glob of inputs (so, not globbing inside build() but before that, in Phases). But I may be biased from barback and blaze.

I'm of course okay with anything that works.

matanlurey commented 7 years ago

I think having better documentation and examples is likely going to be the story here looking at the work required - we should show an example of converting an aggregate transformer as part of the docs!

natebosch commented 7 years ago

One thing I hadn't realized but came up in gitter while discussing sass_builder - the placeholder file + directory listing APIs don't help use with picking up changed files.

We probably will need to have an AggregatBuilder api so that multiple files can trigger the rebuild

matanlurey commented 7 years ago

@natebosch How will this work with bazel? Isn't that already the case?

natebosch commented 7 years ago

Yes I think with bazel we could make it work without AggregateBuilder since bazel is deciding when to run and the other files would be srcs

I suppose we could also look at build_runner and see if we can make it smarter about these things. @jakemac53 any thoughts?

luisvt commented 7 years ago

would it be possible to add inputs field to the Builder or to create an AggreateBuilder that extends Builder, something like:

abstract class AggregateBuilder extends Builder {
  Set<AssetId> inputs;
}

then in https://github.com/dart-lang/build/blob/master/build_runner/lib/src/generate/build_impl.dart#L349 you could add:

if(action.builder is AggregateBuilder) action.builder.inputs = inputs;

then Inside my AggreateBuilder I just would check if any of the inputs could be treated as mainInput and then process it. So in sass_builder I would do something like:

class SassBuilder extends AggreateBuilder {
  @override
  Future build(BuildStep buildStep) async {
    var inputId = buildStep.inputId;

    if (basename(inputId.path).startsWith('_')) {
      inputs.where((input) => !basename(input.path).startsWith('_')).forEach((input) {
        var css = await render(input.path, packageResolver: await PackageResolver.current.asSync);
        buildStep.writeAsString(_changeExtension(inputId), css);
      });
      return;
    }

    var css = await render(inputId.path, packageResolver: await PackageResolver.current.asSync);
    buildStep.writeAsString(_changeExtension(inputId), css);
}
...

jakemac53 commented 7 years ago

One thing I hadn't realized but came up in gitter while discussing sass_builder - the placeholder file + directory listing APIs don't help use with picking up changed files.

We probably will need to have an AggregatBuilder api so that multiple files can trigger the rebuild

This shouldn't be an issue, but we would need to add a notion of entire directory input nodes to the asset graph which shouldn't be too challenging? We could also integrate glob support to help you trim down the dependencies.

Essentially the additional logic would be the following:

When you list a directory (with a glob or not):
- Create a directory asset node in the graph and add all outputs of the current action as outputs of the directory node (same logic as if you read additional files using the buildStep, just a different type of node).
- Add this to a special set of all directory nodes so we can look them up quickly later.
- Specific asset nodes will be added as inputs only if you actually read them (adds/deletes will be covered by the directory node being invalidated).
When a file is created or deleted:
- If any 'directory' nodes matches its path:
- Invalidate all outputs of the directory node.
When a file is edited there is no extra work required.

natebosch commented 7 years ago

We definitely want to do the glob api so I filed a separate issue to track - the remaining points around placeholder files etc are still unsettled.

jodinathan commented 7 years ago

Answered my SO question with a full example using a placeholder file:

https://stackoverflow.com/questions/43902443/generate-one-file-for-a-list-of-parsed-files-using-source-gen-in-dart/44396362#44396362

jakemac53 commented 7 years ago

Closing as the remaining issues with placeholders is a dupe of https://github.com/dart-lang/build/issues/421

filiph commented 6 years ago

Just to clarify: is there currently still no plan to introduce an AggregateBuilder? Should we consider the placeholder file workaround a final solution (for the foreseeable future)?

I'm fine with that, just checking that I'm not missing some context.

matanlurey commented 6 years ago

@filiph Yup, we went with a more generic approach that will work with all build systems.

You can see a decent example of this expected behavior here: https://github.com/dart-lang/build/commit/65b08f6d5710ed78389bf1db74c5dc415e1b4819

We definitely could use better user-documentation on creating these types of builders, though.

thosakwe commented 6 years ago

@matanlurey So, if I’m understanding this correctly, the approach is to wrap the inputs in a regex, surrounded by $?

jakemac53 commented 6 years ago

@matanlurey So, if I’m understanding this correctly, the approach is to wrap the inputs in a regex, surrounded by $?

Not quite - there are special synthetic assets in a few top level directories, which are only there for you to use as placeholders for these types of builders. These look like lib/$lib$, web/$web$, etc (that is the full path).

To glob your sources you would use the findAssets api on BuildStep, you simply ignore the primary input in this case (they are not readable).

Note that this is not actually the solution described in previous comments, its a simpler solution that we landed on after some offline conversations. In practice it actually seems to work out quite well.

dart-lang / build

How to generate one output from many inputs (aggregate builder)? #213