benchkram / bob

Bob is a high-level build tool for multi-language projects.
https://bob.build
Apache License 2.0
461 stars 14 forks source link

`task semantics` rebuild behavior #229

Open Equanox opened 2 years ago

Equanox commented 2 years ago

The current behavior of input/target combination feels a bit odd, which could lead to suprises for new users.

Current Implementation

1) rebuild on input change
  build:
    input: *
    cmd: go build -o run
    target: run

2) no input =>  will only build once
  build:
    cmd: go build -o run
    target: run

3) no input, no target => will only build once (creates a target not known to bob)
  build:
    cmd: go build -o run

4) no input, no target => will only build once
  build:
    cmd: echo Hello World

5) no target => rebuild on input change
  build:
    input: *
    cmd: echo Hello World

6) rebuild when dependecy changed (the task does nothing)
  build:
    dependOn: [ generated ]

7) rebuild when dependency or input changed (the task does nothing)
  build:
    input: *
    dependOn: [ generated ]

Changing the default value for input to be *

1) rebuild on input change 2) rebuild on input change 3) rebuild on input change (creates a target not known to bob which is problematic) 4) rebuild on input change 5) rebuild on input change 6) rebuild on input change & dependency change 7) rebuild on input change & dependency change

Taking target into account

Appendix

Case 3) could be detected by a bob verifytaskintegrity (please use a better command name) which would build tasks one by one and track filesystem & docker registry changes. At the end it prints some target suggestion for specific tasks. This could also be helpful when migrating from other build systems or adding docker builds. In that cases targets usually are hidden deep down in the build instruction (e.g. Dockerfile)

puengel commented 2 years ago

Suggestion for task name bob verify and then add what you want to verify tasks --> bob verify tasks.

In my opinion bob should act this way: Whenever bob can't verify it input or target is changed it should build. Probably this would result in default rebuild: always if input and target are unset.

  1. rebuild on input change
  2. always rebuild
  3. always rebuild
  4. always rebuild
  5. rebuild on input change

We could also take dependencies into account.

5) no input, no target, with dependency
  build:
    cmd: echo Hello World
    dependsOn: otherTask
  1. build whenever dependency was build
dketterer commented 1 year ago

I think we should try to mimic the make behavior "PHONY" tasks of rebuilding as often as possible.

For tasks that can not be cached I suggest we add that information to the console summary at the end of bob build. That helps the user to identify possible optimizations.

Besides taking the presence of input and target into account I want to make explicit that the absence of the target file or no files in the target folder qualifies as reason to rebuild.

I suggest to following behavior:

  1. rebuild on input change
  2. always rebuild
  3. always rebuild
  4. always rebuild
  5. always rebuild
  6. rebuild on dep change
  7. always rebuild

About 5 and 7:

I argue that if you don't know if the target is still present or if it was deleted or changed then you should always rebuild. In the current implementation failures are hard to debug. Bob says "cached" which is not true since if bob was asked to show the "cached" files it would have no answer. IF it was to stay the current way, bob should at least change the information in the summary to something like "not-rebuild".

Further, a call of bob clean targets should let ALL tasks rebuild. In the current implementation there is no way to trigger a rebuild of case 5) without changing the input.

Equanox commented 1 year ago

Bob is a simpler Bazel alternative for multi-language projects of any size. It's simpler to use then Bazel as it allows to build targets/outputs directly in the repository. This includes that when in doubt we should optimize for bigger repositories.

With that in mind let's go one step back and bring this down to the physics to answer the following questions.

Analysis

We have to deal with three kind of tasks:

A (Verify/Analyze a Codebase)

Impacted Scenarios: 4) 5) A task which analyzes a codebase needs inputs in some cases to avoid unnecessary executions. Linting can be an expensive operation, and avoiding task execution can be beneficial.

lint:
 input: ui/**/*.ts
 cmd: npm run lint

B (Alter a Codebase)

Impacted Scenarios: 1) 4) 5) A task altering a codebase needs inputs & outputs.

The case of fmt and lint

For something like go fmt or lint --fix the input & output are the same files.

fmt:
  input: */**/*.go
  cmd: go fmt ./…
  target: */**/*.go # Which is not possible.

Defining those kind of targets is tedious and maybe even unnecessary as those are changes which are best transported through a repository/git and NOT through an shared artifact cache. Though it can be useful in a pipeline to assure code quality standards like formatting. This can usually be done by using a task like this test -z $(go fmt ./…)

The case of go mod tidy

Let's also look at go mod tidy which could also be interpreted as a task which should be done manually. But due to its nature of only changing two files it is tempting to define go.mod & go.sum as targets.

gomodtidy:
  input: */**/*.go
  cmd: go mod tidy
  target: |
    go.mod
    go.sum

It modifies go.mod and go.sum up to the current state defined in the sources and download dependency to a local cache directory which is not located in the repository. It is usually used to bring the system into a state on which it is possible to build the project. Therefore it's beneficial to have it in a build graph. Hint: For this to succeed all code must be available.

Here is a discussion on go mod tidy when used in bazel, https://github.com/bazelbuild/rules_go/issues/2262

C (Create)

Impacted Scenarios: 1) These tasks are actually doing the main work and they require inputs and outputs and usually they also depend on other tasks to span the build-graph.

worker:
  input: */**/*.go
  cmd: go build -o run
  target: run
  dependsOn: 

Conclusions

Why should we enforce the default input be empty and a rebuild policy of "No Input Never Built"?

Can a plugin system make the life of users easier (e.g. thinking less about inputs)? I guess so. Lot of input, targets or combinations of both are task specific and can be reused.

When to define targets? Aka, what kind of tasks need NO targets? Tasks which targets/outputs are transported through the repository like go fmt or npm lint --fix. Files which are put on a .gitignore like binaries or node_modules.

When advising users to do thing like go fmt ./... outside of the build tree. How to assure the same go version as in the bob.yaml? As already suggested by @dketterer a nice solution would be to spawn a shell into the environment of a task, similar to what nix-shell does.

Behavior based on those changes

1: rebuild on input change (or invalidated target). 2: error, no input provided! 3: error, no input provided! 3: error, no input provided! 4: error, no input provided! 5: rebuild on input change => As @dketterer change summary to "not-rebuild" instead of "cached". 6: Never build ( only dependencies). Usually a sum task or main entry point. 7: Invalid task (input or target without command should be invalid)

This hopefully removes some confusion as there are less valid cases which must be considered.

Answers

@dketterer

Further, a call of bob clean targets should let ALL tasks rebuild. In the current implementation there is no way to trigger a rebuild of case 5) without changing the input.

Use bob clean system to trigger a rebuild. Maybe a better solution would be a command like bob clean . to only remove buildinfo from the current project.

@puengel I can't think of any real world scenario. Would be nice to find one. Otherwise i tend to ignore that case to reduce possible scenarios.

5) no input, no target, with dependency build: cmd: echo Hello World dependsOn: otherTask

Appendix

Why do input & target exist in Bob?

input: Skip the task target: Share / Correctness / Performance (Bob Internal)

Inner working

Before doing any work Bob builds an internal view of the repository based on inputs & output files. It does not yet read the contents of the files, though it removes outputs of child tasks from the current tasks inputs. Only when actually sending a task to the build queue the file content is read and input & target hashes are computed and compared which still has some potential for optimizations but runs already in parallel.

dketterer commented 1 year ago

I have some comments that we already discussed:

About case 3/4/5:
Just to be explicit about it: They should not return errors if the rebuild: always flag is set.

Regard bob clean target or bob clean .:
I think it would be valuable to trigger the rebuild of all tasks in a bob.yaml be deleting all targets defined in it and their cache / cache status.

There is already an old issue (https://github.com/benchkram/bob/issues/73) with this topic which needs refinement since it was before the introduction of targets and system.

rdnt commented 1 year ago

I agree with the above suggestion by @Equanox and the additional comment from @dketterer; cases where the user would get 'error: no input specified' should not return the mentioned error if the rebuild: always flag is set.

The tasks mentioned in case (6) are called "compound tasks" (per vscode & intellij's docs, for example) and I agree in that they should always just be a way to group tasks together.

Regarding the discussion we had today offline, I think we do not have to introduce a new task type, but just consider tasks with rebuild: always and their dependents as simple run tasks. Caching and incremental builds should be turned off for these and their dependents, and users will not have to resort to separate makefiles etc. if we treat them that way. This solution can also be quite robust, as a part of the DAG can use the incremental build system and the rest can just exist as a task runner. The run tasks we currently have were also modeled with this behavior, albeit with little integration with the actual build system part.

Additionally, for tasks with inputs and no targets, we should consider capturing stdout/stderr of them and storing that text as target. I'm thinking of a lint step where you ideally don't want to execute it multiple times for the same filetree, but in case of errors during lint you want these to continue showing up.

Equanox commented 1 year ago

@rdnt stumbled upon another, not yet, well defined case. rebuild:always with a target set.

The question is: Should we allow a target when rebuild:always is set?

Analysis

In case of a repository internal source (depends on files inside the repository) the task output should not be declared as a target, cause the output can always be regenerated and therefore must not be transported through an artifact. Example

gomodtidy:
  cmd: go mod tidy
  rebuild: always

In case of a external source with a output written to the repository and no source files in the repository (like a package.json for npm install) the output should either be stored inside the repository (vendoring) or always being downloaded using rebuild: always. For both cases it’s not necessary to create a target.

downloadblogposts:
  cmd: curl -X get https://.....
  rebuild: always

In both cases I'd be careful adding such a task to a build-graph as it forces a rebuild of all parent tasks.

Discussion

From a functionality point of view a target should not be set when rebuild: always is set.

BUT

One reason for rebuild: always being used together with a target is debugging. It allows to easily change the rebuild behavior.

Conclusion

Printing a warning in that case seems to be the most user friendly approach for now.

Appendix

Internally I'd like to keep the rebuild: always functionality in the future... it might just be necessary to remove it from the yaml in the long run.