com-lihaoyi / mill

Your shiny new Java/Scala build tool!
https://mill-build.com/
MIT License

Support `module.sc` files in subfolders #3213

Open lihaoyi opened 2 weeks ago

lihaoyi commented 2 weeks ago

This PR implements support for per-subfolder build.sc files, named module.sc. This allows large builds to be split up into multiple build.sc files each relevant to the sub-folder that they are placed in, rather than having a multi-thousand-line build.sc in the root configuring Mill for the entire codebase.
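A rough sketch of the intended layout follows. The folder foo/bar, the object names, and the target bodies are hypothetical, and the exact syntax accepted inside module.sc may differ from what the PR finally implements:

```scala
// ---- build.sc (project root; hypothetical example) ----
import mill._

object app extends Module {
  // A trivial target defined in the root build file
  def greeting = T { "hello from the root build.sc" }
}

// ---- foo/bar/module.sc (nested build file; hypothetical example) ----
import mill._

object qux extends Module {
  // Intended to be addressable as foo.bar.qux.baz, both in code and on the command line
  def baz = T { "hello from foo/bar/module.sc" }
}
```

The two files are shown in one block only for brevity; they are separate files on disk.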

Semantics

  1. The build.sc in the project root and all nested module.sc files within the Mill directory tree are discovered recursively (TODO: ignoring files listed in .millignore) and compiled together

  2. We ignore any subfolders that have their own build.sc file, indicating that they are the root of their own project and not part of the enclosing folder.

  3. Each foo/bar/qux/module.sc file is compiled into a millbuild.foo.bar.qux package object, with the root build.sc being compiled into a millbuild package object (rather than a plain object, as in the status quo)

  4. An object blah extends Module within each foo/bar/qux/module.sc file can be referenced in code via foo.bar.qux.blah, or from the command line via foo.bar.qux.blah
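The path-to-name mapping described in points 3 and 4 can be sketched as a pair of pure functions. This is only an illustration of the naming rule, not the PR's code; the object and method names (BuildFilePackages, packageFor, selectorFor) are made up for the example:

```scala
object BuildFilePackages {
  // Root package name taken from the description above
  val rootPackage = "millbuild"

  // Folder segments of a build file path given relative to the project root,
  // e.g. "foo/bar/qux/module.sc" -> List("foo", "bar", "qux")
  private def folders(relativePath: String): List[String] =
    relativePath.split('/').filter(_.nonEmpty).toList.dropRight(1)

  /** Name of the package object a build file would be compiled into. */
  def packageFor(relativePath: String): String =
    (rootPackage :: folders(relativePath)).mkString(".")

  /** Selector for `object <moduleName>` defined in the given build file. */
  def selectorFor(relativePath: String, moduleName: String): String =
    (folders(relativePath) :+ moduleName).mkString(".")
}
```

For example, `BuildFilePackages.packageFor("foo/bar/qux/module.sc")` yields `millbuild.foo.bar.qux`, and `BuildFilePackages.selectorFor("foo/bar/qux/module.sc", "blah")` yields `foo.bar.qux.blah`.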

Design

Uniform vs Non-uniform hierarchy

One design decision here is whether a module.sc file in a subfolder foo/bar/ containing object qux { def baz } would have its targets referenced via foo.bar.qux.baz syntax, or via some alternative, e.g. foo/bar/qux.baz.

A non-uniform hierarchy foo/bar/qux.baz would be similar to how Bazel treats folders vs. targets non-uniformly (e.g. foo/bar:qux-baz), and also similar to how external modules in Mill are handled (e.g. mill.idea.GenIdea/idea), as well as existing foreign modules. However, it introduces significant user-facing complexity:

  1. What's the difference between foo/bar/qux.baz vs foo/bar.qux.baz or foo/bar/qux/baz?
  2. What query syntax would we use to query targets in all nested module.sc files rather than just the top-level one e.g. __.compile?
  3. Would there be a correspondingly different way of referencing nested module.sc modules and targets in Scala code as well?

Bazel has significant complexity to handle these cases, e.g. query via ... vs :all vs *. It works, but it does complicate the user-facing semantics.

The alternative of a uniform hierarchy also has downsides:

  1. How do you go from a module name e.g. foo.bar.qux.baz to the build.sc or module.sc file in which it is defined?
  2. If a module is defined in both the root build.sc and in a nested module.sc, what happens?

I decided to go with a uniform hierarchy where everything, both in the top-level build.sc and in nested module.sc files, ends up squashed together in a single uniform foo.bar.qux.baz hierarchy.

Package Objects

The goal of this is to try and make modules defined in nested module.sc files "look the same" as modules defined in the root build.sc. There are two possible approaches:

  1. Splice the source code of the various nested module.sc files into the top-level object build. This is possible, but very complex and error prone, especially when it comes to reporting proper error locations in stack traces (filename/line number); this will likely require a custom compiler plugin similar to the LineNumberPlugin we have today.

  2. Convert the objects into package objects, such that the module tree defined in the root build.sc becomes synonymous with the JVM package tree. While the package objects will cause the compiler to synthesize object package { ... } wrappers, that is mostly hidden from the end user.

I decided to go with (2) because it seemed much simpler, making use of existing language features rather than trying to force the behavior we want using compiler hackery. Although package objects may go away at some point in Scala 3, they should be straightforward to replace with explicit export foo.* statements when that time comes.
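As a plain-Scala illustration of why approach (2) gives the "looks the same" property, the following sketch uses hand-written package objects. It is not Mill's generated code; the names blah, describe, and PackageObjectDemo are assumptions for the example:

```scala
// Stand-in for the wrapper that might be generated for foo/bar/qux/module.sc
package millbuild.foo.bar {
  package object qux {
    object blah {
      def describe: String = "defined in foo/bar/qux/module.sc"
    }
  }
}

// Stand-in for code living in the root build
package millbuild {
  object PackageObjectDemo {
    def main(args: Array[String]): Unit = {
      // From inside `millbuild`, the nested object is reachable by its folder path,
      // exactly as if it had been written as nested objects in one file.
      println(foo.bar.qux.blah.describe)
    }
  }
}
```

The package path does the work that nested `object foo { object bar { ... } }` declarations would otherwise do, which is what lets separately compiled files share one hierarchy.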

Existing Foreign Modules

Mill already supports foreign foo.sc files in which targets and modules can be defined, but it does not support referencing them from the command line.

I have removed the ability to define targets and modules in random foo.sc files. We should encourage people to put things in module.sc, since that would allow the user to co-locate the build logic within the folder containing the files it is related to, rather than as a bunch of loose foo.sc scripts. Removing support for modules/targets in foo.sc files greatly simplifies the desugaring of these scripts, and since we are already making a breaking change by overhauling how per-folder module.sc files work we might as well bundle this additional breakage together (rather than making multiple breaking changes in series)

build.sc/module.sc file discovery

For this implementation, I chose to have module.sc files discovered automatically by traversing the filesystem: we recursively walk the subfolders of the root build.sc project, looking for any files named module.sc. We only traverse folders with module.sc files, to avoid having to traverse the entire filesystem structure every time. Empty module.sc files can be used as necessary to allow module.sc files to be placed deeper in the folder tree.

This matches the behavior of Bazel and SBT in discovering their BUILD/build.sbt files, and notably goes against Maven/Gradle which require submodules/subprojects to be declared in the top level build config.

This design has the following characteristics:

  1. In future, if we wish to allow mill invocations from within a subfolder, the distinction between build.sc and module.sc allows us to easily find the "enclosing" project root.

  2. It ensures that any folders containing build.sc/module.sc files that accidentally get included within a Mill build do not end up getting picked up and confusing the top-level build, because we automatically skip any subfolders containing build.sc

  3. Similarly, it ensures that adding a build.sc file "enclosing" an existing project does not affect Mill invocations in the inner project, because we only walk to the nearest enclosing build.sc file to find the project root

  4. We do not automatically traverse deeply into sub-folders to discover module.sc files, which means that it should be almost impossible to accidentally pick up module.sc files that happen to be on the filesystem but you did not intend to include in the build

This mechanism should do the right thing 99.9% of the time. For the last 0.1% where it doesn't do the right thing, we can add a .millignore/.config/millignore file to support ignoring things we don't want picked up, but I expect that should be a very rare edge case
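The discovery rules described above (descend only into folders that contain a module.sc, and skip any subfolder that has its own build.sc) can be sketched with java.nio. This is an illustration under those stated rules, not the PR's implementation; BuildFileDiscovery and its methods are hypothetical names, and scala.jdk.CollectionConverters assumes Scala 2.13 or later:

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object BuildFileDiscovery {
  /** Direct sub-directories of `dir`, sorted by name for deterministic output. */
  private def subDirs(dir: Path): List[Path] = {
    val stream = Files.list(dir)
    try stream.iterator().asScala.filter(Files.isDirectory(_)).toList.sortBy(_.toString)
    finally stream.close()
  }

  /** module.sc files reachable below `dir` through an unbroken chain of module.sc folders. */
  private def nested(dir: Path): List[Path] =
    subDirs(dir).flatMap { sub =>
      val moduleFile = sub.resolve("module.sc")
      if (Files.exists(sub.resolve("build.sc"))) Nil          // root of another project: skip it
      else if (Files.exists(moduleFile)) moduleFile :: nested(sub)
      else Nil                                                 // no marker file: do not descend
    }

  /** The root build.sc followed by every discovered module.sc. */
  def discover(root: Path): List[Path] =
    root.resolve("build.sc") :: nested(root)
}
```

Because folders without a module.sc are never entered, deep trees such as src/ or node_modules/ cost one directory listing at their parent rather than a full walk.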

Compatibility

This change is binary compatible, but the change in the .sc file desugaring is invasive enough we should consider it a breaking change

Pull request: https://github.com/com-lihaoyi/mill/pull/3213

lefou commented 2 weeks ago

What are the semantics when mill is run from a sub-directory? Is this handled in any way?

  1. No, Mill will as usual try to handle the directory as project.
  2. Partly, Mill will detect and warn about being executed in a sub-project.
  3. Yes, Mill will find the root and interpret all path selectors with a sub-module prefix.

lihaoyi commented 2 weeks ago

@lefou I haven't thought about that yet. I'm not sure what the best thing to do is. I'm guessing (2) or (3), just because that's what other build tools seem to do

lefou commented 2 weeks ago

> @lefou I haven't thought about that yet. I'm not sure what the best thing to do is. I'm guessing (2) or (3), just because that's what other build tools seem to do

We should remember our single-source-(file)-of-truth concept and make split-up projects explicit.

Instead of looking up (guessing) sub-projects, we should explicitly list them in the parent project. I think we previously handled kind-of sub-projects with import @file but broke it later, but some import statement, or better some usage directive, would be nice. Additionally or alternatively, we could denote the root/parent project in the sub-project, so lookup is faster and more accurate, if we decide that supporting running from a sub-project is a good idea.

I would strongly recommend not guessing sub-project relations solely from the existence of files. Otherwise we end up with a chaotic and non-reproducible setup (which you can experience with Gradle, for example).

lefou commented 2 weeks ago

With an explicit sub-project configuration, it would be possible to aggregate multiple stand-alone projects (e.g. by pointing to a location outside the current project or some git submodules). As long as sub-projects don't use any resources of the aggregating project, their build results (in out/) should also be independent and we could just use the separate project-local out-directories for each aggregated project.

So, simply aggregating multiple Mill projects should not require a rebuild of each single project. But it would make re-use of external projects easy and lightweight.

lihaoyi commented 2 weeks ago

I think there are a few orthogonal things here

Aggregating standalone projects

This can be done with or without explicit references between subfolder build.sc files. Arguably it's even simpler without explicit references: aggregation then becomes a matter of a standard filesystem symlink, rather than a change to a custom configuration. But either way it can be done with or without explicit references

The challenge with aggregation is the things that are "global". These include:

There's no general solution for combining these global configs in different standalone projects when aggregating, but maybe some compromise is sufficient to be useful. Bazel has similar issues, and draws an arbitrary line of what works in aggregated projects that works well enough in practice (e.g. BUILD and WORKSPACE files are aggregated, but not .bazelrc).

These are actually similar to the problems involved with treating sub-directories as standalone projects when running mill inside of them. However, this is orthogonal to the question of explicit references between subprojects.

Explicit References Between Subprojects

This is something that different build tools handle differently:

  1. Bazel and its family (Buck, Pants) have no explicit references between subfolder build files
  2. SBT has no explicit references between subproject build files, but optionally you can define things in .scala files and explicitly reference those in your root build.sbt
  3. Gradle has explicit references from the root settings.gradle file to the sub-projects
  4. Maven has explicit references from the root pom.xml file to the sub-projects

So clearly all different approaches can work. I think my current inclination is to go the Bazel/SBT route rather than the Gradle/Maven route. For two reasons:

  1. Unlike handling of the meta-build, which is O(1) boilerplate, explicit references to submodules are O(n) boilerplate. This is not a problem when n is small and you only have a few subprojects, but I'd like to try and optimize Mill for when n is large. Maybe not O(1000s) like you see in Bazel projects, but at least O(10s) to O(100s), at which point the boilerplate from the root "registry" importing the subprojects becomes significant (not just lines of code, but also merge conflicts, etc.).

  2. My experience with Bazel is that "put a build file in a subfolder to configure the build for that subfolder" is a very easy workflow for developers to understand. Forcing people to add a reference to a top-level registry maven/gradle-style isn't necessary. Nobody has issues with understanding the "look for enclosing folder with a build file to see where the build config is" workflow, both humans and tools

Existing support for foreign modules

Existing "foreign modules" work, but they cannot serve the purpose of these nested build.sc files: they cannot be referenced from the command line, and can only be referenced if imported and defined in the root build.sc. While this works, I don't see many people making good use of them: none of com.lihaoyi, scala-cli, or coursier use them, but chipsalliance/chisel does. I think that's for good reason:

  1. There are many ways to organize your foreign modules, and so the relationship between the foreign module code and the subfolders in the project is arbitrary, which makes it non-obvious how to set them up initially and non-obvious how to read/understand the code later

  2. The boilerplate for importing and registering the foreign module logic in build.sc is tedious and non-obvious. In theory we can already do SBT-style explicit registration of foreign module code in the root build.sc, and it works just as easily as doing so in SBT, but it's so much more work than dumping everything in the same file that people (including myself) don't

In effect, the current way foreign modules and import $files work in Mill is very similar to SBT's "define stuff in Scala files and then use them in the root build.sc" workflow, which I don't think is sufficiently easy. e.g. digging through the chipsalliance/chisel Mill config, it's non-obvious how the code in the various helper traits in their foreign modules correspond to folders on disk, even though I could in theory guess at the naming convention or grep/jump around the references in code and figure it out

My goal for this effort is to solve both these issues:

  1. There should only be one obvious place to put a subfolder build.sc file related to a folder: in that folder. And there should only be one obvious place you go to find the code/config later: in the nearest build.sc file in the enclosing folders

  2. Moving logic into subfolder build.sc files should be easy: create file and copy/paste. In my experience, that workflow in Bazel is a lot easier for non-experts to work with than the workflow in SBT where you need to subsequently register your new logic in the root build.sbt


I only started looking into this a few days ago, so it's still pretty rough. Hopefully it'll become more concrete as the implementation progresses

lefou commented 1 week ago

@lihaoyi Thanks for your thorough explanations.

I think a link from the sub-project to the root or parent project would be the best solution.

The only downside of the sub-project idea as a whole is the lookup which we need to do with every Mill invocation.

Maybe, we can settle with a single enabler-option in the root module, so we only look for sub-projects if it is enabled. (//> using mill.scanSubProjects true).

lihaoyi commented 1 week ago

I do not think we should have a reference from each subproject to the root project. No other build tool does this. Unless we think we know something special that everything else doesn't, or we think we are smarter than the folks developing all the other build tools, there is no reason to innovate on this part of the design. SBT/Bazel/Maven/Gradle all work fine without such references, and everyone seems to have no problems understanding or navigating it.

As for maintaining references from the root project to subprojects, it basically comes down to a choice of following SBT/Bazel (no reference) or following Maven/Gradle (has reference). Either works; again SBT/Bazel/Maven/Gradle all work acceptably. The only question is then which precedence to follow.

In this case, I'd like to follow the Bazel/SBT precedence. Mill was always inspired by Bazel and SBT, and in this case both work the same: just drop a build.bazel/build.sbt file into a subfolder and that's it. Bazel and SBT have many pain points, but the way you create build files in subfolders is not one of them. So again, I don't think we should over-analyze this: there is precedence, and the precedence works well enough, at least in my experience as a professional build-systems engineer for the last decade or so.

It's certainly possible to come up with a build tool inspired by Maven and Gradle instead of inspired by Bazel and SBT, but such a build tool would look very different from Mill.

In general, I think "XXX does this" really has a lot of weight: it means someone else came up with the design as well, that the design has been battle-tested against a wide variety of use cases, that the problems and pitfalls and edge-cases are already known. Coming up with our own bespoke design means a brand new journey of exciting edge cases and design issues, which is not something I'd like to sign up for unless that design is truly part of our unique value proposition. In this case, "how we reference or discover subprojects" is not, and I'd like to pick an off-the-shelf solution that's aligned with Mill's overall design philosophy and origins.

Let's go with a marker import in the root build.sc, to keep the "single source of truth" thing and to delimit the root build file from the others

lefou commented 1 week ago

I'm not saying we are smarter than other tools or tool developers. But we (or I, to speak for myself) also have issues with these other tools that others obviously don't have or don't consider bad enough.

So, why do I think we should have that link? When mill is run from a directory, it typically assumes that it's in a (main) project directory. It reads a build.sc and creates an out directory to dump a lot of stuff into it. How can it detect whether it is a standalone, a sub-, or a root project? With the current design, it can't.

It can try to search the directory tree up- and downwards and can end up with various situations. It can find build.sc files in both directions, but there is no indicator whether it's meant to accumulate all found projects into one giant one.

Take the typical git worktree add temp example, where you check out a version of the current project into a sub-directory. Both the newly checked out tree and the currently existing project will start to act wild, since any wildcard target selector will start to behave differently. That's exactly a situation I would avoid at any price, and one we definitely avoided in Mill in every past feature decision, e.g. meta-builds.

Maybe it could work if we don't plan to provide any convenience for when mill is run from a child project. If we don't ever traverse parent directories and always behave as if the current directory is the root level, then it will be enough to assume as you suggested. But if we want to provide convenience for when mill is run from a child dir (e.g. mill clean in dir root/foo is the same as mill clean foo in dir root), then we should IMHO make that explicit. This is especially important if it needs to resolve references to other outer modules, but finds more than one build.sc file on the root-directory traversal.
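To make the convenience case concrete, here is a hypothetical sketch of how a selector typed in a child directory could be rewritten relative to a known project root. Nothing like this exists in the PR (running from subfolders is left to a follow-up); SubfolderSelectors and qualify are invented names, and scala.jdk.CollectionConverters assumes Scala 2.13 or later:

```scala
import java.nio.file.Path
import scala.jdk.CollectionConverters._

object SubfolderSelectors {
  /** Prefixes `selector` with the module path implied by `cwd`'s position below `root`. */
  def qualify(root: Path, cwd: Path, selector: String): String = {
    val rootAbs = root.toAbsolutePath.normalize
    val cwdAbs = cwd.toAbsolutePath.normalize
    require(cwdAbs.startsWith(rootAbs), s"$cwdAbs is not inside $rootAbs")
    // Folder names between the root and the current directory; the empty
    // relative path (cwd == root) yields a single empty segment, filtered out here.
    val prefix = rootAbs.relativize(cwdAbs).iterator().asScala.map(_.toString).filter(_.nonEmpty).toList
    (prefix :+ selector).mkString(".")
  }
}
```

Under this sketch, a selector such as compile typed in root/foo would be treated as foo.compile at the root. The hard part raised above is not this rewriting but deciding which enclosing folder counts as the root.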

E.g. in a work directory with various independent projects, I sometimes have a build.sc for some maintenance task. Now, with this new feature, any mill invocation in any of these projects will assume it's part of that outer project.

lihaoyi commented 1 week ago

If the concern is about running from subfolders, it seems the issue is largely about being able to distinguish root build files from nested build files, so any logic can walk the folder tree upwards and find the correct root, and also ignore unwanted things in subfolders when traversing the folder tree downwards to discover nested build files (esp. other mill builds in subfolders)

There are ways to do that other than via explicit links. Some ideas:

  1. We use a magic import to demarcate the root build.sc file, say import $subprojects._ (or whatever other syntax we come up with). The nearest enclosing folder with the magic import is the root build.sc. This would require adding the magic import to current build.sc files in order to work

  2. We use different filenames for the root and nested build files. E.g. maybe we force people to write build.sc for the root and mill-package.sc for nested ones? This should work out of the box with current build.sc files

  3. We use some kind of marker file. We already have .mill-version and .mill-jvm-opts, combining that into .mill-config and saying you must create a .mill-config file to demarcate a top level Mill project is possible (basically what Bazel does with its WORKSPACE file)

We can probably come up with other options, but of these I think option (2) is pretty ergonomic and easy to understand, while still making it unambiguous what is a root mill project vs what is a subfolder. When traversing upwards to find the enclosing project root, this will mark it unambiguously, and when traversing downwards to find nested build files it would correctly ignore nested projects with their own root build.sc files, without needing any code changes

The only risk is if someone has a top-level mill-package.sc file not wrapped in an enclosing build.sc, but that seems an obscure enough edge case not to cause issues
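With option (2), the upward walk reduces to "find the nearest enclosing folder that contains a build.sc". A minimal sketch of that walk, assuming the root file is always named build.sc (ProjectRoot and findRoot are hypothetical names, not part of Mill):

```scala
import java.nio.file.{Files, Path}

object ProjectRoot {
  /** Nearest folder at or above `start` that contains a build.sc, if any. */
  def findRoot(start: Path): Option[Path] = {
    @annotation.tailrec
    def loop(dir: Path): Option[Path] =
      if (dir == null) None                                   // walked past the filesystem root
      else if (Files.exists(dir.resolve("build.sc"))) Some(dir)
      else loop(dir.getParent)

    loop(start.toAbsolutePath.normalize)
  }
}
```

Nested files named differently (module.sc or mill-package.sc) never stop the walk, so only a genuine root build.sc terminates it.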

lefou commented 6 days ago

I think module.sc is a nice compromise. It means build.sc needs to strictly reside above each sub-module, which is a limitation. On the other hand, we don't need to parse anything before we can detect that we are in a sub-project. So it can be really fast.

Should we explicitly enable sub-module scanning in root projects? Without, we cannot avoid unnecessary lookup scans with each run. I think we should.

lihaoyi commented 6 days ago

Yes I think explicitly enabling it via a flag makes sense. Similar to what we do for the meta build, import $submodules._ or something

lihaoyi commented 5 days ago

Still chasing down some residual issues, but most tests are passing and this is probably ready for a first round of review

lihaoyi commented 5 days ago

I haven't implemented the "run mill command in subfolder of Mill project" logic in this PR; I'm leaving that to a follow up since this PR is already large enough

lefou commented 4 days ago

About the deep module.sc in sub-directories discovery: The suggested approach is to search recursively in all sub-directories. It can be therefore hard for a developer unfamiliar with a project to get a quick idea of the project layout.

What about just considering one level of sub-directories? This would make discovery for Mill as well as a human faster. I think most projects would already match that limited layout.

We could also discover sub-directories of detected sub-modules. Due to the uniform mapping of sub-directories to sub-modules, we could use an empty module.sc on the sub-path if we want the discovery to go on to deeper levels. As a consequence, each deeply nested sub-module leaves a trace (of module.sc files) in the directory tree traversal.

lihaoyi commented 4 days ago

@lefou what you suggest sounds reasonable. It would also prevent "accidental" picking up of random module.sc files deep inside the filesystem hierarchy

On the other hand, most platforms I'm aware of don't do that. Bazel/Buck/Pants allow arbitrarily deeply placed BUILD files that are picked up by recursive search. Python used to require __init__.py files in every enclosing folder, and eventually got rid of them in Python 3.3+ because they were too error prone and boilerplatey. JVM languages typically do a recursive search on the source folders and pick up everything inside without needing a package-info.java or package.scala file in each enclosing folder for discovery to count.

This makes me lean towards doing deep recursive discovery, rather than requiring empty module.scs as markers

lihaoyi commented 4 days ago

@lolgab what do you think of restricting target and module definitions to build.sc/module.sc files only, and not allowing them to be defined in random foo.sc scripts?

You implemented the status quo of foreign modules, so you probably use them the most and are most familiar with their use cases

lefou commented 4 days ago

For me, foreign modules never proved to be usable and I'm also not aware of any project that uses them. I'm all for removing them. Instead, we should think about a lightweight way to use/aggregate independent projects in a way, that also re-uses their cached target results.

lefou commented 4 days ago

@lihaoyi

> @lefou what you suggest sounds reasonable. It would also prevent "accidental" picking up of random module.sc files deep inside the filesystem hierarchy
>
> On the other hand, most platforms I'm aware of don't do that. Bazel/Buck/Pants allow arbitrarily deeply placed BUILD files that are picked up by recursive search.

I don't think our main aim is to compete with existing projects using these build tools, since those target rather large projects. If we try to be an alternative to Maven/Gradle/sbt/ant, we don't clash with any expectations if we don't recursively look up modules. But especially in large projects, non-recursive scanning can make usage faster.

> Python used to require __init__.py files in every enclosing folder, and eventually got rid of them in Python 3.3+ because they were too error prone and boilerplatey.

Isn't the __init__.py file a non-empty parsed file, whereas we would try to support empty module.sc files, which should not cause any additional boilerplate beyond their marker function? If non-empty, a module.sc will typically also contain module definitions. So I don't think this is that strong an argument.

> JVM languages typically do a recursive search on the source folders and pick up everything inside without needing a package-info.java or package.scala file in each enclosing folder for discovery to count.

We do support real JVM behavior below the src directory, once you enable the meta-build. The build.sc/module.sc files are meant to be processed by Mill and understood by the build maintainer, which is a hard job for all build tools to date, and only a handful of developers like to know/understand its details. We should make it as easy as possible to understand a project layout (since it is important to enable all developers to own the build).

Since projects can contain deeply nested directories, e.g. with test data and also sources and resources, by omitting these for traversal by default, we may save a lot of filesystem IO.

> This makes me lean towards doing deep recursive discovery, rather than requiring empty module.scs as markers

Lastly, we could let the user decide. When we introduce a flag to enable scanning, instead of on/off, we could make it a no/flat/recursive instead. It's not my preferred solution though.

lihaoyi commented 4 days ago

module.sc as a marker is more or less identical to __init__.py: a place where you can put stuff, and a marker that something is a folder that might contain other things you care about. People created empty __init__.py files all the time, similar to how people may need to create empty module.sc files

The point about avoiding walking the entire filesystem every time is valid. Unlike walking src folders for Java or Scala files, Mill would have to walk your entire project directory recursively. That's a lot of walking, and if there are some random folders full of small files (e.g. node_modules/) that walking could be very expensive. Making it gitignore-aware would help somewhat, but would add a bunch of complexity over simply not walking the entire filesystem tree.

lolgab commented 4 days ago

About removing current foreign modules support, I think it's better to release this new feature in Mill 0.11 since it's binary compatible, and deprecate usages of foreign modules (log a warning when importing a foreign module with import $file). Otherwise we would need to wait for 0.12 to release something not really tested in the wild, which would then need Mill 0.13 to be changed.

About recursive search: I think it's better to limit the search to subdirectories of directories containing a module.sc file. I don't think projects have very deep module structures, one I see very often is:

./modules/module1
./modules/module2
./modules/module3

where they would only need to create an extra module.sc in modules.

On the other hand, source code tends to have very deep directory structures, like:

/module/test/src/com/company/project/foo/bar

with hundreds of directories, and it seems wasteful and weird to automatically add to the build a file in a folder so deep in the file-system.

Haven't used Python a lot but it seems to me that the comparison with __init__.py files is not in one-to-one relationship with Mill modules, since you can have many many scala packages in the same module. Having to add a file for every Scala package in a project would be very annoying, but one per directory in a deep module structure, not so much. Moreover, I would expect most people to have at most 2 levels of modules.

One thing I would ask for this feature, is to implement some guardrails to avoid name clashes. When object foo in build.sc and ./foo/module.sc are defined at the same time, I would expect Mill to fail with an error message mentioning the clash.
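A guardrail of the kind requested here could look roughly like the following. It is only a sketch of the check, not code from the PR: NameClashCheck and clashes are hypothetical names, the inputs are assumed to be already known (top-level object names from build.sc and discovered module.sc paths relative to the root), and clashes at deeper nesting levels are not covered:

```scala
object NameClashCheck {
  /** Error messages for top-level objects in build.sc that collide with <name>/module.sc. */
  def clashes(rootObjects: Set[String], moduleFiles: Seq[String]): Seq[String] =
    for {
      file <- moduleFiles
      segments = file.split('/').filter(_.nonEmpty)
      // Only consider module.sc files sitting directly in a first-level subfolder
      if segments.length == 2 && segments.last == "module.sc"
      name = segments.head
      if rootObjects.contains(name)
    } yield s"object $name in build.sc clashes with $file"
}
```

For instance, `NameClashCheck.clashes(Set("foo"), Seq("foo/module.sc", "baz/module.sc"))` reports a single clash for foo, which Mill could then turn into a build failure with that message.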

lihaoyi commented 4 days ago

> About removing current foreign modules support, I think it's better to release this new feature in Mill 0.11 since it's binary compatible, and deprecate usages of foreign modules (log a warning when importing a foreign module with import $file). Otherwise we would need to wait for 0.12 to release something not really tested in the wild, which would then need Mill 0.13 to be changed.

I don't think it's realistic to release this PR as part of 0.11.x. It's binary compatible according to mima, but there's a bunch of ways in which the generated code can result in incompatibility. Furthermore, if we want to clean up the existing semantics further as part of the breakage, that will involve further changes in the desugaring.

I don't see a path forward for testing out the new feature in a backwards compatible way when the whole point of it is to overhaul the existing semantics. Realistically, this will need to go into a 0.12.x, and any testing will need to be done on RCs or milestones. Dogfooding the 0.12.0-RCs on the com-lihaoyi projects and other projects we own should give us confidence that things generally still work

lefou commented 4 days ago

We use the same version for Mill releases and its supportive high-level abstractions like scalalib. Technically, if we don't change scalalib, we could easily cut a (experimental/preview/unstable) 0.12 line, which is still binary compatible, so that at least for a while, we could keep using our existing plugin ecosystem, with Mill itself getting new and dropping old features that don't affect the API to configure modules.

lefou commented 4 days ago

@lihaoyi @lolgab

What do you think of creating a feature-preview branch, where we merge some (binary-compatible) PRs like this one for early testing?

I think, nothing beats an easily accessible development release, with which we can test features on real projects.

lihaoyi commented 4 days ago

Works for me. We can just create a 12.x branch and merge into that

lihaoyi commented 3 days ago

I updated the PR to only traverse folders with module.sc files, and to remove the ability to define modules and targets in other foo.sc files

Let's review this PR as is, but leave it un-merged for now. When we have a number of other breaking PRs ready, we can merge them into master, cut 0.12.0-RC releases for testing, and create a 0.11.x maintenance branch for any PRs that still need to target 0.11.9 and above

lolgab commented 2 days ago

It seems that mill resolve is not printing the modules defined in module.sc files. Is this still to be implemented?

Also, can we add a # TODOs section in the PR and add the check for double definition (foo/module.sc and object foo in build.sc)?

lihaoyi commented 2 days ago

@lolgab probably a missing implementation, will take a look

lefou commented 2 days ago

I think in the context of this feature there should be some guidelines and/or restrictions on which magic imports work or should be used in which locations. E.g. an import $ivy worked in build.sc and imported files, so I assume it also works for module.sc. Do we detect and report conflicts? (If not, we might just see some coursier error message without any hint of where it originates.) An import $meta should only work from a build.sc, but do we ensure it?