Open lihaoyi opened 2 weeks ago
What are the semantics when `mill` is run from a sub-directory? Is this handled in any way?
@lefou I haven't thought about that yet. I'm not sure what the best thing to do is. I'm guessing (2) or (3), just because that's what other build tools seem to do
We should remember our single-source-(file)-of-truth concept and make split-up projects explicit.
Instead of looking up (guessing) sub-projects, we should explicitly list them in the parent project. I think we previously handled kind-of sub-projects with `import @file` but broke it later, but some `import` statement or, better, some usage directive would be nice. Additionally or alternatively, we could denote the root/parent project in the sub-project, so lookup is faster and more accurate, if we decide that supporting running from a sub-project is a good idea.
I would strongly recommend not guessing sub-project relations solely from the existence of files. Otherwise we end up with a chaotic and non-reproducible setup (which you can experience with Gradle, for example).
With an explicit sub-project configuration, it would be possible to aggregate multiple stand-alone projects (e.g. by pointing to a location outside the current project or to some git submodules). As long as sub-projects don't use any resources of the aggregating project, their build results (in `out/`) should also be independent and we could just use the separate project-local `out` directories for each aggregated project.
So, simply aggregating multiple Mill projects should not require a rebuild of each single project. But it would make re-use of external projects easy and lightweight.
I think there are a few orthogonal things here
This can be done with or without explicit references between subfolder `build.sc` files. Arguably it's even simpler without explicit references: aggregation then becomes a matter of a standard filesystem symlink, rather than a change to a custom configuration. But either way it can be done with or without explicit references
The challenge with aggregation is the things that are "global". These include:
- `mill-build/` meta-build folder
- `.mill-version` or `.config/mill-version`
- `.mill-jvm-opts`
There's no general solution for combining these global configs in different standalone projects when aggregating, but maybe some compromise is sufficient to be useful. Bazel has similar issues, and draws an arbitrary line of what works in aggregated projects that works well enough in practice (e.g. `BUILD` and `WORKSPACE` files are aggregated, but not `.bazelrc`).
These are actually similar to the problems involved with treating sub-directories as standalone projects when running `mill` inside of them. However, this is orthogonal to the question of explicit references between subprojects.
This is something that different build tools handle differently:
- […] `.scala` files and explicitly reference those in your root `build.sbt`
- […] `settings.gradle` file on the sub-projects
- […] `pom.xml` file on the sub-projects
So clearly all different approaches can work. I think my current inclination is to go the Bazel/SBT route rather than the Gradle/Maven route. For two reasons:
Unlike handling of the meta-build which is O(1) boilerplate, explicit references to submodules is O(n) boilerplate. This is not a problem when `n` is small and you only have a few subprojects, but I'd like to try and optimize Mill for when `n` is large. Maybe not O(1000s) like you see in Bazel projects, but at least O(10s) to O(100s), at which point the boilerplate from the root "registry" importing the subprojects becomes significant (not just lines of code, but also merge conflicts, etc.).
My experience with Bazel is that "put a build file in a subfolder to configure the build for that subfolder" is a very easy workflow for developers to understand. Forcing people to add a reference to a top-level registry maven/gradle-style isn't necessary. Nobody has issues with understanding the "look for enclosing folder with a build file to see where the build config is" workflow, both humans and tools
Existing "foreign modules" work, but they cannot serve the purpose of these nested `build.sc` files: they cannot be referenced from the command-line, and can only be referenced if imported and defined in the root `build.sc`. While this works, I don't see many people making good use of them: none of com.lihaoyi, scala-cli, coursier, use it, but chipsalliance/chisel does. I think that's for good reason:
There are many ways to organize your foreign modules, and so the relationship between the foreign module code and the subfolders in the projects is arbitrary, which makes it non-obvious how to set them up initially and non-obvious how to read/understand the code later
The boilerplate for importing and registering the foreign module logic in `build.sc` is tedious and non-obvious. In theory we can already do SBT-style explicit registration of foreign module code in the root `build.sc`, and it works just as easily as doing so in SBT, but it's so much more work than dumping everything in the same file that people (including myself) don't
In effect, the current way foreign modules and `import $file`s work in Mill is very similar to SBT's "define stuff in Scala files and then use them in the root build.sc" workflow, which I don't think is sufficiently easy. e.g. digging through the chipsalliance/chisel Mill config, it's non-obvious how the code in the various helper traits in their foreign modules corresponds to folders on disk, even though I could in theory guess at the naming convention or grep/jump around the references in code and figure it out
My goal for this effort is to solve both these issues:
There should only be one obvious place to put a subfolder `build.sc` file related to a folder: in that folder. And there should only be one obvious place you go to find the code/config later: in the nearest `build.sc` file in the enclosing folders
Moving logic into subfolder `build.sc` files should be easy: create file and copy/paste. In my experience, that workflow in Bazel is a lot easier for non-experts to work with than the workflow in SBT where you need to subsequently register your new logic in the root `build.sbt`
I only started looking into this a few days ago, so it's still pretty rough. Hopefully it'll become more concrete as the implementation progresses
@lihaoyi Thanks for your thorough explanations.
I think a link from the sub-project to the root or parent project would be the best solution.
- […] `import $meta._` (`$meta.mill-build`). We could use an `import $parent._` (`$parent.^`) or a using directive `//> using mill.parent ..`.
- […] `.millignore`
The only downside of the sub-project idea as a whole is the lookup which we need to do with every Mill invocation.
Maybe we can settle on a single enabler option in the root module, so we only look for sub-projects if it is enabled (`//> using mill.scanSubProjects true`).
I do not think we should have a reference from each subproject to the root project. No other build tool does this. Unless we think we know something special that everything else doesn't, or we think we are smarter than the folks developing all the other build tools, there is no reason to innovate on this part of the design. SBT/Bazel/Maven/Gradle all work fine without such references, and everyone seems to have no problems understanding or navigating it.
As for maintaining references from the root project to subprojects, it basically comes down to a choice of following SBT/Bazel (no reference) or following Maven/Gradle (has reference). Either works; again SBT/Bazel/Maven/Gradle all work acceptably. The only question is then which precedent to follow.
In this case, I'd like to follow the Bazel/SBT precedent. Mill was always inspired by Bazel and SBT, and in this case both work the same: just drop a `BUILD.bazel`/`build.sbt` file into a subfolder and that's it. Bazel and SBT have many pain points, but the way you create build files in subfolders is not one of them. So again, I don't think we should over-analyze this: there is precedent, and the precedent works well enough, at least in my experience as a professional build-systems engineer for the last decade or so.
It's certainly possible to come up with a build tool inspired by Maven and Gradle instead of inspired by Bazel and SBT, but such a build tool would look very different from Mill.
In general, I think "XXX does this" really has a lot of weight: it means someone else came up with the design as well, that the design has been battle-tested against a wide variety of use cases, that the problems and pitfalls and edge-cases are already known. Coming up with our own bespoke design means a brand new journey of exciting edge cases and design issues, which is not something I'd like to sign up for unless that design is truly part of our unique value proposition. In this case, "how we reference or discover subprojects" is not, and I'd like to pick an off-the-shelf solution that's aligned with Mill's overall design philosophy and origins.
Let's go with a marker import in the root `build.sc`, to keep the "single source of truth" thing and to delimit the root build file from the others
I'm not saying we are smarter than other tools or tool developers. But we (or I, to speak for myself) also have issues with these other tools that others obviously don't have or don't consider bad enough.
So, why do I think we should have that link? When `mill` is run from a directory, it typically assumes that it's in a (main) project directory. It reads a `build.sc` and creates an `out` directory to dump a lot of stuff into. How can it detect whether it is a standalone, a sub- or a root project? With the current design, it can't.
It can try to search the directory tree upwards and downwards and can end up in various situations. It can find `build.sc` files in both directions, but there is no indicator whether it's meant to accumulate all found projects into one giant one.
Take the typical `git worktree add temp` example, where you check out a version of the current project into a sub-directory. Both the newly checked-out tree and the existing project will start to act wild, since any wildcard target selector will start to behave differently. That's exactly the kind of situation I would avoid at any price, and one we definitely avoided in past Mill feature decisions, e.g. meta-builds.
Maybe it could work if we don't plan to provide any convenience for when `mill` is run from a child project. If we never traverse parent directories and always behave as if the current directory is the root level, then it will be enough to assume as you suggested. But if we want to provide convenience for when `mill` is run from a child dir (e.g. `mill clean` in dir `root/foo` is the same as `mill clean foo` in dir `root`), then we should IMHO make that explicit. This is especially important if it needs to resolve references to other outer modules but finds more than one `build.sc` file on the root-directory traversal.
E.g. in a work directory with various independent projects, I sometimes have a `build.sc` for some maintenance task. Now, with this new feature, any `mill` invocation in any of these projects will assume it's part of that outer project.
If the concern is about running from subfolders, it seems the issue is largely about being able to distinguish a root build file from nested build files, so any logic can walk the folder tree upwards and find the correct root, and also ignore unwanted things in subfolders when traversing the folder tree downwards to discover nested build files (esp. other Mill builds in subfolders)
There are ways to do that other than via explicit links. Some ideas:
1. We use a magic import to demarcate the root `build.sc` file, say `import $subprojects._` (or whatever other syntax we come up with). The nearest enclosing folder with the magic import is the root `build.sc`. This would require adding the magic import to current `build.sc` files in order to work
2. We use different filenames for the root and nested build files. E.g. maybe we force people to write `build.sc` for the root and `mill-package.sc` for nested ones? This should work out of the box with current `build.sc` files
3. We use some kind of marker file. We already have `.mill-version` and `.mill-jvm-opts`; combining those into `.mill-config` and saying you must create a `.mill-config` file to demarcate a top level Mill project is possible (basically what Bazel does with its `WORKSPACE` file)
We can probably come up with other options, but of these I think option (2) is pretty ergonomic and easy to understand, while still making it unambiguous what is a root Mill project vs what is a subfolder. When traversing upwards to find the enclosing project root, this will mark it unambiguously, and when traversing downwards to find nested build files it would correctly ignore nested projects with their own root `build.sc` files, without needing any code changes
The only risk is if someone has a top-level `mill-package.sc` file not wrapped in an enclosing `build.sc`, but that seems an obscure enough edge case not to cause issues
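To make the upward lookup concrete, here is a minimal sketch of how option (2) could locate the enclosing project root. It is not Mill's implementation: `ProjectRoot`, `find` and `selectorPrefix` are hypothetical names, and the only assumption is that the root is marked by a file called `build.sc` while nested files use some other name.

```scala
import java.nio.file.{Files, Path}
import scala.annotation.tailrec
import scala.jdk.CollectionConverters._

// Hypothetical helper, for illustration only.
object ProjectRoot {
  // Walk towards the filesystem root; the first folder holding `rootFile` wins.
  @tailrec
  private def loop(dir: Path, rootFile: String): Option[Path] =
    if (dir == null) None // ran past the filesystem root without finding a build
    else if (Files.isRegularFile(dir.resolve(rootFile))) Some(dir)
    else loop(dir.getParent, rootFile)

  // Nearest enclosing folder (including `start` itself) that contains `rootFile`.
  def find(start: Path, rootFile: String = "build.sc"): Option[Path] =
    loop(start.toAbsolutePath.normalize, rootFile)

  // Dotted prefix for the folder `cwd` relative to `root`, e.g. "foo.bar".
  // Returns "" when `cwd` is the root itself.
  def selectorPrefix(root: Path, cwd: Path): String =
    root.relativize(cwd.toAbsolutePath.normalize)
      .iterator().asScala
      .map(_.toString)
      .filter(_.nonEmpty)
      .mkString(".")
}
```

With such a prefix, running a command inside `root/foo` could in principle be rewritten to the equivalent `foo`-prefixed command at the root. Because nested files have a different name, no file needs to be parsed to decide where the root is.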
I think the `module.sc` is a nice compromise. It means `build.sc` needs to strictly reside above each sub-module, which is a limitation. On the other hand, we don't need to parse anything before we can detect that we are in a sub-project. So it can be really fast.
Should we explicitly enable sub-module scanning in root projects? Without, we cannot avoid unnecessary lookup scans with each run. I think we should.
Yes I think explicitly enabling it via a flag makes sense. Similar to what we do for the meta build, `import $submodules._` or something
Still chasing down some residual issues, but most tests are passing and this is probably ready for a first round of review
I haven't implemented the "run mill command in subfolder of Mill project" logic in this PR; I'm leaving that to a follow up since this PR is already large enough
About the deep discovery of `module.sc` files in sub-directories: the suggested approach is to search recursively in all sub-directories. It can therefore be hard for a developer unfamiliar with a project to get a quick idea of the project layout.
What about just considering one level of sub-directories? This would make discovery for Mill as well as a human faster. I think most projects would already match that limited layout.
We could also discover sub-directories of detected sub-modules. Due to the uniform mapping of sub-directories to sub-modules, we could use an empty `module.sc` on the sub-path if we want the discovery to go on to deeper levels. As a consequence, each deeply nested sub-module leaves a trace (of `module.sc` files) in the directory tree traversal.
@lefou what you suggest sounds reasonable. It would also prevent "accidental" picking up of random `module.sc` files deep inside the filesystem hierarchy
On the other hand, most platforms I'm aware of don't do that. Bazel/Buck/Pants allow arbitrarily deeply placed `BUILD` files that are picked up by recursive search. Python used to require `__init__.py` files in every enclosing folder, and eventually got rid of them in Python 3.3+ because they were too error prone and boilerplatey. JVM languages typically do a recursive search on the source folders and pick up everything inside without needing a `package-info.java` or `package.scala` file in each enclosing folder for discovery to count.
This makes me lean towards doing deep recursive discovery, rather than requiring empty `module.sc`s as markers
@lolgab what do you think of restricting target and module definitions to `build.sc`/`module.sc` files only, and not allowing them to be defined in random `foo.sc` scripts?
That would significantly simplify the desugaring, leaving `foo.sc` scripts as pure helpers without needing to provide all the foreign module metadata etc. as part of the Scala wrapper.
It would also encourage people to put their targets and modules in `build.sc`/`module.sc` files, where they can be properly referenced from the command line, rather than in random `foo.sc` scripts where they are only reference-able from code but not from CLI
You implemented the status quo of foreign modules, so you probably use them the most and are most familiar with their use cases
For me, foreign modules never proved to be usable and I'm also not aware of any project that uses them. I'm all for removing them. Instead, we should think about a lightweight way to use/aggregate independent projects in a way, that also re-uses their cached target results.
@lihaoyi

> @lefou what you suggest sounds reasonable. It would also prevent "accidental" picking up of random `module.sc` files deep inside the filesystem hierarchy
>
> On the other hand, most platforms I'm aware of don't do that. Bazel/Buck/Pants allow arbitrarily deeply placed `BUILD` files that are picked up by recursive search.

I don't think our main aim is to compete for existing projects using these build tools, since those tools target rather large projects. If we try to be an alternative to Maven/Gradle/sbt/ant, we don't clash with any expectations if we don't recursively look up modules. But especially in large projects, non-recursive scanning can make usage faster.

> Python used to require `__init__.py` files in every enclosing folder, and eventually got rid of them in Python 3.3+ because they were too error prone and boilerplatey.

Isn't the `__init__.py` file a non-empty parsed file, whereas we would support empty `module.sc` files, which should not cause any additional boilerplate beyond their marker function? If non-empty, it will typically also contain module definitions. So I don't think this is that strong an argument.

> JVM languages typically do a recursive search on the source folders and pick up everything inside without needing a `package-info.java` or `package.scala` file in each enclosing folder for discovery to count.

We do support real JVM behavior below the `src` directory, once you enable the meta-build. The `build.sc`/`module.sc` files are meant to be processed by Mill and understood by the build maintainer, which is a hard job for all build tools to date, and only a handful of developers like to know/understand its details. We should make it as easy as possible to understand a project layout (since it is important to enable all developers to own the build).
Since projects can contain deeply nested directories, e.g. with test data and also sources and resources, by omitting these from traversal by default, we may save a lot of filesystem IO.

> This makes me lean towards doing deep recursive discovery, rather than requiring empty `module.sc`s as markers

Lastly, we could let the user decide. When we introduce a flag to enable scanning, instead of on/off, we could make it a `no/flat/recursive` choice. It's not my preferred solution though.
`module.sc` as a marker is more or less identical to `__init__.py`: a place where you can put stuff, and a marker that something is a folder that might contain other things you care about. People created empty `__init__.py` files all the time, similar to how people may need to create empty `module.sc` files
The point about avoiding walking the entire filesystem every time is valid. Unlike walking `src` folders for Java or Scala files, Mill would have to walk your entire project directory recursively. That's a lot of walking, and if there are some random folders full of small files (e.g. `node_modules/`) that walking could be very expensive. Making it gitignore-aware would help somewhat, but would add a bunch of complexity over simply not walking the entire filesystem tree.
About removing current foreign modules support, I think it's better to release this new feature in Mill `0.11` since it's binary compatible, and deprecate usages of foreign modules (log a warning when importing a foreign module with `import $file.`); otherwise we would need to wait for `0.12` to release something not really tested in the wild, which would then need Mill `0.13` to be changed.
About recursive search:
I think it's better to limit the search to subdirectories of directories containing a `module.sc` file.
I don't think projects have very deep module structures, one I see very often is:
./modules/module1
./modules/module2
./modules/module3
where they would only need to create an extra `module.sc` in `modules`.
On the other hand, source code tends to have very deep directory structures, like:
/module/test/src/com/company/project/foo/bar
with hundreds of directories, and it seems wasteful and weird to automatically add to the build a file in a folder so deep in the file-system.
I haven't used Python a lot, but it seems to me that `__init__.py` files are not in a one-to-one relationship with Mill modules, since you can have many, many Scala packages in the same module. Having to add a file for every Scala package in a project would be very annoying, but one per directory in a deep module structure, not so much. Moreover, I would expect most people to have at most 2 levels of modules.
One thing I would ask for this feature, is to implement some guardrails to avoid name clashes.
When `object foo` in `build.sc` and `./foo/module.sc` are defined at the same time, I would expect Mill to fail with an error message mentioning the clash.
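A guardrail of this kind could look roughly like the following sketch. It is not taken from the PR: `NameClashCheck` and its inputs are hypothetical, and it assumes the dotted names of modules declared in the root `build.sc` and the relative folders containing a `module.sc` have already been collected by some earlier step.

```scala
// Hypothetical check, for illustration only.
object NameClashCheck {
  // rootModules:   dotted names of modules declared in the root build.sc, e.g. Set("foo")
  // moduleFolders: folders (relative to the root, '/'-separated) that contain a module.sc
  def check(rootModules: Set[String], moduleFolders: Seq[String]): Either[String, Unit] = {
    val clashing = moduleFolders
      .map(_.split('/').filter(_.nonEmpty).mkString(".")) // "foo/bar" -> "foo.bar"
      .filter(rootModules.contains)
      .distinct
      .sorted

    if (clashing.isEmpty) Right(())
    else Left(
      clashing
        .map(n => s"module `$n` is defined both in build.sc and in ${n.replace('.', '/')}/module.sc")
        .mkString("; ")
    )
  }
}
```

Returning an `Either` keeps the sketch free of side effects; a real build tool would presumably turn the `Left` into a build failure that names both locations.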
> About removing current foreign modules support, I think it's better to release this new feature in Mill `0.11` since it's binary compatible, and deprecate usages of foreign modules (log a warning when importing a foreign module with `import $file.`, otherwise we would need to wait for `0.12` to release something not really tested in the wild which would need Mill `0.13` to be changed.

I don't think it's realistic to release this PR as part of 0.11.x. It's binary compatible according to mima, but there's a bunch of ways in which the generated code can result in incompatibility. Furthermore, if we want to clean up the existing semantics further as part of the breakage, that will involve further changes in the desugaring.
I don't see a path forward for testing out the new feature in a backwards compatible way when the whole point of it is to overhaul the existing semantics. Realistically, this will need to go into a 0.12.x, and any testing will need to be done on RCs or milestones. Dogfooding the 0.12.0-RCs on the `com-lihaoyi` projects and other projects we own should give us confidence that things generally still work
We use the same version for Mill releases and its supportive high-level abstractions like `scalalib`. Technically, if we don't change scalalib, we could easily cut an (experimental/preview/unstable) `0.12` line, which is still binary compatible, so that at least for a while, we could keep using our existing plugin ecosystem, with Mill itself getting new and dropping old features that don't affect the API to configure modules.
@lihaoyi @lolgab
What do you think of creating a feature-preview branch, where we merge some (binary-compatible) PRs like this one for early testing?
I think, nothing beats an easily accessible development release, with which we can test features on real projects.
Works for me. We can just create a 12.x branch and merge into that
I updated the PR to only traverse folders with `module.sc` files, and to remove the ability to define modules and targets in other `foo.sc` files
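The resulting traversal rule can be sketched as follows. This is an illustration rather than the code in the PR: `DiscoverModules` is a hypothetical name, and the sketch assumes the two rules stated in this thread and in the PR description, namely that a subfolder is only entered if it contains a `module.sc`, and that a subfolder with its own `build.sc` is treated as a separate project and skipped.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical sketch, for illustration only.
object DiscoverModules {
  // Direct sub-directories of `dir`, in a stable order.
  private def children(dir: Path): Seq[Path] = {
    val stream = Files.list(dir)
    try stream.iterator().asScala.filter(p => Files.isDirectory(p)).toList.sortBy(_.toString)
    finally stream.close()
  }

  // Folders (relative to `root`) that contribute a nested module.sc to the build.
  def discover(root: Path): Seq[Path] = {
    def walk(dir: Path): Seq[Path] =
      children(dir).flatMap { sub =>
        if (Files.exists(sub.resolve("build.sc"))) Nil // root of another project: skip
        else if (Files.exists(sub.resolve("module.sc"))) sub +: walk(sub)
        else Nil // no marker file: do not descend any further
      }
    walk(root).map(p => root.relativize(p))
  }
}
```

Under these rules a layout like `modules/module1/module.sc` is only found if `modules/module.sc` exists (it may be empty), while deep source trees and folders such as `node_modules/` are never entered.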
Let's review this PR as is, but leave it un-merged for now. When we have a number of other breaking PRs ready, we can merge them into master, cut 0.12.0-RC releases for testing, and create a 0.11.x maintenance branch for any PRs that still need to target 0.11.9 and above
It seems that `mill resolve` is not printing the modules defined in `module.sc` files. Is this still to be implemented?
Also, can we add a `# TODOs` section in the PR and add the check for double definition (`foo/module.sc` and `object foo` in `build.sc`)?
@lolgab probably a missing implementation, will take a look
I think in the context of this feature there should be some guidelines and/or restrictions on which magic imports work / should be used in which locations. E.g. an `import $ivy` worked in `build.sc` and imported files, so I assume it also works for `module.sc`. Do we detect and report conflicts? (If not, we might just see some coursier error message without any hint where it originates from.) An `import $meta` should only work from a `build.sc`, but do we ensure it?
This PR implements support for per-subfolder `build.sc` files, named `module.sc`. This allows large builds to be split up into multiple `build.sc` files each relevant to the sub-folder that they are placed in, rather than having a multi-thousand-line `build.sc` in the root configuring Mill for the entire codebase.
Semantics
- The `build.sc` in the project root and all nested `module.sc` files within the Mill directory tree are recursively discovered (TODO: ignoring files listed in `.millignore`) and compiled together
- We ignore any subfolders that have their own `build.sc` file, indicating that they are the root of their own project and not part of the enclosing folder.
- Each `foo/bar/qux/module.sc` file is compiled into a `millbuild.foo.bar.qux` `package object`, with the root `build.sc` being compiled into a `millbuild` `package object` (rather than a plain `object` as in the status quo)
- An `object blah extends Module` within each `foo/bar/qux/module.sc` file can be referenced in code via `foo.bar.qux.blah`, or referenced from the command line via `foo.bar.qux.blah`
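The folder-to-name mapping described above can be illustrated with a small, self-contained sketch. `PackageNaming`, `packageFor` and `selectorFor` are hypothetical helpers written for this illustration only; the `millbuild` prefix and the dotted selector form are the ones stated in the description, and the paths are assumed to be relative to the project root with `/` separators.

```scala
// Hypothetical helpers, for illustration only.
object PackageNaming {
  // Folder segments of a build file path, e.g. "foo/bar/qux/module.sc" -> Seq("foo", "bar", "qux").
  private def folders(relativeFile: String): Seq[String] =
    relativeFile.split('/').filter(_.nonEmpty).dropRight(1).toSeq

  // Package that the file's `package object` would live in.
  def packageFor(relativeFile: String): String =
    ("millbuild" +: folders(relativeFile)).mkString(".")

  // Selector for a module `moduleName` defined in that file, usable in code and on the command line.
  def selectorFor(relativeFile: String, moduleName: String): String =
    (folders(relativeFile) :+ moduleName).mkString(".")
}
```

For example, `packageFor("foo/bar/qux/module.sc")` yields `millbuild.foo.bar.qux`, and `selectorFor("foo/bar/qux/module.sc", "blah")` yields `foo.bar.qux.blah`, matching the uniform hierarchy discussed in the Design section.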
Design
Uniform vs Non-uniform hierarchy
One design decision here is whether a `module.sc` file in a subfolder `foo/bar/` containing `object qux{ def baz }` would have its targets referenced via `foo.bar.qux.baz` syntax, or via some alternative, e.g. `foo/bar/qux.baz`.
A non-uniform hierarchy `foo/bar/qux.baz` would be similar to how Bazel treats folders vs. targets non-uniformly (`foo/bar:qux-baz`), and also similar to how external modules in Mill are handled, e.g. `mill.idea.GenIdea/idea`, as well as existing foreign modules. However, it introduces significant user-facing complexity:
- […] `foo/bar/qux.baz` vs `foo/bar.qux.baz` or `foo/bar/qux/baz`?
- […] `module.sc` files rather than just the top-level one e.g. `__.compile`?
- […] `module.sc` modules and targets in Scala code as well?
Bazel has significant complexity to handle these cases, e.g. query via `...` vs `:all` vs `*`. It works, but it does complicate the user-facing semantics.
The alternative of a uniform hierarchy also has downsides:
- […] `foo.bar.qux.baz` to the `build.sc` or `module.sc` file in which it is defined?
- […] `build.sc` and in a nested `module.sc`, what happens?
I decided to go with a uniform hierarchy where everything, both in top-level `build.sc` and in nested `module.sc`, ends up squashed together in a single uniform `foo.bar.qux.baz` hierarchy.
Package Objects
The goal of this is to try and make modules defined in nested `module.sc` files "look the same" as modules defined in the root `build.sc`. There are two possible approaches:
1. Splice the source code of the various nested `module.sc` files into the top-level `object build`. This is possible, but very complex and error prone. Especially when it comes to reporting proper error locations in stack traces (filename/line number), this will likely require a custom compiler plugin similar to the `LineNumberPlugin` we have today
2. Convert the `object`s into `package object`s, such that the module tree defined in the root `build.sc` becomes synonymous with the JVM package tree. While the `package object`s will cause the compiler to synthesize `object package { ... }` wrappers, that is mostly hidden from the end user.
I decided to go with (2) because it seemed much simpler, making use of existing language features rather than trying to force the behavior we want using compiler hackery. Although `package object`s may go away at some point in Scala 3, they should be straightforward to replace with explicit `export foo.*` statements when that time comes.
Existing Foreign Modules
Mill already supports existing `foo.sc` files which support targets and modules being defined within them, but does not support referencing them from the command line.
I have removed the ability to define targets and modules in random `foo.sc` files. We should encourage people to put things in `module.sc`, since that would allow the user to co-locate the build logic within the folder containing the files it is related to, rather than as a bunch of loose `foo.sc` scripts. Removing support for modules/targets in `foo.sc` files greatly simplifies the desugaring of these scripts, and since we are already making a breaking change by overhauling how per-folder `module.sc` files work we might as well bundle this additional breakage together (rather than making multiple breaking changes in series)
`build.sc`/`module.sc` file discovery
For this implementation, I chose to make `module.sc` files discovered automatically by traversing the filesystem: we recursively walk the subfolders of the root `build.sc` project and look for any files named `module.sc`. We only traverse folders with `module.sc` files to avoid having to traverse the entire filesystem structure every time. Empty `module.sc` files can be used as necessary to allow `module.sc` files to be placed deeper in the folder tree
This matches the behavior of Bazel and SBT in discovering their `BUILD`/`build.sbt` files, and notably goes against Maven/Gradle which require submodules/subprojects to be declared in the top level build config.
This design has the following characteristics:
- In future, if we wish to allow `mill` invocations from within a subfolder, the distinction between `build.sc` and `module.sc` allows us to easily find the "enclosing" project root.
- It ensures that any folders containing `build.sc`/`module.sc` files that accidentally get included within a Mill build do not end up getting picked up and confusing the top-level build, because we automatically skip any subfolders containing `build.sc`
- Similarly, it ensures that adding a `build.sc` file "enclosing" an existing project does not affect Mill invocations in the inner project, because we only walk to the nearest enclosing `build.sc` file to find the project root
- We do not automatically traverse deeply into sub-folders to discover `module.sc` files, which means that it should be almost impossible to accidentally pick up `module.sc` files that happen to be on the filesystem but you did not intend to include in the build
This mechanism should do the right thing 99.9% of the time. For the last 0.1% where it doesn't do the right thing, we can add a `.millignore`/`.config/millignore` file to support ignoring things we don't want picked up, but I expect that should be a very rare edge case
Compatibility
This change is binary compatible, but the change in the `.sc` file desugaring is invasive enough that we should consider it a breaking change
Pull request: https://github.com/com-lihaoyi/mill/pull/3213