Open sequencer opened 1 year ago
We're open for discussion and development to support such scenarios. We discussed some aspects of it in the past and I'd like to reference it here FYI:
Although `mill.api.PathRef` is used in targets, it's not necessarily limited to them. Also, it is not limited to paths below `T.workspace`, so simply changing to a relative path may not work. I already wrote up an idea to let `PathRef`s support relative paths, which may work (see https://github.com/com-lihaoyi/mill/discussions/1400#discussioncomment-978572), but that was in the context of actively distributing work to other nodes. This is a bit different from your use case.
Yes, I checked that information before submitting this issue. Let me know what I can contribute to make this happen.
Actually, regarding build script changes triggering re-evaluation, I wonder if it's possible to manually give each Module a version to reduce this behavior, which would shift the burden of deciding when to rebuild onto users.
Maybe another idea:
When walking the directory subtree rooted at `p` (the argument `path` in `PathRef.apply`), use each file's relative path to `p` to update the digest.
The reasoning is that if two directory trees have the same shape and the same content, we should consider them the "same" directory in the sense of caching. In this implementation, moving the entire subtree `p` to another location won't change its hash.
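To make the idea above concrete, here is a minimal sketch that uses only the JDK, not Mill's actual `PathRef` code. The object name `RelativeDigest` and the choice of SHA-256 are assumptions made for illustration. It feeds each entry's path relative to the root, plus the file content, into the digest, so the absolute location never contributes.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import java.security.MessageDigest
import scala.jdk.CollectionConverters._

// Hypothetical helper, not part of Mill.
object RelativeDigest {
  /** Hex digest of the tree rooted at `root`, built from relative paths and contents only. */
  def digest(root: Path): String = {
    val md = MessageDigest.getInstance("SHA-256")
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        // Relative path with "/" separators, so the result is independent of `root` itself.
        .map(p => (root.relativize(p).iterator().asScala.mkString("/"), p))
        .toSeq
        .sortBy(_._1) // Files.walk gives no ordering guarantee, so sort for stability
        .foreach { case (rel, p) =>
          md.update(rel.getBytes(UTF_8))
          md.update(0.toByte) // separator between name and content
          if (Files.isRegularFile(p)) md.update(Files.readAllBytes(p))
        }
    } finally stream.close()
    md.digest().map(b => "%02x".format(b)).mkString
  }
}
```

Two copies of the same tree in different locations get the same digest, while any change to a file name or file content changes it. A real implementation would also need to decide how to treat permissions and symlinks, which this sketch ignores.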
Another (maybe unrelated) question: is it possible to get the full reason why a target is re-evaluated? I thought caching should work when the input is the same, but in some complex cases I have no way of knowing why the input changed.
Well, currently `PathRef` is a kind of black box. Once a single bit differs, it's completely different. We could probably model that differently (e.g. as a tree-like structure, so we also detect changes in parts of it or can re-use sub-parts), but I think it's probably not efficient enough, both speed- and memory-wise.
So splitting it structurally down to the smallest part might be overkill. But as @sequencer suggested, just keeping the info stable relative to `T.workspace` (if possible) might already be good enough, or at least a good start. We could also try to keep the path info separated from the content hash, so that it is possible to prove that two `PathRef`s have equal content even when the containing path is different. That's probably what @CircuitCoder meant? In Mill, we are interested in both pieces of information, but they don't necessarily need to be so closely coupled.
I think some experiments or creating a POC wouldn't hurt and could be fun. Go for it!
About the other issue, tracking the origin of a change: this should probably be discussed separately. But once we invent some tracking, the whole user experience might change, and I fear not for the better. It might result in a more complicated API for Input/Output types, for example. That's just a feeling though. Currently, hash-based change detection isn't easy to track (although the Mill evaluator knows exactly which tasks were out of date, see `out/profile.json`), but it's damn fast and easy to apply.
I'm trying to get minimal implementations of both approaches tested:
- Adding `PathRef.rel(base: os.Path, path: os.RelPath, quick: Boolean)` (https://github.com/CircuitCoder/mill-scala/commit/37cafd741639d3cc218d91e4edb6c0a89ed8ec01). Also, the default `ScalaModule.allSourceFiles` was changed to use `PathRef.rel` for testing purposes. (CI)
- Changing `PathRef.apply` to calculate the digest with relative paths (https://github.com/CircuitCoder/mill-scala/commit/986fa0c2674e4fab32d6f6ec4ebccc75d3c61b58, CI)
It seems that the CI still needs a few hours to complete. Meanwhile, I'll try to check if these changes allow us to move cache directories.
UPDATE: a test failed because the repo name is different. Unfortunately, I already have a repo named mill, so cannot fork the repo using that name. I'll try to fork it into another organization and rerun CI.
No build or test job should depend on the repo name. Only jobs that do some kind of release or publishing are bound to the repo name. You can also open a draft PR, then the appropriate CI jobs will run on the Mill repo.
Maybe it's necessary to have some predefined, named path anchors from which we can start digesting. `workspace` might be one of them. `coursier-cache` could be another one. (In general, I really like the idea of having an easy way to decouple the `out` directory from the project directory. E.g. in `autotools/make` land, you can build from another location. If we can accomplish something equivalent in Mill, we can easily build from read-only source trees or put all build output into some RAM-backed storage. Just thinking.) These anchors are probably a bit soft, with Mill deciding their real paths at startup time. It may also be necessary to ditch the `quick` optimization of `PathRef`, which we use primarily for coursier dependencies.
What I have been dreaming of is that `PathRef` only stores the hash of a file, plus a new API to access the file system.
We probably need to try to resolve the `PathRef` against some predefined list of directories, so we can store that virtual name and a sub-path instead. An algorithm like this:
// pseudo code
// defined at Mill startup time
val knownLocs: Seq[(PathName, Path)] = Seq(
  "out" -> outPath,
  "workspace" -> T.workspace,
  "coursier-cache" -> cachePath
)
// called at PathRef creation time
def createPathRef(path: Path): PathRef = {
  knownLocs.find { case (name, prefix) => path isSubPathOf prefix } match {
    case Some((name, prefix)) => // create a PathRef with sub-path and virtual base path
    case None => // create a non-portable PathRef with absolute path
  }
}
`PathRef` comparison is then still based on the content hash and the resolved path.
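The pseudo code above could be turned into something runnable along these lines. This is only a sketch on top of `java.nio.file`, not Mill code. The names `Anchors`, `Anchored` and `Absolute` are made up for illustration, and picking the longest matching anchor (so that `out` wins over `workspace` when it is nested inside it) is an assumption rather than an agreed design.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical types, not part of Mill.
object Anchors {
  sealed trait Ref
  /** Portable form: a named anchor plus a path relative to it. */
  final case class Anchored(anchor: String, sub: Path) extends Ref
  /** Non-portable fallback for paths outside every known anchor. */
  final case class Absolute(path: Path) extends Ref

  /** Called at reference creation time: replace a known prefix by its anchor name. */
  def relativize(known: Seq[(String, Path)], path: Path): Ref = {
    // Prefer the most specific (longest) anchor that contains `path`.
    val best = known
      .filter { case (_, base) => path.startsWith(base) }
      .maxByOption { case (_, base) => base.getNameCount }
    best match {
      case Some((name, base)) => Anchored(name, base.relativize(path))
      case None               => Absolute(path)
    }
  }

  /** Called when the real file is needed: look the anchor up in the current environment. */
  def resolve(known: Seq[(String, Path)], ref: Ref): Option[Path] = ref match {
    case Anchored(name, sub) =>
      known.collectFirst { case (n, base) if n == name => base.resolve(sub) }
    case Absolute(p) => Some(p)
  }
}
```

A reference created on one machine with `relativize` can be resolved on another machine with a different anchor list, as long as the anchor names match. Paths outside all anchors stay absolute and therefore non-portable.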
Just some mockup
$ mill show main.jar
"vref:843cb117:out:main/jar.dest/out.jar"
$ mill show main.sources
[
"vref:2318a653:workspace:main/src"
]
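For illustration, the string form in the mockup could be rendered and parsed roughly as follows. The `vref:<sig>:<anchor>:<sub-path>` layout is taken from the mockup, while the `VRef` object and the `VirtualRef` case class are hypothetical names, and nothing here reflects an actual Mill format.

```scala
// Hypothetical model of the mockup's string format, not part of Mill.
object VRef {
  final case class VirtualRef(sig: String, anchor: String, subPath: String) {
    /** Render as vref:<sig>:<anchor>:<sub-path>. */
    def render: String = s"vref:$sig:$anchor:$subPath"
  }

  /** Parse the rendered form; the limit of 4 keeps any ':' inside the sub-path intact. */
  def parse(s: String): Option[VirtualRef] = s.split(":", 4) match {
    case Array("vref", sig, anchor, sub) => Some(VirtualRef(sig, anchor, sub))
    case _                               => None
  }
}
```

Splitting with a limit means only the first three colons act as separators, so a sub-path is returned unchanged even if it contains a colon itself.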
I experimented with this a bit. In my first iteration, I made `PathRef.sig` independent of the associated path. I also implemented JSON pickling which takes some context into account, so the actual paths are relative to the specified context. Unfortunately, this alone will not automatically make the Mill cache distributable, but at least I learned some things.
I will open a PR with the decoupling of `PathRef.sig` from the path soon, as it is probably a nice feature to have without any disadvantages (I think). It will not contain the context path feature though. (Edit: this is the PR: https://github.com/com-lihaoyi/mill/pull/2106)
In addition to a `PathRef`, which transports the information of a path and a content signature, we probably also need more types. One for a content tree (the `.sig` part of `PathRef`), but whose `hashCode` does not change for different `path`s. With such a type we could refer to files/directories where we are only interested in the content and potentially the relative file names. This might then be the proper input for compilers. We need a good name though, e.g. `TreeRef` or `ContentRef`.
Additionally, we need a type to transport a path disregarding the content. This is almost a thin wrapper around `os.Path`, but we want to make it context-aware. It's essentially a context plus a relative path. This one is needed to refer to actual files, but swap the context depending on the actual runtime location. One usage might be the compiler analysis file of Zinc, which we currently hold in `CompileResult.analysisFile`. We need a good name too. `ContextPath`?
The current `PathRef` could then be constructed from these two, as it depends on the actual path (which should be aware of a context) and the content signature.
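As a rough sketch of how these three pieces might fit together: `ContentRef` and `ContextPath` use the names proposed above, but their fields, the `SketchPathRef` stand-in (deliberately not called `PathRef` to avoid confusion with `mill.api.PathRef`), and the use of `java.nio.file.Path` instead of `os.Path` are all assumptions made to keep the example self-contained.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical types sketching the proposal, not Mill's API.
object RefTypes {
  /** Content-only signature: equal for equal content, wherever it lives. */
  final case class ContentRef(sig: String)

  /** A named context plus a relative path; the context is bound at runtime. */
  final case class ContextPath(context: String, sub: Path) {
    def resolve(contexts: Map[String, Path]): Option[Path] =
      contexts.get(context).map(_.resolve(sub))
  }

  /** Path and content signature together, composed from the two types above. */
  final case class SketchPathRef(path: ContextPath, content: ContentRef)
}
```

With this split, a compile-like task could depend only on `ContentRef` values and stay cacheable across machines, while tasks that must touch real files resolve a `ContextPath` against whatever contexts the current machine defines.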
To make the Mill cache distributable, we also need to refactor some targets, e.g. a compile target should not depend on targets that return `PathRef`s but on ones that return `ContentRef`s instead. The same goes for the `CompileResult`. We probably need to review the whole `scalalib` architecture from this perspective.
`SubPathRef` might be a good name (for `ContentRef`).
Are there cases where paths in the `.json` files in `out` are not subpaths of `T.workspace`?
Could we process all the `os.Path`s and assume them to be relative to `T.workspace`? Probably not, asking just to validate this idea.
Currently, `out` is hard-coded, so your assumption is valid. But I'm planning to change that, or at least make it overridable via config or CLI, to support more use cases like caching, memory-backed storage, or read-only project directories.
@lolgab To your second question: we already store paths that are located outside of `T.workspace`, e.g. coursier and ivy artifacts. Making these relative to `T.workspace` has more potential to cause harm than good.
`PathRef.sig` also stores information about permissions. As these are structured differently on Windows vs. Unix-based systems, we probably won't be able to share caches between different OSes without further changes.
I'm using Mill to manage a relatively big system. One usage is forking the entire workspace, including `out`, sending it to other machines, executing something else there, then forking and sending again and again. This is used for distributing work to different nodes. However, due to the absolute path in `PathRef`, `out` cannot be directly packaged and sent away. Thus I'm proposing to make the `path` in `PathRef` relative to `T.ctx.workspace` to reduce the dependency on the environment outside of the build context.